libkse -> libpthreads

Tue Apr 22 14:51:19 PDT 2003

I think your keyboard needs a rate limiter. ;-)

So the point of all this is that the vm performed better with processes
because it was better able to determine the resident set of each?

On Mon, 21 Apr 2003, Terry Lambert wrote:

> Jeff Roberson wrote:
> > On Mon, 21 Apr 2003, Terry Lambert wrote:
> > > It wouldn't.  The main issue as far as performance went, and why
> > > we (Novell USG) used processes instead of SVR4 threads, and did
> > > file descriptor table sharing, and shared client context data in
> > > a shared memory segment (8-)) is that SVR4-derived systems without
> > > a unified VM and buffer cache do a lot of page thrashing.
> >
> > Please explain how using processes instead of threads improves page
> > thrashing.
>
> SVR4.0.2 (Dell UNIX) and SVR4.2 (UnixWare 2.x) have a separate VM
> and buffer cache.  Because of this, you tend to get page thrashing
> under any overload condition, even for nominally shared code pages,
> if you are doing a lot of data pages work.
>
> The problem is most easily seen in UnixWare 1.x, prior to the
> introduction of the "fixed" scheduling class.  In order to put
> memory pressure on the system, in a UI-visible way, run X Windows,
> and then perform a compilation on a large project.  When the ld
> program is run, it will mmap() all of the .o files, and then randomly
> access them in quick succession, in order to perform symbol resolution
> for the large project.  When this happens, the UI will "lock up", and
> you will effectively lose the ability to move the mouse.  As you
> attempt to move the mouse, the mouse will not move, and it will trigger
> paging in of the X server, and then paging in of the application, both
> of whose code pages were forced out of core (and will be forced back
> out of core again, immediately) by the ld's access to data pages.  So
> once every one or two seconds or so, it will move, generate expose
> events, and lock up again.
>
> The net effect is the system appears to lock up, either completely,
> or for multiple seconds at a time.
>
> The UnixWare 2.x/SVR4.2 solution to this problem is to introduce a
> "fixed" scheduling class, so that a fixed percentage of the CPU time
> is dedicated to processing the X server code.  This doesn't stop it
> from being paged out, but it does provide a fixed amount of CPU to
> spend paging it back in, and then doing some processing on top of
> that (basically, I/O is accounted on a per process basis).  This
> was basically a lazy way of introducing a "precious" working set low
> watermark for the X server pages.  Much better to have established
> a per-file quota for the .o files themselves, so ld might thrash,
> but the only program that would get hurt by it would be ld.
>
>
> Now consider the specific case we were dealing with, which is the
> NetWare for UNIX (formerly Portable NetWare) problem.  In this case,
> suppose we were to use the SVR4/Solaris threading (this was after the
> merge of the Solaris and SVR4.2 code bases, as part of a joint
> project between Sun and USL, in which Sun got VM and FS code, and
> USL got the threads and some other code, in trade).
>
> The implementation paradigm for this code was as "anonymous work
> to do engines" -- essentially, the server consisted of a number of
> specific tasks (an intention mode transaction based long manager,
> a monitoring daemon, some miscellaneous tasks), and a number of
> identical tasks which implemented "work to do engines" -- all the
> latter tasks were identical, in that the client context for any
> client session was known to all of them.  As a result, any of these
> tasks could service any request.  Since the NCP packets are, with
> the sole exception of delayed lock grants, which are reported async
> via a covert channel, request/response in architecture, the number
> of concurrent client requests is limited by the number of tasks that
> are available to service the requests.  Our intent was to be able to
> service a large number of clients.
>
> Now consider that, while maximum concurrency was an issue, so was
> locality of file data sets, and locality of code pages, with the
> two contending for the limited available divided memory pools that
> were contended between the VM and buffer cache (effectively, there
> was a total set size, with a reserve held back for each type of
> pages, and the remaining pages were contended).
>
> Use of the "fixed" scheduling class was not an option.
>
> Using threads would not allow preferential scheduling between the
> tasks, neither would it have allowed sharing of all client context
> (though it would have allowed descriptor sharing) without some form
> of marshalling and locking.  This is because a client that did not
> believe the server was responding "fast enough" would repeat the
> request.  It was necessary to respond to these clients with a
> "server busy" message.  The reason behind this is that IPX is an
> unreliable datagram protocol, like UDP, and does not have a retry
> mechanism built into it.
>
> The upshot of this is that, with threads, the per process working
> set would be very large, and would be fragmented across the process
> address space.  This increased contention, well above what a process
> could withstand, without forcing VM pressure on the buffer cache.
> But the reason for the existence of the software was a *file* server,
> so this was unacceptable.
>
> By separating the address spaces, this pressure was reduced, and
> the amount of overall contention was reduced, thus reducing the
> buffer cache pressure from the processes.
>
> In the limit, with all processes fully utilized (i.e. a request
> backlog at the stream MUX), it equalled out in performance.  In
> the common case, however, not all tasks were utilized all the
> time, and it was possible to allow them to be paged out.
>
> On top of this, there were a number of speed benefits to System V
> shared memory for the client contexts; if you have read "The Magic
> Garden Explained", these should be pretty obvious.  John Dyson
> made a number of similar changes to the FreeBSD implementation for
> Oracle Corporation, when Oracle was using FreeBSD as the basis of
> its "Network Computer" server.  Basically, the pages are VM pages
> only, with no write-through to the backing store; in the SVR4 case,
> this would have been buffer cache pages, backed by swap, if this
> were anonymous memory instead (the kind you get in a threads heap).
>
> Thus we come to part 2, which is that we modified the streams MUX
> to ensure that requests were assigned to engines as they entered
> the stream mux FD with a write+read request in LIFO, rather than
> FIFO order.  By doing this, we were able to ensure that, most likely,
> the pages which were going to be requested were "in core" in the
> process making the request (performing default FIFO ordering would
> have resulted in a guarantee that the pages were not in core).  I
> dubbed this approach "hot engine scheduling".
>
> Attempting to use a similar approach in the threads case, besides
> the completely fragmented process memory that caused a much larger
> number of pages to need to be resident to do the same work, the
> MUX assignment of "work to do" would in fact have been effectively
> "random".
>
> Anything less than total utilization of the system was *worse* with
> random allocation of work units, and *better* with LIFO allocation.
>
>
> And that's why using processes instead of threads resulted in less
> page thrashing.
>
>
> There were, of course, other reasons for using processes, instead
> of threads, the primary among which was "better quantum utilization"
> (SVR4.2/UnixWare 2.0 did not support thread group affinity in the
> scheduler; as you have discovered, supporting that is NP-hard, unless
> you get tricky, and make migration explicit and initial selection
> intentional).
>
>
> -- Terry
>
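
For anyone following along, the LIFO vs. FIFO point in "part 2" can be
illustrated with a toy model.  This is only a sketch under stated
assumptions, not the NetWare for UNIX streams MUX: the function name,
the engine count, and the "one request at a time" load model are all
hypothetical.  It counts how many distinct engines end up servicing a
lightly loaded request stream, as a rough proxy for how many processes
need their pages in core.

```python
from collections import deque


def distinct_engines_used(num_engines, num_requests, policy):
    """Count how many distinct engines service a stream of requests.

    Hypothetical model: each request finishes before the next arrives
    (a lightly loaded server), so only one engine is busy at a time.
    """
    idle = deque(range(num_engines))  # engines waiting for work
    used = set()
    for _ in range(num_requests):
        # LIFO hands work to the most recently idled engine;
        # FIFO hands it to the engine that has been idle the longest.
        engine = idle.pop() if policy == "lifo" else idle.popleft()
        used.add(engine)
        idle.append(engine)  # request done; engine goes idle again
    return len(used)


if __name__ == "__main__":
    for policy in ("fifo", "lifo"):
        print(policy, distinct_engines_used(8, 100, policy))
```

Under this model FIFO cycles through all 8 engines while LIFO keeps
reusing a single one, so the remaining 7 could stay paged out.  Real
load overlaps requests, so the gap would be smaller than this, but the
direction is the same.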