libkse -> libpthreads

Terry Lambert tlambert2 at mindspring.com
Mon Apr 21 23:52:52 PDT 2003


Jeff Roberson wrote:
> On Mon, 21 Apr 2003, Terry Lambert wrote:
> > It wouldn't.  The main issue as far as performance went, and why
> > we (Novell USG) used processes instead of SVR4 threads, and did
> > file descriptor table sharing, and shared client context data in
> > a shared memory segment (8-)) is that SVR4-derived systems without
> > a unified VM and buffer cache do a lot of page thrashing.
> 
> Please explain how using processes instead of threads improves page
> thrashing.

SVR4.0.2 (Dell UNIX) and SVR4.2 (UnixWare 2.x) have a separate VM
and buffer cache.  Because of this, you tend to get page thrashing
under any overload condition, even for nominally shared code pages,
if you are doing a lot of data pages work.

The problem is most easily seen in UnixWare 1.x, prior to the
introduction of the "fixed" scheduling class.  In order to put
memory pressure on the system, in a UI-visible way, run X Windows,
and then perform a compilation on a large project.  When the ld
program is run, it will mmap() all of the .o files, and then randomly
access them in quick succession, in order to perform symbol resolution
for the large project.  When this happens, the UI will "lock up", and
you will effectively lose the ability to move the mouse.  As you
attempt to move the mouse, the mouse will not move, and it will trigger
paging in of the X server, and then paging in of the application, both
of whose code pages were forced out of core (and will be forced back
out of core again, immediately) by the ld's access to data pages.  So
once every one or two seconds or so, it will move, generate expose
events, and lock up again.

The net effect is that the system appears to lock up, either completely,
or for multiple seconds at a time.
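For illustration, here is a minimal C sketch of the access pattern
described above; it is not the SVR4 ld source, and the loop count,
checksum, and command-line interface are placeholders.  Map enough .o
files that the mappings exceed physical memory, and each random touch
becomes a potential major fault that evicts someone else's pages:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t nfiles = (size_t)(argc - 1);
    char **maps;
    size_t *lens;
    unsigned long sum = 0;

    if (nfiles == 0) {
        fprintf(stderr, "usage: %s file.o ...\n", argv[0]);
        return 1;
    }
    maps = calloc(nfiles, sizeof(char *));
    lens = calloc(nfiles, sizeof(size_t));

    /* Map every object file named on the command line. */
    for (size_t i = 0; i < nfiles; i++) {
        struct stat st;
        int fd = open(argv[i + 1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) { perror(argv[i + 1]); return 1; }
        if (st.st_size == 0) { close(fd); continue; }    /* nothing to map */
        maps[i] = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (maps[i] == MAP_FAILED) { perror("mmap"); return 1; }
        lens[i] = (size_t)st.st_size;
        close(fd);
    }

    /* "Symbol resolution": hop between files and offsets at random,
     * defeating any sequential-access advantage the pager might have. */
    for (long touch = 0; touch < 1000000; touch++) {
        size_t f = (size_t)rand() % nfiles;
        if (lens[f] == 0)
            continue;
        sum += (unsigned char)maps[f][(size_t)rand() % lens[f]];
    }
    printf("checksum %lu\n", sum);
    return 0;
}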

The UnixWare 2.x/SVR4.2 solution to this problem is to introduce a
"fixed" scheduling class, so that a fixed percentage of the CPU time
is dedicated to processing the X server code.  This doesn't stop it
from being paged out, but it does provide a fixed amount of CPU to
spend paging it back in, and then doing some processing on top of
that (basically, I/O is accounted on a per process basis).  This
was basically a lazy way of introducing a "precious" working set low
watermark for the X server pages.  Much better to have established
a per-file quota for the .o files themselves, so ld might thrash,
but the only program that would get hurt by it would be ld.


Now consider the specific case we were dealing with, which is the
NetWare for UNIX (formerly Portable NetWare) problem.  In this case,
the alternative would have been to use the SVR4/Solaris threading
(this was after the merge of the Solaris and SVR4.2 code bases, as
part of a joint project between Sun and USL, in which Sun got VM and
FS code, and USL got the threads and some other code, in trade).

The implementation paradigm for this code was as "anonymous work
to do engines" -- essentially, the server consisted of a number of
specific tasks (an intention mode, transaction-based lock manager,
a monitoring daemon, some miscellaneous tasks), and a number of
identical tasks which implemented "work to do engines" -- all the
latter tasks were identical, in that the client context for any
client session was known to all of them.  As a result, any of these
tasks could service any request.  Since NCP is request/response in
architecture (the sole exception being delayed lock grants, which
are reported asynchronously via a covert channel), the number of
concurrent client requests is limited by the number of tasks that
are available to service the requests.  Our intent was to be able to
service a large number of clients.
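A minimal sketch of that "anonymous work to do engine" shape follows;
it is not the NetWare for UNIX source.  A UDP socket on a placeholder
port (and the sockets API, rather than the TLI/STREAMS interface the
original used) stands in for the IPX stream MUX, and NENGINES is an
arbitrary pool size:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdio.h>
#include <unistd.h>

#define NENGINES 8

static void engine_loop(int sock)
{
    char req[1500], rsp[1500];

    for (;;) {
        struct sockaddr_in client;
        socklen_t clen = sizeof(client);
        ssize_t n = recvfrom(sock, req, sizeof(req), 0,
                             (struct sockaddr *)&client, &clen);
        if (n < 0)
            continue;
        /* Request/response: build the reply for this client and send it.
         * In the real server the per-client context lived in a System V
         * shared memory segment visible to every engine (see below). */
        int m = snprintf(rsp, sizeof(rsp), "engine %ld handled %zd bytes\n",
                         (long)getpid(), n);
        sendto(sock, rsp, (size_t)m, 0, (struct sockaddr *)&client, clen);
    }
}

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { 0 };

    addr.sin_family = AF_INET;
    addr.sin_port = htons(9999);              /* placeholder port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (sock < 0 || bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("socket/bind");
        return 1;
    }

    /* Every engine inherits the same descriptor, so any engine can pick
     * up any client's datagram; concurrency is bounded only by NENGINES. */
    for (int i = 0; i < NENGINES - 1; i++)
        if (fork() == 0)
            break;
    engine_loop(sock);
    return 0;
}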

Now consider that, while maximum concurrency was an issue, so was
locality of file data sets, and locality of code pages, with the
two contending for the limited memory pools divided between the VM
and buffer cache (effectively, there was a total set size, with a
reserve held back for each type of page, and the remaining pages
were contended for).

Use of the "fixed" scheduling class was not an option.

Using threads would not have allowed preferential scheduling between
the tasks, nor would it have allowed sharing of all client context
(though it would have allowed descriptor sharing) without some form
of marshalling and locking.  This mattered because a client that did
not believe the server was responding "fast enough" would repeat the
request, and it was necessary to respond to these clients with a
"server busy" message.  The reason for this is that IPX is an
unreliable datagram protocol, like UDP, and does not have a retry
mechanism built into it.

The upshot of this is that, with threads, the per-process working
set would be very large, and would be fragmented across the process
address space.  This increased contention well above what a process
could withstand without forcing VM pressure on the buffer cache.
But the reason for the existence of the software was to be a *file*
server, so this was unacceptable.

By separating the address spaces, this pressure was reduced, and
the amount of overall contention was reduced, thus reducing the
buffer cache pressure from the processes.

In the limit, with all processes fully utilized (i.e. a request
backlog at the stream MUX), it equalled out in performance.  In
the common case, however, not all tasks were utilized all the
time, and it was possible to allow them to be paged out.

On top of this, there were a number of speed benefits to System V
shared memory for the client contexts; if you have read "The Magic
Garden Explained", these should be pretty obvious.  John Dyson
made a number of similar changes to the FreeBSD implementation for
Oracle Corporation, when Oracle was using FreeBSD as the basis of
its "Network Computer" server.  Basically, the pages are VM pages
only, with no write-through to the backing store; in the SVR4 case,
this would have been buffer cache pages, backed by swap, if this
were anonymous memory instead (the kind you get in a threads heap).
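A minimal sketch of that shared client context idea, assuming a
hypothetical client_ctx layout and table size; the point is that a
System V segment is shared VM pages by mapping, with no write-through
to backing store, and that every engine sees the same table:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define MAX_CLIENTS 1024

struct client_ctx {              /* hypothetical per-connection state */
    int   in_use;
    long  session_id;
    char  cwd[256];              /* current directory handle, etc. */
};

int main(void)
{
    /* One segment holds every client's context. */
    int shmid = shmget(IPC_PRIVATE, MAX_CLIENTS * sizeof(struct client_ctx),
                       IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); exit(1); }

    struct client_ctx *ctx = shmat(shmid, NULL, 0);
    if (ctx == (void *)-1) { perror("shmat"); exit(1); }

    /* Engines forked after this point inherit the attachment; unrelated
     * processes could attach the same shmid instead.  Either way, all
     * engines see the same pages, so any engine can service any client. */
    ctx[0].in_use = 1;
    ctx[0].session_id = 42;
    printf("context table at %p, segment id %d\n", (void *)ctx, shmid);

    shmdt(ctx);
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}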

Thus we come to part 2, which is that we modified the streams MUX
to ensure that requests were assigned to engines waiting on the
stream MUX FD with a write+read request in LIFO, rather than FIFO,
order.  By doing this, we were able to ensure that, most likely,
the pages which were going to be requested were "in core" in the
process making the request (the default FIFO ordering would have
resulted in a guarantee that the pages were not in core).  I
dubbed this approach "hot engine scheduling".
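A minimal sketch of the LIFO dispatch idea; the engine count and the
dispatcher data structure are placeholders, not the actual streams MUX
modification:

#include <assert.h>
#include <stdio.h>

#define NENGINES 8

static int idle_stack[NENGINES];   /* engine ids, most recently idle on top */
static int idle_top = 0;           /* number of idle engines */

/* An engine has finished a request: push it, making it the "hot" engine. */
static void engine_idle(int engine)
{
    assert(idle_top < NENGINES);
    idle_stack[idle_top++] = engine;
}

/* A request has arrived at the MUX: hand it to the hottest idle engine.
 * Returns -1 if all engines are busy, i.e. the request backlogs at the
 * MUX.  Picking from the other end (FIFO) would instead select the engine
 * that had been idle longest, i.e. the one most likely paged out. */
static int dispatch_request(void)
{
    if (idle_top == 0)
        return -1;
    return idle_stack[--idle_top];
}

int main(void)
{
    for (int e = 0; e < NENGINES; e++)
        engine_idle(e);

    /* The engine that just finished is also the one picked next. */
    int first = dispatch_request();
    engine_idle(first);
    int second = dispatch_request();
    printf("first=%d second=%d (same engine reused)\n", first, second);
    return 0;
}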

Had we attempted a similar approach in the threads case, then besides
the completely fragmented process memory, which caused a much larger
number of pages to need to be resident to do the same work, the
MUX assignment of "work to do" would in fact have been effectively
"random".

At anything less than total utilization of the system, random
allocation of work units was *worse*, and LIFO allocation was *better*.


And that's why using processes instead of threads resulted in less
page thrashing.


There were, of course, other reasons for using processes instead
of threads, primary among which was "better quantum utilization"
(SVR4.2/UnixWare 2.0 did not support thread group affinity in the
scheduler; as you have discovered, supporting that is NP-hard, unless
you get tricky, and make migration explicit and initial selection
intentional).


-- Terry

