1:1 Threading implementation.

Terry Lambert tlambert2 at mindspring.com
Wed Mar 26 20:55:46 PST 2003


Jeff Roberson wrote:
> On Wed, 26 Mar 2003, Terry Lambert wrote:
> > Jeff Roberson wrote:
> > > Well, I wasn't doing userland stuff until three days ago.  I think mini
> > > has just been very busy with work.  I suspect that you're going to need
> > > to start doing userland work or find someone to do it if you want to get
> > > it done soon.
> >
> > In theory, the library API will be identical to the pthreads
> > standard, and will require no changes to programs written to
> > that standard.  Most threaded programs these days are written
> > to the standard.  Some threaded programs make invalid assumptions
> > about rescheduling following an involuntary context switch, or
> > ability to make particular blocking calls.
> 
> I'm not sure what API compatibility has to do with anything?

The API is the only thing that matters about the userland work,
apart from side effects of assumptions which end up visible to
users of the API, since the API itself exposes nothing about the
implementation behind it, of course.

In other words, your suspicion about the userland work is
incorrect.


> > The first will still be a problem (e.g. Netscape's Java/JavaScript
> > GIF rendering engine is probably still serializing requests due to
> > non-thread reentrancy).
> >
> > The second should not be an issue, either with your implementation
> > of the 1:1, or Jon Mini's implementation of the N:M model.
> 
> I'm not sure I know what you're talking about.  Blocking calls are either
> handled by an upcall in M:N or by having independent contexts in 1:1.

The current FreeBSD user space pthreads implementation causes
versions of Netscape that do not intentionally serialize GIF
renderings in Java/JavaScript UIs, and which contain multiple
images, to fail catastrophically if the mouse is moved over a
GIF while it is loading.

This was true for some implementations of Slashdot, and it was
also true for the InterJet UI, as of version 3.x of WhistleWare
(based on FreeBSD 3.x), until specific code was included in the
UI to delay GIF request processing so as to serialize requests.

The specific problem, which I suspected from the symptoms, and
later confirmed by decompiling the Netscape code in question,
was that the rendering engine was not thread-reentrant, and was
making an assumption about scheduler behaviour that was not
warranted by anything other than a kernel threading
implementation that would guarantee resumption of the previously
preempted thread.  This assumption was true on Windows, true on
Linux, true on Solaris, but false on FreeBSD and false on MacOS
9.  And, in fact, we saw UI crashes on FreeBSD and Mac OS 9
which did not occur on the other platforms.

The problem is that the POSIX interface does not enforce
anything against people making unwarranted assumptions in the
code which uses it.

Therefore, it is possible that some bugs in vendor code will
be revealed by differences in implementation.  The Netscape
4.x GIF renderer is one such piece of code.


For the record, blocking calls can be handled in the 1:1
case through upcalls, as well.  It's an implementation
detail that's irrelevant to the API that gets exposed.


> > > > Wouldn't it have been easier to have one KSEGRP+KSE+thread per user
> > > > thread? Having one ksegrp and many KSEs requires changing the kernel
> > > > code where doing it the other way you could do it without making any
> > > > changes.
> > >
> > > I don't understand?  There are relatively minor changes to the kernel to
> > > support this.  Since nice is a property of the process, it makes sense
> > > that there is only one ksegrp per process.  I'm starting to think that the
> > > ksegrp was overkill in general.
> >
> > The KSEGRP is, effectively, a virtual processor interface, and
> > was/is intended for use by the scheduler to ensure CPU affinity
> > for individual threads, and CPU negaffinity for multiple threads
> > within a process.  In other words, according to the published
> > design documents, it's a scheduler artifact.
> 
> This is the KSE, not the KSE group.  The KSE Group was intended to allow
> multiple groups of threads with different scheduling algorithms or
> different base priorities (nice).

This is *one* of a number of *important* effects of the KSEGRP;
the other effects are subtle, but are no less important.


> > Personally, I've never seen the need for virtual processors, but
> > then I've always advocated "intentional start/intentional migration"
> > for the scheduler model (one of my arguments for a per CPU migration
> > queue, and a push-model, rather than a pull-model for redistribution
> > of an unbalanced load).
> 
> The push model suffers from as much as a one tick latency in any
> migration.  In many cases probably more than that.

You've made this statement before, and then never defended it.

I think you mean that it has a latency of half the average
*quantum actually used*, +/- half a quantum, in the worst case
scenario.  That happens when you push a process to another CPU,
and that CPU doesn't notice until its next context switch (which
is *rarely* a full quantum away, and is *usually* a fraction of
a quantum).

This is because you are incorrectly pushing the process at the
head of the run queue.  This will, indeed, introduce the latency
you suggest (a smaller latency than you imply, BTW, and probably
ignorable, in fact).

HOWEVER.  The *correct* implementation will push the *second
from the head*, and then schedule the head to run.  This ensures
that the latency is no more than 1 full quantum, and then only
for a thread that would not have run locally for up to 1 full
quantum anyway.

In addition, since it is being pushed to the least loaded CPU,
and load is a measurement of number of items pending on the
ready-to-run queue, I would argue that, in fact, the push model
results in *significantly reduced* latency, on average, compared
to the pull model.

In addition, it eliminates all scheduler lock contention in
the common case, which is the non-migration case.
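
To make that concrete, here is roughly what I mean, as a sketch
only (the names are made up; this is not the actual FreeBSD
scheduler code).  It runs at local context switch time, on this
CPU's *private* run queue, so no lock is taken on the hot path:

    /*
     * Sketch: push-model balancing at context switch time.
     * "percpu_sched", "runq_*", "least_loaded_cpu" and
     * "migration_enqueue" are all hypothetical names.
     */
    struct proc *
    choose_next(struct percpu_sched *self)
    {
            struct proc *head = runq_first(&self->runq);
            struct proc *second;
            struct percpu_sched *target;

            if (head != NULL && (second = runq_next(head)) != NULL) {
                    target = least_loaded_cpu();
                    if (target != self && load(target) + 1 < load(self)) {
                            /* Push the *second* entry; the head
                             * runs now, so it sees no added latency. */
                            runq_remove(&self->runq, second);
                            migration_enqueue(target, second);
                    }
            }
            return (head);
    }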


BTW, with a *pull* model, which requires an average of 1.5
lock contentions in order to accomplish a context switch for
an individual CPU in a 2 CPU system, on a 3GHz processor with
a 433MHz front-side bus, each bus-speed stall costs ~7 CPU
clock cycles (3000/433 ≈ 7); that's ~7 clocks worth of stall
barrier per lock contention, or 10.5 clocks per scheduler
lock acquisition *by its own CPU*, even if you decide not to
migrate *anything*.

That *assumes* that all locks are allocated so as to fall on
cache line boundaries, which FreeBSD *FAILS* to do.
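
Getting that right is not hard, for what it's worth.  A sketch
of the discipline (the 64 byte line size is an assumption, and
"struct spinlock" here stands in for whatever the lock type is):

    /*
     * Pad and align each lock so two CPUs never bounce the same
     * cache line by touching adjacent locks ("false sharing").
     */
    #define CACHE_LINE      64      /* assumed, not queried */

    struct spinlock;                /* placeholder lock type */

    struct padded_lock {
            struct spinlock lock;
            char            pad[CACHE_LINE - sizeof(struct spinlock)];
    } __attribute__((aligned(CACHE_LINE)));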


> The overhead from locking queues is far outweighed by the
> cases where your cpu is sitting idle due to an unbalanced load.

I suggest your numbers are in error.  Please examine the papers
on this topic in the book:

        Scheduling and Load Balancing in Parallel and Distributed
                Systems
        Behrooz A. Shirazi
        IEEE Computer Society
        ISBN: 0818665874

It is a compilation of IEEE papers on the topic, and contains
dozens of papers with statistics that refute your claims for
shared memory multiprocessors.

And most of these papers assume the clock multiplier is very
small, if it exists at all.  On modern systems, the amount of
overhead in their statistics for locking should be multiplied
considerably, since the clock multiplier is much higher than
it was.


> The latency is one tick because each cpu would have to poll the load of
> other cpus at some interval to discover that it is out of balance.  Or it
> could check if it had more than one process on the run queue, which seems
> a bit silly.  Regardless, you're probably only going to get to make this
> decision once a tick, which means the other cpu(s) can sit idle for at
> least that long.

I have one question: how did you get into this unbalanced
load situation, where you have 100 processes ready to run on
one CPU, and 0 processes ready to run on the remaining 3 CPUs?

I argue that this situation might in fact occur as an initial
state, but that the steady state would be an even distribution
of work over time.

In other words, your example refers to what's called a "flash
crowd" case -- roughly equivalent to a fork-bomb.  And I have
already stated, at least for the thread creation case, that I
support "intentional start", where the CPU you pick to put an
initial new thread on, is based on the load.

So I do not understand how this situation could arise, other
than in laboratory conditions.


> Consider a buildworld -j8.  You have many processes rapidly stopping and
> starting.  Without a pull, a cpu that was very loaded could suddenly end up
> with no running processes and have to idle until the other gave it work.
> This imbalance is likely to go back and forth; I have observed this
> personally when writing ULE.

The rapidity of the start/stop is irrelevant.  The instantaneous
load at the time of the next start *is* relevant.

I would argue that steady-state performance is more important;
further, I would suggest that, in using "make -j#", you select
"#" to be a factor of 3 or more larger than the number of
available CPUs (e.g. -j12 or more on a 4 CPU box), to ensure
correct hysteresis.

If this still doesn't fix your problem, I would suggest that
the duration of your quantum (lbolt value) is too large,
compared to how long processing actually occurs, before it
hits a blocking sleep call.


> I think you need both push and pull.  The pull satisfies the case where
> you have short lived but rapidly reappearing processes.  The push
> solves more long term load imbalance issues.  If you have, for example,
> many apache processes that are very busy.  no cpu will go idle, so pull is
> ineffective, but they may still be imbalanced.  This is still missing from
> ULE.

I don't think you can implement pull without locking your own
queue in order to access it.  I *know* you can implement push
without locking your own queue, *ever*, and then only deal with
the locking of a per CPU auxiliary queue when you decide you
have to migrate a process.  I argue that this should occur only
*rarely*: it is not the common case.

Further, I don't think you can reasonably implement both push
and pull in the same implementation.

The problem comes down to whether or not you engage in the
examination of another CPU's scheduling queue.  If you do this,
you have to lock, and you end up stalling both CPUs in order
to do it.

This is a factor of 2 multiplication, minimally, on the stall,
and is probably a heck of a lot more in FreeBSD, since all
the other CPUs are doing the same thing, and you have L1 and
L2 cache flushes and TLB shootdowns, etc., as a result.
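
To sketch the auxiliary queue arrangement I mean (again with
made-up names): the owner CPU never locks its own run queue; the
only lock anywhere is on a small per-CPU migration inbox, taken
on the rare push, and drained locally:

    struct percpu_sched {
            struct runq     runq;       /* private: owner CPU only,
                                           never locked */
            struct spinlock mig_lock;   /* protects mig_queue only */
            struct runq     mig_queue;  /* inbox for pushed work */
    };

    /* Remote CPU pushing work to us: the only contended path. */
    void
    migration_enqueue(struct percpu_sched *sc, struct proc *p)
    {
            spinlock_acquire(&sc->mig_lock);
            runq_add(&sc->mig_queue, p);
            spinlock_release(&sc->mig_lock);
    }

    /* Owner CPU, at context switch: common case takes no lock. */
    void
    drain_migrations(struct percpu_sched *sc)
    {
            if (runq_empty(&sc->mig_queue))     /* unlocked peek */
                    return;
            spinlock_acquire(&sc->mig_lock);
            runq_concat(&sc->runq, &sc->mig_queue);
            spinlock_release(&sc->mig_lock);
    }

The unlocked peek races with a concurrent push, but the race is
benign: at worst the pushed process is picked up one context
switch later, which is the latency bound argued above.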


> 
> > In a scheduler model where a sheduler *pulls* work, either from
> > another CPU ready-to-run queue, or from a single ready-to-run
> > queue that is global to the system (in either case, requiring
> > locks in the scheduler path, potentially highly contended locks),
> > the idea of a KSEGRP/"virtual processor" is necessary for globally
> > migratable and contendable "bookkeeping" objects.
> 
> They should only be contended when cpus have nothing to do.  A worthwhile
> tradeoff I'd say.

Define "nothing to do".  The cached lock structure gets zapped in
all other processes which have it read-caches, as soon as it's
written by any CPU to acquire the lock.  Minimally, it's reread,
as necessary, from the L2 cache (the last operation is a write to
release the lock).  Worst case, it's main memory, and your stall
goes up by a factor of 4.

It's clear to me that shared memory SMP systems with large clock
multipliers *must* pretend that they are distinct NUMA CPUs, as
much as possible, in order to avoid stall barriers.

BTW: the pull model does not work for NUMA systems, since the
memory you are attempting to examine is non-local, and a
distributed cache coherency and messaging protocol must be used
to get the data -- if it's available at all.  So FreeBSD SMP is
screwed from ever running on 64 processor SPARC boxes (for
example), if it uses the pull model.

The push model, on the other hand, can message the process into
the queue of the target CPU, using the built-in hardware messaging
mechanism (a cooperative transfer of the image has to happen as a
result of the message, but that particular overhead is largely
avoidable, using state synchronization via swap, and latency can
be further reduced through checkpointing).


> > So in the current scheduler implementations, KSEGRP is necessary;
> > in the 1:1 model, it's necessary, if only to ensure negaffinity
> > (4 CPU system, process with 4 threads, ensure each thread gets its
> > own CPU, and does not migrate away from it).
> 
> You're talking about the KSE again.  I think CPU affinity has little to do
> with the M:N or 1:1 choice except that it is much more difficult to
> achieve CPU affinity when you have to make a multitiered scheduling
> decision.  To get real affinity in M:N you need kse to cpu affinity and
> thread to kse affinity.  You also need userland thread to kernel
> thread affinity, or at least userland thread to KSE affinity.

What do you think is on the scheduler queue or the wait queue,
if it's not a KSE?  There's no such thing as a thread, distinct
from the context in which it exists.


> > You could also take this idea much further.  Specifically, SVR4
> > flags system calls as "non-blocking", "blocking", and "potentially
> > blocking".  By doing this, they can lazy-bind context creation for
> > blocking operations on "blocking" and "potentially blocking" calls,
> > and avoid it altogether on "non-blocking" and sometimes avoid it on
> > "potentially blocking" calls.
> 
> KSE already does better than this by only creating a new context when you
> actually block.  The upcall mechanism specifically addresses that need.
> This is separate from what we were discussing above which is allowing the
> scheduler to have a chance to initialize data when a new context is
> created.

The point is that there is "low hanging fruit".

By knowing up front that there is no chance of blocking, you
can play "fast and loose".

It seems to me, from watching the -CURRENT code, that people
can't decide if they are grabbing locks to protect data
objects, or locks to protect code paths.  Settling on a single
rule of thumb -- and a well-defined place where it can be
ignored -- would resolve a lot of the redundant locking that
happens now.


> > This can result in a significant overhead savings, if the kernel
> > implementation evolves, but the user space implementation remains
> > fixed.
> >
> > It's good to decouple these things from each other (IMO).
> 
> Which things?

The idea of kernel entrancy, and the continued need for a
context which can be put on a sleep queue vs. put on a
scheduler queue.  That's not distinct in the current
implementation.  In fact, the same list element pointer
in the same structure is used to link both lists.


> > Everyone does this.  Novell did it back in 1993.  Sun's turnstiles
> > are based on the tradeoff between spinning and waiting, and how
> > many times you have to do that before it's worth crossing the
> > protection domain, and blocking.
> 
> I think you mean sun's adaptive mutexes.  The turnstile is just the
> queue that you block on if I'm remembering correctly.  The blocking queue
> I used for umtx is a similar context where the queue migrates among the
> blocking threads.

Yes, adaptive mutexes, sorry.


> > When we did this in 1993 (Novell's implementation was primarily
> > by Dave Hefner, who now works for Microsoft, I believe), we ended
> Any relation to hugh?

He hates that.  8-).


> > > > My only comment is that since mini is supposed to be doing the
> > > > M:N library, isn't this a bit of a distraction?
> > >
> > > I'll let him comment on this.
> >
> > I'll stick my nose in: I think it's a good idea, since TPTB have
> > recently made noises on a couple of FreeBSD lists about "rapidly
> > approaching deadlines for the KSE work".
> >
> > Consider it insurance on your investment, people.
> 
> Yes, it isn't necessarily a KSE replacement.

But maybe it is, and will be for 6 months, or a year, if it uses
the same kernel mechanisms for its implementation.  That's why
Julian's comments about the kernel changes are important.

Note: I'm not saying they aren't actually necessary, only that
they merit discussion.  So far, the justifications you've offered
all revolve around your perceived irrelevance of the KSEGRP,
separate from the process, as a container object.

This is true in ULE, as you've implemented it so far, but it's
probably not true, overall.


> > There is also the fact that affinity and quantum are very hard to
> > maintain on a system with a heterogeneous load.  In other words,
> > 1:1 looks good if the only thing you are running is a single
> > multithreaded process, but looks a *lot* less good when you start
> > running real-world code instead of fictitious benchmarks that
> > try to make your threading look good (e.g. measuring only thread
> > context switches, with no process context switch stall barriers,
> > etc.).
> 
> Yes, I see what you're getting at.  M:N allows you to keep running until
> you've exhausted your whole slice by selecting another thread.  You could
> accomplish this in 1:1 by loaning your slice to the next available thread
> that was bound to the same cpu and force a switch to that.  That's a neat
> idea.  I'll have to look into this for ule.

It's hard to do correctly in the kernel, because the scheduler
that's making the decision has to either support a variable
quantum granularity (I've seen it implemented that way before,
but it's ugly), or it has to try and make "fairness" decisions
that it's not in a position to make.

For example, a thread makes a call and gives up its quantum, and
then other threads in the same process run, because you're not
out of quantum, and then the first thread's wait condition is
satisfied: who do you schedule first?  The answer has to be a
PTHREAD_SCOPE_PROCESS prioritization policy.  8-(.
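
A sketch of the decision itself (hypothetical structures; this
is the policy, not an implementation of it): made inside the
process, with process-scope priorities, it reduces to a
comparison the kernel is in no position to make fairly:

    /* Process-scope tie-break when a quantum lender wakes up
     * while a borrower is still running on the lent quantum. */
    struct uthread {
            int     prio;   /* PTHREAD_SCOPE_PROCESS priority */
    };

    struct uthread *
    scope_process_pick(struct uthread *borrower,
                       struct uthread *lender)
    {
            /* The kernel would need both threads' process-
             * relative priorities; the process already has them. */
            return (lender->prio > borrower->prio ? lender
                                                  : borrower);
    }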


> > I can tell you from personal experience with such a model, that
> > it *VASTLY* outperforms a 1:1 kernel threading model, even if you
> > end up running multiple state-machine instances on multiple CPUs.
> > We got more than a 120X increase in NetWare for UNIX, simply by
> > changing the client dispatch streams MUX to dispatch to worker
> > processes instead of threads, in LIFO instead of FIFO order,
> > simply because it ensured that the process pages you cared about
> > were more likely to be in core.
> 
> Yeah, the LIFO trick is widely used.  I believe apache does something of
> this sort.  It's also discussed on the c10k problem page.  I'm not sure
> why you got better perf out of processes than threads though.  This is
> sort of confusing.

I could avoid competing with other processes in the system
for scheduler quantum; overall scheduler usage and system
time were reduced as a result.

This was partially a result of the "quantum lending" I spoke
of; it was actually called "It's my damn quantum!" in the
presentation we made.  8-).  The idea is that if the system
gives me a quantum to use... it's my damn quantum!  And I
should not have to sacrifice it, merely because I have a
single context out of many that wants to make a call that
would block.  By using this approach, if you are running
heterogeneous processes, using 1/16th of your quantum doesn't
mean paying a complete context switch overhead just because
all your threads had to compete with, say, "cron", running
once a second -- and if you lose that competition, you pay a
full context switch overhead.

The kernel boundary crossing is also very expensive in SVR4;
FreeBSD has reduced this somewhat, but it's still pretty far
behind Linux in this regard, so it's not as cheap to switch
threads in kernel space as in user space.  It's not cheap under
Linux, either, but they only ever benchmark homogeneous threads
in a single application on a relatively quiescent system.  There
are lies, damn lies, and statistics... then, there's benchmarks.


> > 1:1 threading is useful for one thing, and one thing only: SMP
> > scalability of single image processes.  And it's not the best at
> > doing that.
> 
> It's also good at providing extra contexts to block on for IO
> worker threads.

So's AIO, and it works more efficiently.  So does kqueue, for
that matter.
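
For reference, here's how little it takes: a minimal kqueue
read loop, one context servicing many descriptors (error
handling trimmed; handle_ready() is whatever the application
supplies, named here for illustration):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    extern void handle_ready(int fd);   /* app-supplied */

    void
    serve(int *fds, int nfds)
    {
            int i, n, kq = kqueue();
            struct kevent ev, chg;

            /* Register read-readiness interest, once per fd. */
            for (i = 0; i < nfds; i++) {
                    EV_SET(&chg, fds[i], EVFILT_READ, EV_ADD,
                        0, 0, NULL);
                    kevent(kq, &chg, 1, NULL, 0, NULL);
            }
            /* One blocking point for the whole set: no worker
             * thread per descriptor, no 1:1 contexts. */
            for (;;) {
                    n = kevent(kq, NULL, 0, &ev, 1, NULL);
                    if (n > 0)
                            handle_ready((int)ev.ident);
            }
    }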


> Furthermore, it's really good at being implemented quickly,
> which is especially important considering that it's 2003 and we
> don't have kernel supported threads...

OK, can't argue with that one.  It's one of the reasons I
liked that you did your implementation in the first place.

8-).


> > > Furthermore, m:n's basic advantage is less overhead from staying out of
> > > the kernel.
> >
> > No, actually, it's the ability to fully utilize a quantum, and
> > to not have to make a decision between one of your own threads
> > and some other process, off the run queue, when making a decision
> > in the scheduler about what to run next.
> 
> Yeah, I just remembered this bit.  See my answer above.  I think I'll do
> this trick in ULE.

Good luck... it's very hard to do in a kernel scheduler, without
overly complicating things, I'm afraid.


> > > For people who are doing thread pools instead of event driven IO
> > > models they will encounter the same overhead with M:N as 1:1.
> >
> > This is actually false.  In 1:1, your thread competes with all
> > other processes, in order to be the next at the top of the run
> > queue.  Statistically, you are doing more TLB flushes and shootdowns,
> > and more L1 and L2 cache shootdowns, than you would otherwise.
> 
> This is the same argument about using your whole slice eh?

It's the inverse.  It's what gives the lie to most "benchmarks",
and why, if you are running a web server with CGIs, you get
far worse performance than the threads people said you would
get.  8-).


> > Solving this problem without intentional scheduling has been
> > proven to be NP-complete: it is not a problem which is
> > solvable in polynomial time.
> 
> eh? Which problem is NP?

Solving the "Who do I run next to balance saving context switches
vs. fairness?", if you treat each voluntary context switch as a
restart of the timer until the next involuntary context switch.

Even lending is hard, once you get into the timer code and see
the evil things it does to get the lbolt clock, and the timer
optimizations on system call exit.  8-(.  But at least it's
not NP-complete.  8-).


> > > I'm not sure what applications are entirely compute and have more threads
> > > than cpus.  These are the only ones which really theoretically benefit.  I
> > > don't think our threading model should be designed to optimize poorly
> > > thought out applications.
> >
> > By that argument, threads should not be supported at all... 8-) 8-).
> 
> I meant to say 'entirely compute bound'.  If you just want CPU and no IO
> then you probably only want as many threads as processors.  This is the
> most efficient arrangement.  I'm not arguing against threads although I do
> think they are often abused.

If the intent is optimization, the answer is never threads; that
was my point.  We would be teaching people to build finite state
automata instead, managing their own contexts.  I would even
argue that the code you'd get would be better, since it would
ensure that your per-session state never ended up in globals.
8-) 8-).
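
To show the shape of it, a minimal sketch (illustrative names;
read_some()/write_some() stand in for non-blocking I/O steps):
one automaton per session, every bit of session state in the
session object, nothing in globals:

    enum state { ST_READ_REQ, ST_SEND_RESP, ST_DONE };

    struct session {
            enum state      state;
            int             fd;
            size_t          off;        /* progress in current I/O */
            char            buf[8192];
    };

    extern int read_some(struct session *);    /* hypothetical */
    extern int write_some(struct session *);   /* hypothetical */

    /* Driven from an event loop: advance one step, never block. */
    void
    session_step(struct session *s)
    {
            switch (s->state) {
            case ST_READ_REQ:
                    if (read_some(s))
                            s->state = ST_SEND_RESP;
                    break;
            case ST_SEND_RESP:
                    if (write_some(s))
                            s->state = ST_DONE;
                    break;
            case ST_DONE:
                    break;
            }
    }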


> > But by your same arguments, CPU clock multipliers have grown
> > to the point that memory bus and I/O bus stalls are so
> > expensive that SMP makes no sense.
> 
> I might agree with you there.

Yeah, they've pissed me off, ever since my 486DX-50 (*not* DX/2-50!).
8-).


> > > Then if you look at the number of system calls and shenanigans a UTS must
> > > do to make proper scheduling decisions it doesn't look like such an
> > > advantage.
[ ... ]
> > I think the kernel boundary crossing overhead, and the fact
> > that, in doing so, you tend to relinquish a significant
> > fraction of remaining quantum (by your own arguments) says
> > that protection domain crossings are to be avoided at all costs.
> 
> Yes, I agree, and without serious tweaking our current M:N significantly
> increases the number of system calls.

Yes.  The signal masking is particularly heinous.  I don't know
what to do about it.  8-(.

My gut reaction is "BSD signals"; before all this POSIX crap
turned BSD into SVR3, interrupted system calls restarted by
default.  There's a nice threads package from ~1988 that used
this fact, called "sigsched"; it's in the comp.sources.unix
archives.  Doesn't work any more, unless you call siginterrupt()
and then avoid POSIX signal interfaces.  8-(.
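
For reference, that escape hatch is one call: siginterrupt()
with a zero flag restores the old BSD restart behaviour for a
given signal (SIGUSR1 here is just an example):

    #include <signal.h>

    void
    restore_bsd_restart(void)
    {
            /* flag = 0: system calls interrupted by this signal
             * are restarted (pre-POSIX BSD behaviour); flag = 1
             * gives the POSIX-style EINTR behaviour instead. */
            siginterrupt(SIGUSR1, 0);
    }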

> > I'm glad you pursued it, even though I do not agree with your
> > reasoning on the value of N:M vs. 1:1.  I view it as "life
> > insurance" for the KSE code, which some people might be
> > otherwise tempted to rip out over some arbitrary deadline.
> >
> > Thank you for your work here, and thank everyone else for
> > their work, too.
> 
> Thanks for the feedback.  It has been stimulating.  I still need to
> consider multithreading implications of 1:1 for ULE.  This has given me a
> bit more to work on there.

I wish you had been at the original SMP meetings with Jason Evans,
Matt Dillon, and the 50+ other folks who showed up each time; it
would be a lot easier if everyone had the same context.  8-(.

In the quantum lending, be sure that you look carefully at the
involuntary context switch timer, and when it gets reset.  It's
scary in there.  8-).

-- Terry

