1:1 Threading implementation.

Jeff Roberson jroberson at chesapeake.net
Wed Mar 26 10:58:03 PST 2003


On Wed, 26 Mar 2003, Terry Lambert wrote:

> Jeff Roberson wrote:
> > Well, I wasn't doing userland stuff until three days ago.  I think mini
> > has just been very busy with work.  I suspect that you're going to need
> > to start doing userland work or find someone to do it if you want to get
> > it done soon.
>
> In theory, the library API will be identical to the pthreads
> standard, and will require no changes to programs written to
> that standard.  Most threaded programs these days are written
> to the standard.  Some threaded programs make invalid assumptions
> about rescheduling following an involuntary context switch, or
> ability to make particular blocking calls.

I'm not sure what API compatibility has to do with anything?

> The first will still be a problem (e.g. Netscape's Java/JavaScript
> GIF rendering engine is probably still serializing requests due to
> non-thread reentrancy).
>
> The second should not be an issue, either with your implementation
> of the 1:1, or Jon Mini's implementation of the N:M model.

I'm not sure I know what you're talking about.  Blocking calls are either
handled by an upcall in M:N or by having independent contexts in 1:1.

>
>
> > > Wouldn't it have been easier to have one KSEGRP+KSE+thread per user
> > > thread? Having one ksegrp and many KSEs requires changing the kernel
> > > code, whereas doing it the other way you could do it without making any
> > > changes.
> >
> > I don't understand?  There are relatively minor changes to the kernel to
> > support this.  Since nice is a property of the process, it makes sense
> > that there is only one ksegrp per process.  I'm starting to think that the
> > ksegrp was overkill in general.
>
> The KSEGRP is, effectively, a virtual processor interface, and
> was/is intended for use by the scheduler to ensure CPU affinity
> for individual threads, and CPU negaffinity for multiple threads
> within a process.  In other words, according to the published
> design documents, it's a scheduler artifact.

This is the KSE, not the KSE group.  The KSE Group was intended to allow
multiple groups of threads with different scheduling algorithms or
different base priorities (nice).

> Personally, I've never seen the need for virtual processors, but
> then I've always advocated "intentional start/intentional migration"
> for the scheduler model (one of my arguments for a per CPU migration
> queue, and a push-model, rather than a pull-model for redistribution
> of an unbalanced load).

The push model suffers from as much as a full tick of latency on any
migration, and in many cases probably more than that.  The overhead from
locking queues is far outweighed by the cases where your cpu is sitting
idle due to an unbalanced load.

The latency is one tick because each cpu would have to poll the load of
the other cpus at some interval to discover that it is out of balance.  Or
it could check whether it has more than one process on its run queue,
which seems a bit silly.  Regardless, you're probably only going to get to
make this decision once a tick, which means the other cpu(s) can sit idle
for at least that long.

Consider a buildworld -j8.  You have many processes rapidly stopping and
starting.  Without a pull, a cpu that was very loaded could suddenly end
up with no running processes and have to idle until another cpu gave it
work.  This imbalance is likely to bounce back and forth; I observed this
personally while writing ULE.

I think you need both push and pull.  The pull satisfies the case where
you have short-lived but rapidly reappearing processes.  The push solves
longer-term load imbalance.  If, for example, you have many apache
processes that are all very busy, no cpu will go idle, so pull is
ineffective, but the load may still be imbalanced.  This is still missing
from ULE.
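
A rough sketch of the combination, with made-up names (cpu_runq,
steal_one, and friends are illustrative only, not the real ULE
interfaces):

#define NCPU 8

struct thread;
struct spinlock { volatile int locked; };
void spin_lock(struct spinlock *);
void spin_unlock(struct spinlock *);

struct cpu_runq {
    struct spinlock lock;
    int             load;   /* runnable threads queued here */
};

struct cpu_runq runq[NCPU];

struct thread  *steal_one(struct cpu_runq *);
void            migrate_one(struct cpu_runq *, struct cpu_runq *);
int             find_busiest(void);
int             find_lightest(void);

/*
 * Pull: an idling cpu steals work immediately rather than waiting a
 * tick.  The queue locks are only contended when this cpu has nothing
 * better to do anyway.
 */
struct thread *
idle_pull(int self)
{
    struct thread *td;
    int c;

    for (c = 0; c < NCPU; c++) {
        if (c == self || runq[c].load < 2)
            continue;
        spin_lock(&runq[c].lock);
        td = steal_one(&runq[c]);
        spin_unlock(&runq[c].lock);
        if (td != NULL)
            return (td);
    }
    return (NULL);          /* really idle; wait for an interrupt */
}

/*
 * Push: once a tick, shed work from the busiest cpu to the lightest.
 * This catches the long-term imbalance that pull never sees because
 * no cpu ever goes idle.
 */
void
periodic_push(void)
{
    int from = find_busiest(), to = find_lightest();

    if (runq[from].load - runq[to].load >= 2)
        migrate_one(&runq[from], &runq[to]);
}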

> In a scheduler model where a scheduler *pulls* work, either from
> another CPU ready-to-run queue, or from a single ready-to-run
> queue that is global to the system (in either case, requiring
> locks in the scheduler path, potentially highly contended locks),
> the idea of a KSEGRP/"virtual processor" is necessary for globally
> migratable and contendable "bookkeeping" objects.

They should only be contended when cpus have nothing else to do.  A
worthwhile tradeoff, I'd say.

> So in the current scheduler implementations, KSEGRP is necessary;
> in the 1:1 model, it's necessary, if only to ensure negaffinity
> (4 CPU system, process with 4 threads, ensure each thread gets its
> own CPU, and does not migrate away from it).

You're talking about the KSE again.  I think CPU affinity has little to
do with the M:N vs. 1:1 choice, except that it is much more difficult to
achieve CPU affinity when you have to make a multitiered scheduling
decision.  To get real affinity in M:N you need KSE-to-cpu affinity and
thread-to-KSE affinity.  You also need userland-thread-to-kernel-thread
affinity, or at least userland-thread-to-KSE affinity.

> It's also minorly useful to distinguish PTHREAD_SCOPE_SYSTEM
> priority groups, when running multiple threads on a single CPU
> (either in the common single CPU case, or in the less common SMP
> case), as a means of avoiding priority inversion deadlocks.  I
> would like to see this done differently, which would get rid of
> KSEGRP, but would add a scheduler architecture dependency, which
> I think can't be gotten rid of easily.  It's a tradeoff (as usual).
>
>
> > > i.e. on creation of a new process, sched_newproc() is called
> > > and a KSE is added in there if the scheduler in question wants to use
> > > KSEs. If it doesn't, no KSE would be added, but it's still possible that
> >
> > Yes, I think we need more sched hooks here as well.  Having only
> > sched_fork() makes things sort of gross.  We'll have to hook this all up
> > later.
>
>
> You could also take this idea much further.  Specifically, SVR4
> flags system calls as "non-blocking", "blocking", and "potentially
> blocking".  By doing this, they can lazy-bind context creation for
> blocking operations on "blocking" and "potentially blocking" calls,
> and avoid it altogether on "non-blocking" and sometimes avoid it on
> "potentially blocking" calls.

KSE already does better than this by only creating a new context when you
actually block.  The upcall mechanism specifically addresses that need.
This is separate from what we were discussing above, which is giving the
scheduler a chance to initialize data when a new context is created.
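
To illustrate the difference (these names are made up for the example;
they are not the real KSE entry points):

struct kse;
struct kctx;
struct kctx *kctx_alloc(void);
void kctx_save(struct kctx *, struct kse *);
void upcall_blocked(struct kse *, struct kctx *);

/*
 * SVR4-style: the decision is made up front from a static per-syscall
 * flag, so a context is sometimes allocated and never used:
 *
 *     if (sysent[code].sy_flags & SY_MAYBLOCK)
 *         ctx = kctx_alloc();
 *
 * KSE-style: nothing extra happens on the fast path.  Only when the
 * thread truly goes to sleep does the kernel detach a context and
 * upcall, so the UTS can run another user thread on this KSE.
 */
void
sleep_hook(struct kse *kse)
{
    struct kctx *ctx = kctx_alloc();    /* pay only when blocking */

    kctx_save(ctx, kse);
    upcall_blocked(kse, ctx);           /* re-enters the UTS */
}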

> This can result in a significant overhead savings, if the kernel
> implementation evolves, but the user space implementation remains
> fixed.
>
> It's good to decouple these things from each other (IMO).

Which things?

>
> > > This was discussed recently as being the highlight of someone's
> > > threading model (I think Linux but I am not sure whose).
> >
> > Yes, Linux was discussing this.  It's a pretty common trick.  Even NT does
> > it, but apparently NT allocates kernel resources for user locks.  I was
> > pretty pleased that I got away without any per-lock allocations.
>
> Everyone does this.  Novell did it back in 1993.  Sun's turnstiles
> are based on the tradeoff between spinning and waiting, and how
> many times you have to do that before it's worth crossing the
> protection domain, and blocking.

I think you mean Sun's adaptive mutexes.  The turnstile is just the queue
that you block on, if I'm remembering correctly.  The blocking queue I
used for umtx is a similar construct, where the queue migrates among the
blocking threads.
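
The spin-then-block shape of it, sketched with C11 atomics
(kernel_wait/kernel_wake are stand-ins for the real umtx system calls;
the kernel just hashes the lock's address onto a transient sleep queue,
so nothing is allocated per lock):

#include <stdatomic.h>

#define SPIN_TRIES 100

void kernel_wait(atomic_int *, int);    /* sleep if *addr == expected */
void kernel_wake(atomic_int *);         /* wake one sleeper on addr */

typedef struct {
    atomic_int state;   /* 0 = free, 1 = held, 2 = held and contended */
} umutex;

void
umutex_lock(umutex *m)
{
    int expect, i;

    for (i = 0; i < SPIN_TRIES; i++) {  /* adaptive part: spin first */
        expect = 0;
        if (atomic_compare_exchange_weak(&m->state, &expect, 1))
            return;
    }
    /* Give up spinning; mark contended and sleep in the kernel. */
    while (atomic_exchange(&m->state, 2) != 0)
        kernel_wait(&m->state, 2);
}

void
umutex_unlock(umutex *m)
{
    if (atomic_exchange(&m->state, 0) == 2)
        kernel_wake(&m->state);         /* someone may be asleep */
}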

> When we did this in 1993 (Novell's implementation was primarily
> by Dave Hefner, who now works for Microsoft, I believe), we ended
Any relation to Hugh?

> up with 20,000 times the transaction-per-second performance of
> Tuxedo, which was the commercial record holder up to that point.

Sounds good.

>
> > > > The reason we're doing this in parallel with the M:N effort is so that we
> > > > can have reasonable threading sooner.  As I stated before, this project is
> > > > complementary to KSE and does not prohibit it from working.  I also think
> > > > that the performance will be better or comparable in the majority of real
> > > > applications.
> > >
> > > My only comment is that since mini is supposed to be doing the
> > > M:N library, isn't this a bit of a distraction?
> >
> > I'll let him comment on this.
>
> I'll stick my nose in: I think it's a good idea, since TPTB have
> recently made noises on a couple of FreeBSD lists about "rapidly
> approaching deadlines for the KSE work".
>
> Consider it insurance on your investment, people.

Yes, it isn't necessarily a KSE replacement.

>
> > > You should be creating a new KSEGRP (subproc) per thread.
> > > I think you will find that if you do, things will fall out easier
> > > and you won't break the next KSE changes.
> >
> > I don't understand what I may break?
>
> See above for KSEGRP reasoning.  I think it's representative,
> but, if you have time, you may want to read the documentation
> for the KSE project.  If other people want to comment or correct
> my own comments in this regard (I have been largely an observer,
> since after the second threads meeting where my async call gate
> idea was brutally .. uh, "laid to rest" ;^)), they should feel
> free to do so.
>
>
> > > I'm not against having a separate 1:1 thread capability, but
> > > all this work could have been well spent getting M:N threads
> > > better supported and even getting it to
> > > be able to run in 1:1 mode as a byproduct..
> >
> > I don't think M:N is the way to go.  After looking things over and
> > considering where it is theoretically faster I do not think it is a
> > worthwhile pursuit.
> >
> > First off, it is many months away from being even beta quality.  I think
> > the UTS is far more complicated than you may realize.  There are all sorts
> > of synchronization issues that it was able to avoid before since only one
> > thread could run at any time and there essentially was no preemption.  It
> > now also has to deal with efficient scheduling decisions in an M:N model
> > that it didn't have to worry about before.
>
> I would not recommend abandoning the idea, personally.  There is a
> huge -- and I mean *huge* -- amount of literature that likes the
> N:M model.
>
> There is also the fact that affinity and quantum are very hard to
> maintain on a system with a heterogeneous load.  In other words,
> 1:1 looks good if the only thing you are running is a single
> multithreaded process, but looks a *lot* less good when you start
> running real-world code instead of fictitious benchmarks that
> try to make your threading look good (e.g. measuring only thread
> context switches, with no process context switch stall barriers,
> etc.).

Yes, I see what you're getting at.  M:N lets you keep running until
you've exhausted your whole slice by selecting another thread.  You could
accomplish this in 1:1 by loaning your slice to the next available thread
bound to the same cpu and forcing a switch to it.  That's a neat idea.
I'll have to look into this for ULE.
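
Roughly this, I think (all names made up; this is just the shape of
the idea, not anything that exists in the tree):

struct proc;
struct thread {
    struct proc   *td_proc;
    struct thread *td_next;     /* sibling within the same process */
    int            td_slice;    /* ticks of quantum remaining */
    int            td_cpu;      /* cpu this thread is bound to */
};
struct proc { struct thread *p_threads; };
int runnable(struct thread *);

/*
 * When a thread blocks with quantum left, loan the remainder to a
 * runnable sibling bound to the same cpu and switch straight to it,
 * skipping the normal run-queue decision.
 */
struct thread *
slice_loan(struct thread *td)
{
    struct thread *s;

    for (s = td->td_proc->p_threads; s != NULL; s = s->td_next) {
        if (s == td || s->td_cpu != td->td_cpu || !runnable(s))
            continue;
        s->td_slice += td->td_slice;    /* inherit the leftover */
        td->td_slice = 0;
        return (s);                     /* switch to this one */
    }
    return (NULL);      /* nobody suitable; pick off the run queue */
}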

>
> > I feel that this is an overwhelming amount of complexity.  Because of this
> > it will be buggy.  Sun claims that they still have open tickets on their
> > M:N while their new 1:1 implementation is totally bug free.  How long have
> > they been doing m:n?  I don't think that with our limited resources we're
> > going to be able to do better.
>
> You can't schedule resources.  They will work on what they want
> to, and let anything they don't like just sit there and "rot".
>
> The Sun claims are really specious, IMO.  They have it working,
> but how does it compare to, say multiple processes that are
> sharing descriptor tables, and not much else, in a work-to-do
> model?
>
> I can tell you from personal experience with such a model, that
> it *VASTLY* outperforms a 1:1 kernel threading model, even if you
> end up running multiple state-machine instances on multiple CPUs.
> We got more than a 120X increase in NetWare for UNIX, simply by
> changing the client dispatch streams MUX to dispatch to worker
> processes instead of threads, in LIFO instead of FIFO order,
> simply because it ensured that the process pages you cared about
> were more likely to be in core.

Yeah, the LIFO trick is widely used.  I believe Apache does something of
this sort, and it's also discussed on the c10k problem page.  I'm not
sure why you got better performance out of processes than threads,
though; that is sort of confusing.
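
The LIFO part itself is simple enough.  A sketch (assuming a
single-threaded dispatcher, so no locking shown):

struct worker {
    struct worker *next;
    /* per-worker state ... */
};

static struct worker *idle_stack;   /* a stack, not a tail queue */

void
park(struct worker *w)              /* worker finished its request */
{
    w->next = idle_stack;
    idle_stack = w;                 /* push */
}

struct worker *
dispatch(void)                      /* new request arrived */
{
    struct worker *w = idle_stack;

    if (w != NULL)
        idle_stack = w->next;       /* pop the most recently parked,
                                       hence cache-warmest, worker */
    return (w);                     /* NULL: spawn one or queue it */
}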

> 1:1 threading is useful for one thing, and one thing only: SMP
> scalability of single image processes.  And it's not the best at
> doing that.

It's also good at providing extra contexts for IO worker threads to block
on.  Furthermore, it's really good at being implemented quickly, which is
especially important considering that it's 2003 and we still don't have
kernel-supported threads...

> > Furthermore, m:n's basic advantage is less overhead from staying out of
> > the kernel.
>
> No, actually, it's the ability to fully utilize a quantum, and
> to not have to make a decision between one of your own threads
> and some other process, off the run queue, when making a decision
> in the scheduler about what to run next.

Yeah, I just remembered this bit.  See my answer above.  I think I'll do
this trick in ULE.

> If you favor the threads in your own process, then you potentially
> starve other processes.
>
> If you favor neither, and treat them like processes, you get none
> of the supposed context switch benefits that were supposedly going
> to result from using threads instead of processes in the first place.
>
>
> > First, if your application has more threads than cpus it is written
> > incorrectly.
>
> This depends on what those threads are doing.  If they are all doing
> the same work, then yes, you are right.  If they are doing different
> jobs, then you are wrong; even if most of them are doing the same job,
> and a few of them are doing different jobs, you are still wrong, since
> job-loading is unlikely to be the same between threads.
>
>
> > For people who are doing thread pools instead of event driven IO
> > models they will encounter the same overhead with M:N as 1:1.
>
> This is actually false.  In 1:1, your thread competes with all
> other processes, in order to be the next at the top of the run
> queue.  Statistically, you are doing more TLB flushes and shootdowns,
> and more L1 and L2 cache shootdowns, than you would otherwise.

This is the same argument about using your whole slice, eh?

> Solving this problem without intentional scheduling has been
> proven to be NP-complete: it is not a problem which is
> solvable in polynomial time.

Eh?  Which problem is NP-complete?

>
> > I'm not sure what applications are entirely compute and have more threads
> > than cpus.  These are the only ones which really theoretically benefit.  I
> > don't think our threading model should be designed to optimize poorly
> > thought out applications.
>
> By that argument, threads should not be supported at all... 8-) 8-).

I meant to say 'entirely compute bound'.  If you just want CPU and no IO
then you probably only want as many threads as processors; this is the
most efficient arrangement.  I'm not arguing against threads, although I
do think they are often abused.
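
For the compute-bound case, sizing the pool is trivial (a portable
userland sketch using sysconf; build with -lpthread):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *
crunch(void *arg)
{
    /* compute-bound work, partitioned by (long)arg ... */
    return (NULL);
}

int
main(void)
{
    long i, ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    pthread_t *tids = malloc(ncpu * sizeof(*tids));

    if (ncpu < 1 || tids == NULL)
        return (1);
    for (i = 0; i < ncpu; i++)      /* one thread per processor */
        pthread_create(&tids[i], NULL, crunch, (void *)i);
    for (i = 0; i < ncpu; i++)
        pthread_join(tids[i], NULL);
    printf("ran %ld compute threads on %ld cpus\n", ncpu, ncpu);
    free(tids);
    return (0);
}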

>
> > This means that the constraints are different from when
> > this architecture started to come about many (10 or so?) years ago.
> > Trying to optimize context switches between threads just doesn't make
> > sense when you do so much work per slice.
>
> 5/6 years ago, depending on who you ask.
>
> But by your same arguments, CPU clock multipliers have grown
> to the point that memory bus and I/O bus stalls are so
> expensive that SMP makes no sense.

I might agree with you there.

>
> > Then if you look at the number of system calls and shenanigans a UTS must
> > do to make proper scheduling decisions it doesn't look like such an
> > advantage.
>
> I agree with this one; my original model avoided the problem
> entirely by making the POSIX blocking call behaviour a library
> on top of an async kernel interface.  By doing this, kernel
> boundary crossings could be minimized automatically.
>
> The pthreads code as it has existed so far, also does a lot of
> unnecessary kernel boundary crossings in order to handle signal
> masking.  In fact, you could establish an intermediate handler
> for all signals at the user threads scheduler level, and never
> have to worry about most of that crap.
>
> I think the kernel boundary crossing overhead, and the fact
> that, in doing so, you tend to relinquish a significant
> fraction of remaining quantum (by your own arguments) says
> that protection domain crossings are to be avoided at all costs.

Yes, I agree, and without serious tweaking our current M:N significantly
increases the number of system calls.

>
> > In short, even if it is marginally faster, it doesn't seem like it is
> > worth the effort and risk.  I don't want to discourage you from trying but
> > this is why I stopped working on KSE proper and pursued the 1:1 model.
>
> I'm glad you pursued it, even though I do not agree with your
> reasoning on the value of N:M vs. 1:1.  I view it as "life
> insurance" for the KSE code, which some people might be
> otherwise tempted to rip out over some arbitrary deadline.
>
> Thank you for your work here, and thank everyone else for
> their work, too.
>
> -- Terry
>

Thanks for the feedback.  It has been stimulating.  I still need to
consider the multithreading implications of 1:1 for ULE.  This has given
me a bit more to work on there.

Cheers,
Jeff


