Strawman proposal: making libthr default thread implementation?

Wed Jul 5 01:19:08 UTC 2006

On Tuesday 04 July 2006 12:41 pm, Julian Elischer wrote:
> David Xu wrote:
> >On Tuesday 04 July 2006 21:08, Daniel Eischen wrote:
> >>The question was what does libthr lack.  The answer is priority
> >>inheritance & protect mutexes, and also SCHED_FIFO, SCHED_RR, and
> >>(in the future) SCHED_SPORADIC scheduling.  That is what I stated
> >>earlier in this thread.
> >
> >As other people said, we need performance.  These features, as you
> >said, can come in the future, but I don't think they are more
> >important than the performance problem.  You have to tell people
> >what they should do when they bought two CPUs but it works like
> >they only have one.  As the major author of libpthread, you decided
> >in the past to keep silent and ignore that requirement.  Also, the
> >signal queue may not work reliably with libpthread; this nightmare
> >appears again.
>
> As much as it pains me to say it, we could do with looking at using
> the simpler mode of 1:1
> as the default. M:N does work but it turns out that many of the
> promised advantages turn out to be
> phantoms due to the complexities of actually implementing it.

At BSDCan, I tinkered with a checkout of the cvs tree, to see what the 
kernel side of things would look like if M:N support came out.  The 
result is an amazing code clarity improvement, and it makes a bunch of 
other optimizations much easier to do.  For example, it becomes easy 
to reduce the C code executed between an interrupt and ithread 
dispatch by about 75%.  This simplification enabled Kip to 
do a bunch of scalability work as well (per-cpu scheduling locks, 
per-cpu process lists, etc).

However, my objectives there were quite different to what Robert has 
raised.  My objectives were a 'what if?'.  People have complained in 
the past that the complexity that KSE adds to the kernel context 
switching code gets in the way of other optimizations that they'd like 
to try, so I figured that this would be a good way to call them on that 
and see if it really does help or not.  I was hoping to be able to 
present a list of things that we'd gain as a result, but unfortunately 
the cat is out of the bag a bit earlier than I'd have liked.  I never 
really intended to bring it up until there was something to show for 
it.  I know Kip has done some amazing work already but I was hoping for 
other things as well before going public.

FWIW, my skunkworks project is in perforce:
//depot/projects/bike_sched/...
and there is a live diff:  
http://people.freebsd.org/~peter/bike_sched.diff
(Yes, the name was picked long before this thread started)

It does NOT have any of Kip's optimization work in it.  It was just 
meant as a baseline for other people to experiment with.  I've tested 
it with 4bsd as the scheduler.  ULE might work, but I have not tried 
it.  SCHED_CORE will not compile in that tree because I haven't yet 
gone over the diffs from David Xu.  I run this code on my laptop with 
libmap.conf redirecting libpthread to libthr.  It works very well for 
me, even with threaded apps like firefox.

Anyway, back to the subject at hand.  The basic problem with the KSE/SA 
model as I see it (besides the kernel code complexity) is that it 
doesn't really seem to suit the kind of threaded applications that 
people seem to want to run on unix boxes.

In a traditional 1:1 threading system, eg: linuxthreads/nptl, libthr, 
etc, mutex blocking is expensive, but making system calls and blocking 
in kernel mode cost the same as they do for a regular process.

Because Linux was the most widely and massively deployed threading 
system out there, people tended to write (or modify) their applications 
to work best with those assumptions.  ie: keep pthread mutex blocking 
to an absolute minimum, and not care about kernel blocking.

However, with the SA/KSE model, our tradeoffs are different.  We 
implement pthread mutex blocking more quickly (except for UTS bugs that 
can make it far slower), but we make blocking in kernel context 
significantly higher cost than the 1:1 case, probably as much as double 
the cost. For applications that block in the kernel a lot instead of on 
mutexes, this is a big source of pain.
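
To make the tradeoff concrete, here is a toy cost model.  It is only a 
sketch: the unit costs are invented for illustration and are not 
measurements.  The one ratio taken from the estimate above is that 
kernel blocking under SA/KSE costs roughly double the 1:1 case.  The 
names (BlockingCosts, total_blocking_cost, ONE_TO_ONE, KSE) are 
hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockingCosts:
    """Relative cost of one blocking event, in arbitrary units."""
    mutex_block: float   # one contended pthread mutex block
    kernel_block: float  # one block inside a system call


def total_blocking_cost(costs, mutex_blocks, kernel_blocks):
    """Total blocking overhead for a workload under a given model."""
    return (costs.mutex_block * mutex_blocks
            + costs.kernel_block * kernel_blocks)


# ASSUMED numbers, not benchmarks.  Kernel blocking in the 1:1 model is
# the baseline (1.0); SA/KSE is taken as double that.  The mutex costs
# are made up to reflect "expensive in 1:1, cheap in SA/KSE".
ONE_TO_ONE = BlockingCosts(mutex_block=4.0, kernel_block=1.0)
KSE = BlockingCosts(mutex_block=1.0, kernel_block=2.0)

if __name__ == "__main__":
    # A server-style workload: few mutex blocks, many kernel blocks.
    print("server  1:1:", total_blocking_cost(ONE_TO_ONE, 100, 10000))
    print("server  KSE:", total_blocking_cost(KSE, 100, 10000))
    # A compute-style workload: many mutex blocks, few kernel blocks.
    print("compute 1:1:", total_blocking_cost(ONE_TO_ONE, 10000, 100))
    print("compute KSE:", total_blocking_cost(KSE, 10000, 100))
```

With these assumed numbers, the workload dominated by kernel blocking 
comes out worse under SA/KSE, and the mutex-heavy one comes out worse 
under 1:1.  Which side a real application lands on depends on its 
actual mix and on real costs, which this model does not claim to know.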

Since most of the applications that we're called to run are written 
with the Linux behavior in mind, we're usually the ones that come off 
worse when our performance is compared against Linux.

I'm sure that there are threaded applications that benefit from cheap 
mutex operations, but I'm not personally aware of them.  I do know that 
the ones that we regularly get compared to Linux on are the likes of 
mysql, squid and threaded http servers.  All of those depend on kernel 
blocking being as fast as possible.  Faster mutexes don't seem to 
compensate for the extra costs of kernel blocking.  I don't know where 
java fits into this picture.

We've proven that we can make KSE work, but it was far harder than we 
imagined, and unfortunately, the real-world apps that matter the most 
just don't seem to take advantage of it.  Not to mention the complexity 
that we have to work around for scalability work.

Speaking of scalability, 16- and 32-way systems are here already and 
will be common within 7.0's lifetime.  If we don't scale, we're sunk.  
My gut tells me that we HAVE to address the complexity that the KSE 
kernel code adds in order to improve this.  We can barely work well on 
4-cpu systems, let alone 32-cpu systems.

PS: I think it would be interesting to see a hybrid user level M:N 
system.  It could be as simple as multiplexing user threads onto a 
group of kernel threads (without M:N kernel support) and using libc_r 
style syscall wrappers to intercept long-term blockable operations 
like socket/pipe IO etc.  For short term blocking (disk IO), just wear 
the cost of letting one thread block for a moment.  I suspect that 
large parts of libpthread could be reused and some bits brought back 
from libc_r.  I think this would do a fairly decent job for things like 
computational threaded apps because mutexes would be really fast.
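
The multiplexing part of that idea can be sketched in a few lines.  
This is an illustration of the structure only, under loud assumptions: 
user threads are modelled as Python generators, each "yield" stands in 
for the point where a libc_r style wrapper would park the thread 
instead of blocking, and run_m_on_n and counter are hypothetical names. 
It is not how libpthread or libc_r are implemented, and it says 
nothing about performance.

```python
import queue
import threading


def run_m_on_n(user_threads, n_kernel_threads):
    """Multiplex M generator-based user threads onto N OS threads."""
    run_queue = queue.Queue()
    for ut in user_threads:
        run_queue.put(ut)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            ut = run_queue.get()
            if ut is None:          # shutdown sentinel
                run_queue.task_done()
                return
            try:
                next(ut)            # run the user thread to its next yield
            except StopIteration as done:
                with results_lock:
                    results.append(done.value)
            else:
                # Requeue BEFORE task_done so the unfinished count never
                # drops to zero while a user thread is still runnable.
                run_queue.put(ut)
            run_queue.task_done()

    workers = [threading.Thread(target=worker)
               for _ in range(n_kernel_threads)]
    for w in workers:
        w.start()
    run_queue.join()                # every user thread has finished
    for _ in workers:
        run_queue.put(None)
    for w in workers:
        w.join()
    return results


def counter(name, steps):
    """Example user thread: sums 0..steps-1, yielding after each step."""
    total = 0
    for i in range(steps):
        total += i
        yield                       # stand-in for a wrapped blocking call
    return (name, total)
```

For example, run_m_on_n([counter("a", 3), counter("b", 5)], 2) runs two 
user threads on two kernel threads and returns their results in 
whatever order they finish.  A real version would need the wrappers to 
hand blocked threads to a poll/select loop rather than requeueing them 
immediately.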

PPS: My opinions are not meant as a criticism of the massive amount of 
work that has gone into making KSE work.  It is more an attempt to step 
back and take an objective look at the ever-changing big picture.
-- 
Peter Wemm - peter at wemm.org; peter at FreeBSD.org; peter at yahoo-inc.com
"All of this is for nothing if we don't go to the stars" - JMS/B5