Strawman proposal: making libthr default thread implementation?
rwatson at FreeBSD.org
Wed Jul 5 08:48:18 UTC 2006
On Tue, 4 Jul 2006, Peter Wemm wrote:
> Because Linux was the most widely and massively deployed threading system
> out there, people tended to write (or modify) their applications to work
> best with those assumptions. ie: keep pthread mutex blocking to an absolute
> minimum, and not care about kernel blocking.
> However, with the SA/KSE model, our tradeoffs are different. We implement
> pthread mutex blocking more quickly (except for UTS bugs that can make it
> far slower), but we make blocking in kernel context significantly higher
> cost than the 1:1 case, probably as much as double the cost. For
> applications that block in the kernel a lot instead of on mutexes, this is a
> big source of pain.
> When most of the applications that we're called to run are written with the
> linux behavior in mind, when our performance is compared against linux we're
> the ones that usually come off the worst.
The problem I've been running into is similar but different. The reason for
my asking about libthr being the default is that, in practice, our performance
optimization advice for a host of threaded applications has been "Switch to
libthr". This causes quite a bit of complexity from a network stack
optimization perspective, because the behavior of threading in threaded
network/IPC applications changes enormously if the threading model is changed.
As a result, the optimization strategies differ greatly. To motivate this,
let me give you an example.
Widely distributed MySQL benchmarks are basically kernel IPC benchmarks, and
on multi-processor systems, this means they basically benchmark context
switch, scheduling, network stack overhead, and network stack parallelism.
However, the locking hot spots differ significantly based on the threading
model used. There are two easily identified reasons for this:
- Libpthread "rate limits" threads entering the kernel in the run/running
state, resulting in less contention on per-process sleep mutexes.
- Libthr has greater locality of behavior, in that thread activities map
more directly onto kernel-visible threads.
Consider the case of an application that makes frequent short accesses to file
descriptors -- for example, by sending lots of short I/Os on a set of UNIX
domain sockets from various worker threads, each performing transactions on
behalf of a client via IPC. This is, FYI, a widely deployed programming
approach, and is not limited to MySQL. The various user threads will be
constantly looking up file descriptor numbers in the file descriptor array;
often, the same thread will look up the same number several times (accept,
i/o, i/o, i/o, ..., close). This results in very high contention on the file
descriptor array mutex, even though individual uses are short.
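The access pattern described above can be sketched roughly as follows. This is only an illustration, not code from MySQL or from netrate: the names `struct worker`, `worker_main`, `run_workers` and `MAX_WORKERS` are hypothetical, and each worker talks to itself over its own socketpair so the example stays self-contained. The point is that every `write()` and `read()` names the descriptor by number, so each call repeats a lookup in the one descriptor table shared by all threads of the process.

```c
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_WORKERS 64          /* hypothetical limit for this sketch */

struct worker {
    pthread_t tid;
    int sv[2];                  /* UNIX domain socket pair used by this worker */
    int ntrans;                 /* transactions to attempt */
    int done;                   /* transactions completed */
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    char c = 'x';

    for (int i = 0; i < w->ntrans; i++) {
        /*
         * Two short I/Os per transaction.  Each system call passes a
         * descriptor number, so the kernel repeats the lookup in the
         * process-wide descriptor table every time.
         */
        if (write(w->sv[0], &c, 1) != 1)
            break;
        if (read(w->sv[1], &c, 1) != 1)
            break;
        w->done++;
    }
    return NULL;
}

/* Run nworkers threads, ntrans transactions each; return the total completed. */
long run_workers(int nworkers, int ntrans)
{
    struct worker workers[MAX_WORKERS];
    long total = 0;
    int started = 0;

    if (nworkers < 1 || nworkers > MAX_WORKERS)
        return -1;

    for (int i = 0; i < nworkers; i++) {
        struct worker *w = &workers[i];

        w->ntrans = ntrans;
        w->done = 0;
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, w->sv) != 0)
            break;
        if (pthread_create(&w->tid, NULL, worker_main, w) != 0) {
            close(w->sv[0]);
            close(w->sv[1]);
            break;
        }
        started++;
    }

    for (int i = 0; i < started; i++) {
        pthread_join(workers[i].tid, NULL);
        close(workers[i].sv[0]);
        close(workers[i].sv[1]);
        total += workers[i].done;
    }
    return total;
}
```

Calling `run_workers(4, 100)` starts four workers doing one hundred transactions each. The sketch does not measure contention itself; it only reproduces the shape of the workload, which could then be run under either threading library.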
In practice, libpthread sees somewhat lower contention, because in the
presence of adaptive mutexes, kernel threads spin rather than blocking,
causing libpthread to not push further threads in to contend on the lock.
However, one of the more interesting optimizations to explore involves
"loaning" file descriptors to threads, in order to take advantage of locality
of reference, where repeated access to the same fd is cheaper, but revocation
of the loan for use by another thread is more expensive. In libthr, we have
lots of locality of reference, because user threads map 1:1 to kernel threads;
in libpthread, this is not the case, as user threads float across pthreads,
and even if they do get mapped to the same kernel thread repeatedly, their
execution in the presence of blocking is discontinuous in the same kernel
This makes things tricky for someone working on reducing contention in the
kernel as the number of threads increases: do I optimize for libpthread, which
offers little or no locality of reference with respect to mapping user thread
behavior to kernel threads, or do I optimize for libthr, which offers high
locality of reference?
Since our stock advice is to run libthr for high performance applications, the
design choice should be clear: I should optimize for libthr. However, in
doing so, I would likely heavily pessimize libpthread performance, as I would
basically guarantee that heuristics based on user thread locality would fail
with moderate frequency, as the per-kernel thread working set for kernel
objects is significantly greater.
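To make the trade-off behind "loaning" concrete, here is a toy user-space model. It is not the kernel design, and everything in it is an assumption made for illustration: the names `loan_table`, `loan_access` and `model_cost`, the cost units `COST_HIT` and `COST_REVOKE`, and the choice to charge a first-time loan the same as a revocation. The "floating" case also assumes the worst, namely that a user thread lands on a different kernel thread every round.

```c
#define NFDS        8    /* descriptors tracked by the model (arbitrary) */
#define NO_OWNER    (-1)
#define COST_HIT    1    /* arbitrary unit: access by the current loan holder */
#define COST_REVOKE 10   /* arbitrary unit: loan granted or taken from another thread */

struct loan_table {
    int owner[NFDS];     /* kernel thread currently holding each descriptor */
};

void loan_init(struct loan_table *t)
{
    for (int i = 0; i < NFDS; i++)
        t->owner[i] = NO_OWNER;
}

/* Return the modeled cost of kernel thread kthread accessing descriptor fd. */
int loan_access(struct loan_table *t, int kthread, int fd)
{
    if (fd < 0 || fd >= NFDS)
        return -1;
    if (t->owner[fd] == kthread)
        return COST_HIT;         /* locality pays off */
    t->owner[fd] = kthread;      /* move the loan to the new holder */
    return COST_REVOKE;
}

/*
 * User thread u always works on descriptor u.  With pinned != 0 it always
 * runs on kernel thread u (the 1:1 case).  With pinned == 0 it runs on
 * kernel thread (u + round) % nthreads, a deliberately pessimistic
 * stand-in for user threads floating across kernel threads.
 */
long model_cost(int nthreads, int rounds, int pinned)
{
    struct loan_table t;
    long total = 0;

    if (nthreads < 1 || nthreads > NFDS)
        return -1;
    loan_init(&t);
    for (int r = 0; r < rounds; r++) {
        for (int u = 0; u < nthreads; u++) {
            int kthread = pinned ? u : (u + r) % nthreads;

            total += loan_access(&t, kthread, u);
        }
    }
    return total;
}
```

With 4 threads and 5 rounds, the pinned mapping costs 56 units in this model (one loan per descriptor, then cheap hits), while the rotating mapping costs 200 because every access moves a loan. The numbers mean nothing in absolute terms; they only show why a heuristic tuned for one mapping can hurt the other.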
FWIW, you can quite clearly measure the difference in file descriptor array
lock contention using the http/httpd micro-benchmarks in
src/tools/tools/netrate. If you run without threading, performance is better,
in significant part because there is much less contention. This is an
interesting, and apparently counter-intuitive, observation: many people believe
that the reduced context switch cost and greater cache locality of threaded
applications always result in improved performance. This is not true for a
number of important workloads -- by operating with more shared data
structures, contention on those shared data structures is increased, reducing
performance. Moving to the two threading models, you see markedly better
libpthread performance under extremely high load involving many threads with
small transactions, as libpthread provides heuristically better management of
kernel load. This advantage does not carry over to real-world application
loads, however, which tend to use smaller worker thread pools with sequences
of locality-rich transactions, which is why libthr performs better as the
workload approaches real-world conditions. This micro-benchmark makes for
quite an interesting study piece, as you can easily vary the thread/proc
model, the number of workers, and the transaction size, giving pretty clear
performance curves to compare.
Anyhow, my main point in raising this thread was actually focused entirely on
the initial observation, which is that in practice, we find ourselves telling
people who care about performance to use libthr. If our advice is always "use
libthr instead of the default", that suggests we have a problem with the
default. Switching the default requires an informed decision: what do we
lose, not just what do we gain. Dan has now answered this question -- we lose
support for a number of realtime scheduling primitives if we switch today
without further work.
I think the discussion of the future of M:N support is also critical, though,
as it has an immediate impact on kernel optimization strategies, especially as
the number of CPUs grows. In case anyone failed to notice, it's now possible to
buy hardware with 32 "threads" for <$10,000, and the future appears relatively
clear -- parallelism isn't just for high-end servers, it now appears in
off-the-shelf notebook hardware, and appears to be the way that vendors are
going to continue to improve performance. Having spent the last five years
working on threading and SMP, we're well-placed to support this
hardware, but it requires us to start consolidating our gains now, which means
deciding what the baseline is for optimization when it comes to threaded
Robert N M Watson
University of Cambridge