FreeBSD 5.3 Bridge performance take II
Matthew Dillon
dillon at apollo.backplane.com
Wed Sep 8 21:45:28 PDT 2004
:In the rwatson_umaperthread branch, what I've done is started to associate
:struct uma_cache structures with threads. Since caches are "per-zone", I
:allow threads to register for zones of interest; these caches are hung off
:of struct thread, and must be explicitly registered and released. While
:..
:
:In practice, this eliminates mutex acquisition for mbuf allocation and
:free in the forwarding and bridging paths, and halves the number of
:operations when interacting with user threads (as they don't have the
:..
:
:My interest in looking at per-thread caches was to explore ways in which
:to reduce the cost of zone allocation without making modifications to our
:synchronization model. It has been proposed that a better way to achieve
:...
:
:Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
:robert at fledge.watson.org Principal Research Scientist, McAfee Research
I would recommend against per-thread caches. Instead, make the per-cpu
caches actually *be* per-cpu (that is, not require a mutex). This is
what I do in DragonFly's Slab allocator. For the life of me I just don't
understand why one would spend so much effort creating a per-cpu caching
subsystem and then slap a mutex right smack in the middle of the
critical allocation and deallocation paths. Non-critical operations,
such as high-level zone management, can be done passively (in DragonFly's
case through IPI messaging which, when I get to it, can be queued
passively rather than actively), or by a helper thread which migrates
to the cpu whose cache it needs to operate on, does its stuff, then
migrates to the next cpu, or by any number of other clever mechanisms,
none of which require a brute-force mutex to access the data.
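To make the idea concrete, here is a minimal sketch of a mutex-free per-cpu
allocation fast path of the kind described above. The names (percpu_cache,
crit_enter, my_cpuid, NCPU) are illustrative, not the real DragonFly or
FreeBSD API, and a single "current cpu" variable stands in for real per-cpu
context:

```c
/* Sketch only: a critical section (preemption disabled) protects the
 * per-cpu freelist instead of a mutex, because only the owning cpu
 * ever touches it.  All names here are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NCPU 4

struct freeobj { struct freeobj *next; };

struct percpu_cache {
    struct freeobj *freelist;      /* touched only by its own cpu */
    int count;
};

static struct percpu_cache cache[NCPU];
static int curcpu = 0;             /* stand-in for the running cpu */

static int  my_cpuid(void)   { return curcpu; }
static void crit_enter(void) { /* disable preemption in a real kernel */ }
static void crit_exit(void)  { /* re-enable preemption */ }

/* Fast path: no mutex.  The critical section pins us to this cpu,
 * so its cache is ours alone for the duration.  'size' must be at
 * least sizeof(struct freeobj) for the freelist linkage to fit. */
void *cache_alloc(size_t size)
{
    crit_enter();
    struct percpu_cache *pc = &cache[my_cpuid()];
    void *obj;
    if (pc->freelist != NULL) {
        obj = pc->freelist;
        pc->freelist = pc->freelist->next;
        pc->count--;
    } else {
        obj = malloc(size);        /* slow path: back-end zone allocation */
    }
    crit_exit();
    return obj;
}

void cache_free(void *obj)
{
    crit_enter();
    struct percpu_cache *pc = &cache[my_cpuid()];
    struct freeobj *f = obj;
    f->next = pc->freelist;
    pc->freelist = f;
    pc->count++;
    crit_exit();
}
```

The point is that the common alloc/free cycle touches only cpu-local data;
the mutex that would otherwise sit in the middle of the hot path simply
disappears.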
I use this cpu migration trick for a number of things in DragonFly.
Jeff and I use it for wildcard pcb registration (which is replicated
across cpus). The thread list sysctl code collects per-cpu thread
data by iterating through the cpus (migrating the thread to each cpu
to collect the data and then ending up on the cpu it began on before
returning to user mode). Basically, any non-critical-path operation
can use this trick in order to allow the real critical path -- the
actual packet traffic -- to operate without mutexes.
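The migration trick can be sketched in a few lines. Here a collector hops
from cpu to cpu, reading each cpu's private counter while running on that
cpu, then ends up back where it started. The scheduler call (lwkt_setcpu)
and the per-cpu counters are illustrative stand-ins, not the real DragonFly
API; migration is simulated by reassigning a global "current cpu":

```c
/* Sketch only: per-cpu data is read without any locks because the
 * reader reschedules itself onto each cpu in turn, so it can never
 * race that cpu's writer.  Names are hypothetical. */
#include <assert.h>

#define NCPU 4

static int pkt_count[NCPU] = { 10, 20, 30, 40 };  /* per-cpu, lock-free */
static int curcpu = 0;

/* Stand-in for a scheduler call that migrates the thread to 'cpu'. */
static void lwkt_setcpu(int cpu) { curcpu = cpu; }

int collect_total(void)
{
    int origin = curcpu;
    int total = 0;
    for (int cpu = 0; cpu < NCPU; cpu++) {
        lwkt_setcpu(cpu);          /* now this cpu's writer can't race us */
        total += pkt_count[curcpu];
    }
    lwkt_setcpu(origin);           /* end up on the cpu we began on */
    return total;
}
```

This is slow compared to a direct read, but that is exactly the point:
the slow, rare operation pays the migration cost so the per-packet hot
path never pays for a mutex.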
So, instead of adding more hacks, please just *fix* the slab allocator
in FreeBSD-5. You will find that a lot of things you were contemplating
writing additional subsystems for will suddenly work (and work very
efficiently) by just calling the slab allocator directly.
The problem with per-thread caching is that you greatly increase the
amount of waste in the system. If you have 50 threads each with their
own per-thread cache and a hysteresis of, say, 32 allocations, you
wind up with 50*32 = 1600 allocations worth of potential waste. With
a per-cpu case the slop is a lot more deterministic (since the number of
cpus is a fixed, known quantity). Another problem with per-thread
caching is that it greatly reduces performance in certain common
allocation cases... in particular the case where data is allocated by
one subsystem (say, an interrupt thread), and freed by another subsystem
(say, a protocol thread or other consumer). This sort of problem is
a lot easier to fix with a per-cpu cache organization and a lot harder
to fix with a per-thread cache organization.
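One way to see why the cross-subsystem case is tractable per-cpu: an object
can remember its home cpu, and a free issued from the "wrong" cpu is queued
back to the owner (standing in for passive IPI-style messaging) instead of
taking a mutex. With per-thread caches there is no stable "owner" to queue
to, since threads come and go. This is a hypothetical sketch, not the real
allocator's code:

```c
/* Sketch only: local frees go straight on the owner's freelist; remote
 * frees are queued to the owning cpu, which drains them at leisure.
 * In a real kernel the queue append and drain would be done via
 * per-cpu message queues or IPIs; all names are illustrative. */
#include <assert.h>
#include <stddef.h>

#define NCPU 2

struct obj {
    int home_cpu;                  /* cpu whose cache owns this object */
    struct obj *next;
};

static struct obj *freelist[NCPU]; /* touched only by the owning cpu */
static struct obj *remote_q[NCPU]; /* filled by other cpus, drained by owner */

/* Free from any cpu: no mutex in either case. */
void obj_free(struct obj *o, int from_cpu)
{
    if (o->home_cpu == from_cpu) {
        o->next = freelist[o->home_cpu];
        freelist[o->home_cpu] = o;
    } else {                       /* cross-cpu free: hand it back */
        o->next = remote_q[o->home_cpu];
        remote_q[o->home_cpu] = o;
    }
}

/* Runs on the owning cpu: fold queued remote frees into the freelist. */
void drain_remote(int cpu)
{
    while (remote_q[cpu] != NULL) {
        struct obj *o = remote_q[cpu];
        remote_q[cpu] = o->next;
        o->next = freelist[cpu];
        freelist[cpu] = o;
    }
}
```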
-Matt
Matthew Dillon
<dillon at backplane.com>
More information about the freebsd-current mailing list