FreeBSD 5.3 Bridge performance take II
dillon at apollo.backplane.com
Thu Sep 9 01:55:02 PDT 2004
:% One nice thing about using this experimental code is that I hope it will
:% allow us to reason more effectively about the extent to which improving
:% per-cpu data structures improves efficiency -- I can now much more
:% easily say "OK, what happens if I eliminate the cost of locking for common
:% place mbuf allocation/free". I've also started looking at per-interface
:% caches based on the same model, which has some similar limitations (but
:% also some similar benefits), such as stuffing per-interface uma caches
:% in struct ifnet.
:I.e., using per-thread UMA caches is a 30-60 minute hack that allows me to
:explore and measure the performance benefits (and costs) of several
:different approaches, including per-cpu, per-thread, and per-data
:structure/object caching without doing the full implementation up front.
:Per-thread caching, for example, can simulate the effects of
:non-preemption and mutex avoidance in micro-benchmarking, although in the
:general case under macro-benchmark perspective it suffers from a number of
:problems (including the draining, balancing, and extra storage cost
:issues). I didn't attempt to address these problems under the assumption
:that the current implementation is a tool for exploring performance, not
:something to actually use.
Well, I see some major problems with this avenue of development.
I see this as an end-run around existing, broken (or perceived to be
If you don't believe that the slab allocator has severe performance
issues then the slab allocator (aka malloc()) should simply be used
directly. If you do believe that the slab allocator has severe performance
issues then the correct solution is to FIRST FIX THE SLAB ALLOCATOR.
Until the slab allocator is fixed the system-wide overhead will skew
the results from any other optimization tests you try to make. i.e.
results from other optimizations may appear to be less effective simply
due to being washed out by the slab allocator's overhead.
In the same respect, the idea that a per-thread memory cache is going to
be more efficient than a per-cpu memory cache implies that it is too
expensive to implement the locking required to implement a per-cpu
memory cache vs a per-thread memory cache. If that is the implication,
the solution is to fix the required locking. Frankly it should be no
more expensive than a critical section, and a critical section to
access per-cpu data should be no more than a nanosecond or two more
expensive than access to a per-thread data structure.
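As a rough illustration of the point above, here is a minimal user-space
sketch of a per-cpu object cache whose only protection is a critical
section. Everything in it is an assumption made for illustration: the
names (pcpu_alloc, pcpu_free, crit_enter_model), the constants, and the
"critical section", which is modeled as a bare nesting counter. In a real
kernel the enter/exit pair would hold off preemption on the current cpu;
this is not FreeBSD's or DragonFly's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NCPU        2    /* hypothetical cpu count for this model */
#define CACHE_DEPTH 8    /* hypothetical per-cpu cache depth */
#define OBJ_SIZE    256  /* hypothetical fixed object size */

struct pcpu_cache {
    void *objs[CACHE_DEPTH];  /* stack of cached free objects */
    int   count;              /* number of valid entries in objs[] */
    int   crit_nest;          /* models critical-section nesting */
};

static struct pcpu_cache pcpu[NCPU];

/* Model only: a kernel version would block preemption here, not count. */
static void crit_enter_model(struct pcpu_cache *pc) { pc->crit_nest++; }

static void crit_exit_model(struct pcpu_cache *pc)
{
    assert(pc->crit_nest > 0);
    pc->crit_nest--;
}

/* Allocate from the given cpu's cache; fall back to malloc() on a miss. */
void *pcpu_alloc(int cpu)
{
    struct pcpu_cache *pc = &pcpu[cpu];
    void *obj = NULL;

    crit_enter_model(pc);             /* no mutex, no atomic ops */
    if (pc->count > 0)
        obj = pc->objs[--pc->count];
    crit_exit_model(pc);

    return obj != NULL ? obj : malloc(OBJ_SIZE);
}

/* Return an object to the given cpu's cache; free() it if the cache is full. */
void pcpu_free(int cpu, void *obj)
{
    struct pcpu_cache *pc = &pcpu[cpu];
    int cached = 0;

    crit_enter_model(pc);
    if (pc->count < CACHE_DEPTH) {
        pc->objs[pc->count++] = obj;
        cached = 1;
    }
    crit_exit_model(pc);

    if (!cached)
        free(obj);
}

/* Number of objects currently cached for the given cpu. */
int pcpu_cached(int cpu) { return pcpu[cpu].count; }
```

The fast path touches only the current cpu's structure between the
enter/exit pair, which is why its cost should be close to that of a
per-thread structure.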
Per-thread caching APIs have major design hurdles to overcome. I've
already listed a few of them, but there are many, many more. For example,
locality of reference may seem to be a slam dunk but you actually get
better locality of reference with a per-cpu cache, especially when a
thread migrates between cpus, but also because the cache is able to
take advantage of and reuse a very recently used chunk of data
that might have been freed by some other thread on the same cpu... so
you get it even after a context switch (and you don't get that with a
per-thread cache). This is a case that can occur quite often, especially
when interrupt threads are shipping data to protocol threads. I could
go on and on, but I am not going to because I think it damn well ought
to be obvious.
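The interrupt-thread to protocol-thread case can be sketched with a small
counting model. This is a hypothetical simulation, not a measurement: it
assumes two threads that stay on one cpu, one object per round, and
unbounded caches. The function name simulate_handoff and the struct
fields are invented for the example. One thread allocates an object each
round and the other frees it.

```c
#include <assert.h>

enum { INTR_THREAD = 0, PROTO_THREAD = 1, NTHREADS = 2 };

struct sim_result {
    int hits;      /* allocations satisfied from a cache */
    int misses;    /* allocations that had to go to the backing allocator */
    int stranded;  /* objects left in a cache the allocating thread never sees */
};

/*
 * per_thread != 0: each thread owns a private cache.
 * per_thread == 0: both threads share one cache because they run on the same cpu.
 * Each round the interrupt thread allocates one object and the protocol
 * thread frees it.
 */
struct sim_result simulate_handoff(int per_thread, int rounds)
{
    int cached[NTHREADS] = { 0, 0 };
    struct sim_result r = { 0, 0, 0 };
    int alloc_slot = per_thread ? INTR_THREAD : 0;
    int free_slot  = per_thread ? PROTO_THREAD : 0;

    for (int i = 0; i < rounds; i++) {
        if (cached[alloc_slot] > 0) {
            cached[alloc_slot]--;
            r.hits++;
        } else {
            r.misses++;
        }
        cached[free_slot]++;  /* the consumer frees into its own view of the cache */
    }
    r.stranded = per_thread ? cached[PROTO_THREAD] : 0;
    return r;
}
```

Under these assumptions the shared per-cpu cache misses only on the first
round, while the per-thread variant misses every round and accumulates
freed objects in the consumer's cache, which is the draining and
balancing problem mentioned in the quoted text.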
It seems clear to me that it makes little sense to spend time
on a per-thread memory cache when a per-cpu memory cache is, just from
an algorithmic point of view, going to be far more effective, far
easier to manage, able to support larger (deterministic) hysteresis
without creating too much non-deterministic slop, and so on and so forth.
If this is just an experiment, and therefore something that will never
be committed, then I still don't understand why you are even wasting time
working on it when you could be fixing the slab allocator instead.
IMHO I do not believe that a per-thread memory cache would even come
close to characterizing the performance benefits of a mutexless per-cpu
cache. While the base overhead is similar, the side effects and
management requirements are going to be very different. If the
experiment is supposed to characterize the potential performance
improvement over the slab allocator, then I believe it is already
too flawed to be an accurate measure of that.
Now does this mean that caches in front of the slab allocator are bad?
No, it doesn't. I use front-end caches in several places in
DragonFly and you definitely want to do the same thing in FreeBSD.
The difference, I think, is that we only use front-end caches for
performance-critical subsystems and when we do, we implement the
code directly into the subsystem. We depend on our core slab allocator
(i.e. malloc()) far more than you seem to want to depend on yours.
FreeBSD seems to depend on its UMA type-stable zone allocator for the
same thing but that has a lot of extra, unnecessary overhead. The
MBUF allocator is a good example of this. You guys are making several
extra procedural calls that we aren't in the mbuf allocation path.
The way I see it, if it's important enough to require its own front-end
cache over simply using the slab allocator, then you might as well go
But it is also important to make your slab allocator fast so there
are fewer situations where you feel that you need to bypass it. A
front-end cache is not supposed to be a workaround for a slow
slab allocator but instead is supposed to be an ultra-optimized
implementation to reduce overhead in a critical subsystem.
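For illustration, a front-end cache coded directly into a subsystem might
look roughly like the sketch below. It is a single-threaded user-space
model with invented names (struct pkt, pkt_get, pkt_put, PKT_CACHE_MAX),
and it is not DragonFly's or FreeBSD's mbuf code. The free list is
intrusive (each free object stores the link in its own memory) and the
fast path is an inline function, so a cache hit involves no call into the
general allocator at all. malloc() stands in for the slab allocator. A
kernel version would also need per-cpu protection, which is omitted here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical subsystem object; next_free is only meaningful while free. */
struct pkt {
    struct pkt *next_free;
    char        payload[120];
};

#define PKT_CACHE_MAX 32  /* hypothetical bound on cached objects */

static struct pkt   *pkt_free_head;      /* intrusive free list */
static int           pkt_free_count;
static unsigned long pkt_backend_calls;  /* times we fell through to malloc()/free() */

/* Fast path: pop from the subsystem's own list; slow path: general allocator. */
static inline struct pkt *pkt_get(void)
{
    struct pkt *p = pkt_free_head;

    if (p != NULL) {
        pkt_free_head = p->next_free;
        pkt_free_count--;
        return p;
    }
    pkt_backend_calls++;
    return malloc(sizeof(*p));
}

/* Fast path: push onto the list; beyond the bound, hand back to the allocator. */
static inline void pkt_put(struct pkt *p)
{
    if (pkt_free_count < PKT_CACHE_MAX) {
        p->next_free = pkt_free_head;
        pkt_free_head = p;
        pkt_free_count++;
        return;
    }
    pkt_backend_calls++;
    free(p);
}
```

The pkt_backend_calls counter is there only to show how often the
subsystem still depends on the general allocator. A cache like this trims
overhead on a hot path, but the misses still pay the back-end cost, so it
does not make a slow allocator acceptable.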
:In doing so, my hope was to identify which areas will offer the most
:immediate performance benefits, be it simply cutting down on costly
:operations (such as the entropy harvesting code for Yarrow which appears
:to have found its way into our interrupt path), rethinking locking
:strategies, optimizing out/coalescing locking, optimizing out excess
:memory allocation, optimizing synchronization primitives with the same
:semantics, changing synchronization assumptions to offer weaker/stronger
:Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
:robert at fledge.watson.org Principal Research Scientist, McAfee Research
Well, I guess identifying the problem areas is good, but it isn't
rocket science. It should be glaringly obvious without requiring
all that much actual testing.
One big problem you face is that a great deal of the performance
issues in FreeBSD are not from any single subsystem, but from overheads
sprinkled throughout the entire codebase. Taken singly, these overheads
are not significant. A single mutex could be argued to be not all that
significant... but having to obtain and release 11 mutexes in a code
path *IS* significant. A single memory allocation might not be
significant, but having to make three or four in a critical path can be.
And so on, and so forth.
Testing performance with little tweaks here and there is not going to
give you any worthwhile results, IMHO. Just fixing one thing isn't
going to solve the performance problem. You have to fix the entire
path. It isn't JUST the mutex overhead that's the problem. It's
the mutex overhead, the scheduler overhead, all the myriad calls you
have to make to pin the thread, or enter a critical section (if you
have to do it too often)... it's the atomic-access requirement to the
per-cpu %fs:globaldata data which makes it impossible to cache
per-cpu data. It's the 4BSD/ULE scheduler being used to schedule
kernel threads, it's the thread switching overhead, the microtime calls
in the switch path, coding requirements to deal with preemption, cpu
migration, giant lock handling, uma zone allocator's callback API,
and a dozen other things. Taken singly these items produce a
fractional degradation. Taken singly these items aren't necessarily
even 'bad'. Take them together and you have... well, you have the
situation you find yourself in now.
To really fix the problem you need to be willing to clean up ALL of
these subsystems. Well, first you need to recognize that they all need
to be cleaned up, and I gather that some FreeBSD developers still don't
recognize that as the problem. Once you recognize that it's a problem
you then need to go and do the work. Cleaning up the subsystems is only
the first step... it gives you a nice solid reasonably high performing
base to work with. You still have to deal with the higher-level
critical pathing issues once you've cleaned up the subsystems. This is
where things like front-end caches really show their stuff.
<dillon at backplane.com>