FreeBSD 5.3 Bridge performance take II

Matthew Dillon dillon at
Wed Sep 8 21:45:28 PDT 2004

:In the rwatson_umaperthread branch, what I've done is started to associate
:struct uma_cache structures with threads.  Since caches are "per-zone", I
:allow threads to register for zones of interest; these caches are hung off
:of struct thread, and must be explicitly registered and released.  While
:In practice, this eliminates mutex acquisition for mbuf allocation and
:free in the forwarding and bridging paths, and halves the number of
:operations when interacting with user threads (as they don't have the
:My interest in looking at per-thread caches was to explore ways in which
:to reduce the cost of zone allocation without making modifications to our
:synchronization model.  It has been proposed that a better way to achieve
:Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
:robert at      Principal Research Scientist, McAfee Research

    I would recommend against per-thread caches.  Instead, make the per-cpu
    caches actually *be* per-cpu (that is, not require a mutex).  This is
    what I do in DragonFly's Slab allocator.  For the life of me I just don't
    understand why one would spend so much effort creating a per-cpu caching
    subsystem and then slap a mutex right smack in the middle of the
    critical allocation and deallocation paths.  Non critical operations,
    such as high level zone management, can be done passively (in DragonFly's
    case through IPI messaging which, when I get to it, can be queued
    passively rather than actively), or by a helper thread which migrates
    to the cpu whose cache it needs to operate on, does its stuff, then
    migrates to the next cpu, or by any number of other clever mechanisms
    none of which require a brute-force mutex to access the data.
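    To make the argument concrete, here is a minimal user-space sketch
    (not the actual DragonFly or FreeBSD code) of a per-cpu object cache
    whose fast paths take no mutex.  The `curcpu` variable and the
    comments about disabling preemption stand in for kernel primitives;
    the names and sizes here are illustrative assumptions.

    ```c
    /* Hypothetical sketch: a per-cpu magazine of free objects that the
     * owning cpu accesses with preemption disabled instead of taking a
     * mutex.  Only the owning cpu ever touches its own cache. */
    #include <stddef.h>

    #define NCPU     4
    #define MAG_SIZE 32

    struct percpu_cache {
        void *items[MAG_SIZE];
        int   count;
    };

    static struct percpu_cache cache[NCPU];

    /* Stand-in for "current cpu".  In a real kernel the fast paths
     * below would run inside a critical section (preemption disabled),
     * which is what makes the lock-free access safe. */
    static int curcpu = 0;

    void *cache_alloc(void)
    {
        struct percpu_cache *c = &cache[curcpu];  /* no mutex needed */
        if (c->count > 0)
            return c->items[--c->count];          /* fast path: pop  */
        return NULL;  /* slow path: refill from the global zone (omitted) */
    }

    int cache_free(void *obj)
    {
        struct percpu_cache *c = &cache[curcpu];
        if (c->count < MAG_SIZE) {
            c->items[c->count++] = obj;           /* fast path: push */
            return 0;
        }
        return -1;    /* slow path: hand a full magazine back to the zone */
    }
    ```

    The only cross-cpu traffic is in the omitted slow paths, which is
    exactly where passive IPI messaging or a migrating helper thread
    can do the work without a mutex on the hot path.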

    I use this cpu migration trick for a number of things in DragonFly.
    Jeff and I use it for wildcard pcb registration (which is replicated
    across cpus).  The thread list sysctl code collects per-cpu thread
    data by iterating through the cpus (migrating the thread to each cpu
    to collect the data and then ending up on the cpu it began on before
    returning to user mode).  Basically, any non-critical-path operation
    can use this trick in order to allow the real critical path -- the
    actual packet traffic, to operate without mutexes.
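    A user-space caricature of the migration trick follows.  The
    `migrate_to()` comments stand in for a hypothetical kernel
    primitive; here the "migration" is just a loop, but the invariant
    it illustrates is the real one: per-cpu data is only ever touched
    from its owning cpu, so the collector needs no lock.

    ```c
    /* Hypothetical sketch: instead of locking per-cpu data, a collector
     * "visits" each cpu in turn so that it is always the sole accessor
     * of that cpu's data.  In the kernel the visit would be an actual
     * thread migration (or a queued IPI message). */
    #include <stddef.h>

    #define NCPU 4

    static long percpu_count[NCPU];   /* e.g. per-cpu thread/packet counts */

    /* The owning cpu updates its own slot without any lock. */
    void percpu_add(int cpu, long n)
    {
        percpu_count[cpu] += n;
    }

    /* Collector: migrate to each cpu, read its data, move on. */
    long collect_counts(void)
    {
        long total = 0;
        for (int cpu = 0; cpu < NCPU; cpu++) {
            /* migrate_to(cpu);  -- hypothetical kernel primitive */
            total += percpu_count[cpu];
        }
        return total;
    }
    ```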

    So, instead of adding more hacks, please just *fix* the slab allocator
    in FreeBSD-5.  You will find that a lot of things you were
    contemplating writing additional subsystems for will suddenly work
    (and work very efficiently) by just calling the slab allocator directly.

    The problem with per-thread caching is that you greatly increase the
    amount of waste in the system.  If you have 50 threads each with their
    own per-thread cache and a hysteresis of, say, 32 allocations, you
    wind up with 50*32 = 1600 allocations worth of potential waste.  With
    a per-cpu case the slop is a lot more deterministic (since the number of
    cpus is a fixed, known quantity).  Another problem with per-thread 
    caching is that it greatly reduces performance in certain common
    allocation cases... in particular the case where data is allocated by
    one subsystem (say, an interrupt thread) and freed by another subsystem
    (say, a protocol thread or other consumer).  This sort of problem is 
    a lot easier to fix with a per-cpu cache organization and a lot harder
    to fix with a per-thread cache organization.
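    The slop arithmetic above reduces to caches times hysteresis, which
    the following trivial sketch spells out (the 50-thread / 32-object
    numbers are the ones used in the text; the 4-cpu figure is an
    illustrative assumption):

    ```c
    /* Worst-case cached-but-idle objects scale with the number of
     * caches times the cache hysteresis.  The thread count is unbounded
     * and workload-dependent; the cpu count is fixed and known. */
    long cache_slop(int ncaches, int hysteresis)
    {
        return (long)ncaches * hysteresis;
    }
    ```

    With 50 threads and a hysteresis of 32, cache_slop(50, 32) gives the
    1600 allocations of potential waste cited above; with 4 cpus the
    bound is only 128 and, more importantly, does not grow with load.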

					Matthew Dillon 
					<dillon at>

More information about the freebsd-current mailing list