FreeBSD 5.3 Bridge performance take II
rwatson at freebsd.org
Wed Sep 8 15:47:09 PDT 2004
On Tue, 7 Sep 2004, Gerrit Nagelhout wrote:
<lots of good stuff snipped so I can respond to one point quickly, and
then come back to the rest later today or tomorrow>
> From these numbers, the uma locks seem to get called twice for every
> packet, but have no collisions. All other locks have significant
> collision problems resulting in a lot of overhead.
UMA grabs per-cpu cache locks to check the per-cpu cache, from which
common-case allocations and frees occur. This is something I'm currently
exploring in a Perforce branch by associating caches with additional
objects/entities, such as threads and interfaces. The per-cpu locks serve
a number of functions, however, and there are trade-offs in looking at
weaker synchronization models. Here are some things that these locks do:
- Synchronize access to the per-cpu uma_cache for the zone in the event
that the caller gets preempted, leaving the cache consistent in the view
of the thread, not the CPU.
- Synchronize access in the event a thread migrates CPUs. As such, the
thread can continue to reference the cache from another CPU and finish
up whatever it's doing "safely".
- Allow global access to the per-cpu uma_caches for the purposes of
draining and statistics collection.
- Allow global access to per-cpu UMA caches for the purposes of destroying
a UMA zone.
In the rwatson_umaperthread branch, what I've done is started to associate
struct uma_cache structures with threads. Since caches are "per-zone", I
allow threads to register for zones of interest; these caches are hung off
of struct thread, and must be explicitly registered and released. While
this approach might not be desirable in the long term, it allowed me to
experiment at low implementation cost. In particular, for ithreads and
netisr's, I'm running with caches for each of the mbuf-related zones,
mbufs, clusters, and packets.
In practice, this eliminates mutex acquisition for mbuf allocation and
free in the forwarding and bridging paths, and halves the number of
operations when interacting with user threads (as they don't have the
caches set up). It also allows me to maintain these properties in the
presence of preemption, CPU migration, and load balancing. I've not yet
done a lot of performance measurement since I'm running into problems with
the if_em interface on the boxes I'm testing with wedging under high
packet queue depth; with lower depth using the UDP_RR test in netperf, I
see a several percent drop in processing latency. Without testing under
higher volume, though, it's hard to reason about the overall benefits. I
hope to be able to start doing more effective performance testing on this
in the near future. I had hoped for slightly better improvements from
removing those mutex operations; switching to direct dispatch in the
network stack, in contrast, has a far more dramatic effect on performance.
There are some immediate downsides, however:
- First is that the caches can no longer be accessed safely from other
  threads, so we can't drain per-thread caches except in the context of
  the owning thread.
- Since my experimental model maintains the notion that caches are
maintained per-zone and not across UMA, threads have to notify UMA in
advance as to what memory types are particularly important. This is
easy for ithreads on network drivers and the netisr, but harder in the
general case (which also matters :-).
- Removing zones is now harder, since global access to caches is
restricted in the current model.
My interest in looking at per-thread caches was to explore ways in which
to reduce the cost of zone allocation without making modifications to our
synchronization model. It has been proposed that a better way to achieve
the same results is to lower the cost of entering critical sections, which
would have the effect of pinning the thread to the current CPU (preventing
migration) and also preventing preemption. Right now, our critical
section cost is quite high (no measurements on hand), suggesting that
using locks on per-cpu structures doesn't actually put us in a worse
situation. Moving to critical sections would also complicate the act of
tearing down UMA zones (etc). In the per-thread UMA cache model, I gloss
over this (since it's experimentation) by simply declaring that zones
declared as supporting per-thread caching can't be destroyed.
One nice thing about using this experimental code is that I hope it will
allow us to reason more effectively about the extent to which improving
per-cpu data structures improves efficiency -- I can now much more easily
say "OK, what happens if we eliminate the cost of locking for commonplace
mbuf allocation/free". I've also started looking at per-interface caches
based on the same model, which has some similar limitations (but also some
similar benefits), such as stuffing per-interface uma caches in struct
ifnet.
Since there are additional costs associated with more extensive use of
critical sections (such as the impact on timely preemption, load
balancing, etc), we should be in a better position to do a useful
comparison as work is done to improve the performance of our critical
sections.
BTW, right now my primary areas of optimization and work focus for the
next few weeks in the stack are:
- Lowering costs (and eliminating costs) associated with entropy
harvesting in the interrupt and network paths. Right now it's somewhat
scary how much work is done there. If you're not already disabling
harvesting of entropy on interrupts and in network processing, you
really want to for performance purposes.
I have changes in the pipeline that halve the number of mutex operations
during harvesting of entropy, and reduce to O(4) the number of mutex
operations during entropy processing in the Yarrow thread (from O(4N)).
Also, there's some "hard work" going on for CPUs without cycle counters
due to timing information collection.
I'd like to eliminate the entropy harvesting point in ether_input() --
it strikes me as both redundant (called many times in close succession)
and incorrect (processes the wrong data).
- Running additional traces on the network processing path using KTR to
identify weaknesses in performance. In particular, to look at context
switching (especially for gratuitous wakeup and poorly timed thrashing,
delays in processing, et al), mutex acquisition/drop, excess or
gratuitous memory allocation, inefficient memory copies, etc.
- Spend additional time on IPv6 locking and safety in an MPSAFE kernel.
- Re-work BPF locking (and other aspects of its behavior) due to reported
bugs, locking weaknesses, etc.
- Continue work on KAME IPSEC locking.
- Measure contention in the pcbinfo locking models used by several
protocols, and start to identify locking strategies that mitigate that
contention (and hopefully lower cost also). I'm thinking of looking at
changing the reference model for so_pcb pointers into per-protocol
pcb's, since they currently tend to point at fairly heavy weight locking
models, but need to do some more research first.
- Start to explore models for processing packets in sets for
somewhat-indirect-dispatch. We do some fairly inefficient things, such
as fragmenting a datagram into many packets and passing them one-by-one
into network processing. We do that in other places also, such as
crossing layer boundaries, etc, etc.
Things I hope to see others working on (:-) include optimizing
synchronization primitives (such as mutexes, wakeup/sleep events, critical
sections, etc), performing similar sorts of analysis to the above, and
spending time on driver locking to see how efficiency can be improved. I
also measured substantial contention between send and receive paths in
heavy processing, but I'm not very familiar with our non-synthetic network
benchmarks.
Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org Principal Research Scientist, McAfee Research