FreeBSD 5.3 Bridge performance take II

Wed Sep 8 10:28:31 PDT 2004

Hi,

I have just finished some profiling and analysis of the RELENG_5_BP code 
running a standard 4-port ethernet bridge (not netgraph).  On the upside, 
some of the features such as the netperf stuff, MUTEX_PROFILING and 
UMA are very cool, and (I think) give the potential for a really fast bridge 
(or similar application).  However, the current performance is still rather 
poor compared to 4.x, but I think that with the groundwork now in place, and
some minor changes and a couple of new features, it can be made much much faster.
I would like to discuss some possible optimizations (will suggest some below), and
then we are willing to take on some of them, and give the code back to FreeBSD.
Hopefully these changes can be made on RELENG_5 to be used by 5.4.
The tests that I have run so far have focussed on the difference between 
running in polling mode (dual 2.8GHz Xeon, 2 2-port em NICs) versus interrupt 
mode (with debug.mpsafenet=1, and no INVARIANTS/WITNESS or anything 
like that).  In both setups I actually get similar throughput (300kpps total in 
and out divided evenly over the 4 ports).  I think it should be possible to
get >> 1Mpps bridging on this platform.

In the polling case, there is still only one active thread, and the limiting
factor seems to be simply the number of mutexes (11 per packet
according to MUTEX_PROFILING), and overhead from UMA, bus_dma, etc.  
With polling disabled, I think two things caused problems: PREEMPTION was
disabled (I can't even boot with it on), and some sub-optimal mutex usage
resulted in a lot of collisions, even though in theory all 4 cores should
be able to run simultaneously.

Here is a sample profile (while in polling mode).  The cpu idle, halt etc. entries
simply indicate that 3 of the cores have nothing to do, but the profile does give a
pretty good sense of where all the time is being spent.  There are definitely a lot
of cycles going to UMA, mutexes, etc.  (This profile only shows the top functions
and has the call tree disabled, i.e. it is interrupt-based profiling only, because
call-tree profiling slows the test down too much.)

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 18.4      10.25    10.25                             cpu_idle_default [1]
 13.8      17.94     7.69                             cpu_idle [2]
  6.5      21.57     3.63                             critical_exit [3]
  6.5      25.17     3.61                             _mtx_lock_spin [4]
  5.0      27.95     2.78                             uma_zalloc_arg [5]
  4.6      30.52     2.56                             cpu_halt [6]
  4.4      32.94     2.43                             uma_zfree_arg [7]
  3.9      35.12     2.18                             maybe_preempt [8]
  3.2      36.91     1.79                             bridge_in [9]
  2.8      38.46     1.55                             em_process_receive_interrupts [10]
  2.6      39.89     1.43                             _bus_dmamap_load_buffer [11]
  2.3      41.19     1.30                             bdg_forward [12]
  2.3      42.48     1.29                             mb_free_ext [13]
  1.8      43.49     1.01                             malloc_type_freed [14]
  1.7      44.44     0.95                             ether_input [15]
  1.7      45.39     0.94                             em_start [16]
  1.7      46.33     0.94                             _bus_dmamap_sync [17]
  1.5      47.18     0.84                             em_start_locked [18]
  1.2      47.85     0.68                             malloc_type_zone_allocated [19]
  1.2      48.52     0.67                             __mcount [20]
  1.2      49.17     0.65                             mb_ctor_pack [21]
  1.1      49.80     0.63                             em_encap [22]
  1.1      50.39     0.59                             free [23]
  1.0      50.94     0.56                             bus_dmamap_load_mbuf [24]
  0.9      51.46     0.51                             generic_bzero [25]
  0.9      51.96     0.50                             m_freem [26]
  0.8      52.42     0.46                             generic_bcopy [27]
  0.7      52.79     0.38                             em_get_buf [28]
  0.6      53.13     0.34                             em_clean_transmit_interrupts [29]
  0.5      53.42     0.29                             bus_dmamap_load [30]
  0.4      53.66     0.24                             m_adj [31]
  0.4      53.90     0.23                             malloc [32]
  0.4      54.11     0.22                             bus_dmamap_create [33]
  0.2      54.24     0.12                             bus_dmamem_free [35]
  0.2      54.35     0.11                             mb_dtor_pack [36]
  0.2      54.45     0.10                             em_tx_cb [37]
  0.2      54.54     0.09                             em_receive_checksum [38]
  0.1      54.61     0.08                             em_dmamap_cb [39]
  0.1      54.69     0.07                             m_tag_delete_chain [40]
  0.1      54.75     0.07                             _bus_dmamap_unload [41]
  0.1      54.82     0.06                             em_poll [42]
  0.1      54.88     0.06                             em_transmit_checksum_setup [43]
  0.1      54.93     0.05                             bus_dmamap_destroy [44]
  0.1      54.97     0.04                             _mtx_lock_sleep [47]
  0.1      55.00     0.03                             if_start [49]
  0.1      55.03     0.03                             bus_dmamap_load_uio [50]
  0.1      55.07     0.03    75189     0.00     0.00  netisr_poll [51]
  0.1      55.10     0.03                             em_smartspeed [52]
  0.1      55.13     0.03                             ithread_loop [34]

Here are the (top) results of the mutex profiling (these are basically all the locks
that get called once or twice per packet):

  max     total   count  avg  cnt_hold  cnt_lock  name
24344  37552473  309134  121    151712    101781  if_em.c:956 (em5)            (1)
31578  10548396  309131   34     44233     81751  if_em.c:3432 (em4)           (2)
  460   5813698  620705    9        16        79  uma_core.c:1800 (UMA pcpu)   (3)
  428   4304975  619846    6        26        24  uma_core.c:2206 (UMA pcpu)   (4)
  445   3129168  309127   10     30828     28115  bridge.c:1201 (em5)          (5)
  462   3125131  309127   10    125294    122560  bridge.c:816 (bridge)        (6)
  489   2815715  309134    9     14610     20050  if_em.c:926 (em5)            (7)
  450   2573019  309170    8     94471    101577  kern_malloc.c:185 (devbuf)   (8)
  419   2113089  309275    6     67982     65871  kern_malloc.c:210 (devbuf)   (9)

The line numbers will be close to RELENG_5_BP code but not exactly the same 
because of some local modifications, so here are the descriptions of the mutexes 
involved:
1) em_start  (used for transmit)
2) em_process_receive_interrupts (re-lock just after if_input)
3) uma_zalloc_arg (per CPU lock)
4) uma_zfree_arg (per CPU lock)
5) bdg_forward (IFQ_HANDOFF)
6) bridge_in (global bridge lock)
7) em_start_locked (IF_DEQUEUE)
8) malloc_type_zone_allocated
9) malloc_type_freed

From these numbers, the UMA locks seem to get called twice for every packet, 
but have no collisions.  All other locks have significant collision problems resulting
in a lot of overhead.

Based on these stats, I have come up with the following observations/suggestions/etc
that I would like to discuss.

As discussed before, there is a significant cost associated with every mutex.  I'd
like to be able to get down to less than 1 mutex per packet (on average) through this
path.  Some of the possibilities to do this are:
- Implement workQ's of packets (also suggested by Robert Watson in the past).  This
will reduce the mutexes in number 1, 2, 5, 6 & 7 above because it should be possible
to only take the lock for a queue of packets, instead of every one.
- Implement device level caching for the UMA mbuf zones.  If a driver could allocate
one bucket of mbufs at a time, no locking would be required per allocation.  The same
goes for the free side of things: if you can allocate an empty bucket, fill it up, and then
return it, only a couple of mutexes are required per bucket.  This would also reduce
the function call overhead for every packet.  This change should actually get rid
of most of the remaining mutex overhead.
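
To make the workQ idea concrete, here is a minimal userspace sketch (not kernel
code).  Everything in it is a hypothetical stand-in: struct pkt, struct pktq and
the pktq_* functions are made up for illustration, and the "lock" is only a
counter so that the two paths can be compared.  It just shows that splicing a
privately built chain of packets into a queue costs one lock round trip per
batch instead of one per packet.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 32

/* Hypothetical stand-ins, not FreeBSD kernel types. */
struct pkt {
    struct pkt *next;
};

struct pktq {
    struct pkt *head;
    struct pkt *tail;
    unsigned long lock_ops;   /* number of lock acquisitions so far */
};

/* The "lock" only counts acquisitions; it provides no real exclusion. */
static void pktq_lock(struct pktq *q)   { q->lock_ops++; }
static void pktq_unlock(struct pktq *q) { (void)q; }

/* Per-packet handoff: one lock round trip for every packet. */
static void pktq_enqueue_one(struct pktq *q, struct pkt *p)
{
    pktq_lock(q);
    p->next = NULL;
    if (q->tail != NULL)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    pktq_unlock(q);
}

/* Batched handoff: splice an already linked chain in under one lock. */
static void pktq_enqueue_chain(struct pktq *q, struct pkt *first,
    struct pkt *last)
{
    pktq_lock(q);
    last->next = NULL;
    if (q->tail != NULL)
        q->tail->next = first;
    else
        q->head = first;
    q->tail = last;
    pktq_unlock(q);
}

/* Link pkts[0..n-1] into a private chain without touching any lock. */
static void pkt_link(struct pkt *pkts, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i + 1 < n; i++)
        pkts[i].next = &pkts[i + 1];
    pkts[n - 1].next = NULL;
}

static size_t pktq_len(const struct pktq *q)
{
    size_t n = 0;
    for (const struct pkt *p = q->head; p != NULL; p = p->next)
        n++;
    return n;
}

/* Lock acquisitions needed to hand off BATCH packets one at a time. */
static unsigned long demo_per_packet(void)
{
    struct pkt pkts[BATCH];
    struct pktq q = { NULL, NULL, 0 };

    for (size_t i = 0; i < BATCH; i++)
        pktq_enqueue_one(&q, &pkts[i]);
    assert(pktq_len(&q) == BATCH);
    return q.lock_ops;
}

/* Lock acquisitions needed to hand off BATCH packets as one chain. */
static unsigned long demo_batched(void)
{
    struct pkt pkts[BATCH];
    struct pktq q = { NULL, NULL, 0 };

    pkt_link(pkts, BATCH);
    pktq_enqueue_chain(&q, &pkts[0], &pkts[BATCH - 1]);
    assert(pktq_len(&q) == BATCH);
    return q.lock_ops;
}
```

With these stand-ins, demo_per_packet() takes the lock 32 times and
demo_batched() takes it once for the same 32 packets.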

I think that one of the major reasons that polling with one thread had about the same
performance as interrupts with 4 threads/cores is that some of the mutexes are held
far too long, thus reducing parallelism.  The biggest culprit of this is in the em driver.
First of all, there is only one global lock for the driver, but there should be no reason
that the rx & tx paths couldn't be run simultaneously.  If we set up something like:
EM_TX_LOCK()
EM_TX_UNLOCK()
EM_RX_LOCK()
EM_RX_UNLOCK()
EM_LOCK() {EM_TX_LOCK(); EM_RX_LOCK()}
EM_UNLOCK() {EM_TX_UNLOCK(); EM_RX_UNLOCK()}
this driver will run much faster.  Even within the receive and transmit functions, 
the mutexes are held for a long time.  It should be possible to code in such a way
that the mutex is released before trying to free or allocate an mbuf.  This should
reduce the holding time and thus collisions a lot.
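
As a rough illustration of the split-lock idea, here is a userspace sketch.  It
is not the em driver: struct em_softc_sketch, the spin_* helpers and the demo
functions are hypothetical, and C11 atomic_flag spinlocks stand in for whatever
kernel mutexes the driver would really use.  The one deliberate difference from
the outline above is that EM_UNLOCK releases in the reverse order of EM_LOCK,
which is the usual convention when one path takes both locks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical softc holding one lock per direction. */
struct em_softc_sketch {
    atomic_flag tx_lock;
    atomic_flag rx_lock;
};

static void em_sketch_init(struct em_softc_sketch *sc)
{
    assert(sc != NULL);
    atomic_flag_clear(&sc->tx_lock);
    atomic_flag_clear(&sc->rx_lock);
}

/* Minimal spinlock helpers built on atomic_flag (stand-ins only). */
static void spin_acquire(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;   /* spin until the holder clears the flag */
}

static void spin_release(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Returns true if the lock was free and is now held by the caller. */
static bool spin_try(atomic_flag *l)
{
    return !atomic_flag_test_and_set_explicit(l, memory_order_acquire);
}

#define EM_TX_LOCK(sc)    spin_acquire(&(sc)->tx_lock)
#define EM_TX_UNLOCK(sc)  spin_release(&(sc)->tx_lock)
#define EM_RX_LOCK(sc)    spin_acquire(&(sc)->rx_lock)
#define EM_RX_UNLOCK(sc)  spin_release(&(sc)->rx_lock)

/* Whole-device lock: always TX then RX, released in reverse order. */
#define EM_LOCK(sc)    do { EM_TX_LOCK(sc); EM_RX_LOCK(sc); } while (0)
#define EM_UNLOCK(sc)  do { EM_RX_UNLOCK(sc); EM_TX_UNLOCK(sc); } while (0)

/* The RX lock can still be taken while the TX lock is held. */
static bool demo_rx_free_while_tx_held(void)
{
    struct em_softc_sketch sc;
    bool got_rx;

    em_sketch_init(&sc);
    EM_TX_LOCK(&sc);
    got_rx = spin_try(&sc.rx_lock);
    if (got_rx)
        EM_RX_UNLOCK(&sc);
    EM_TX_UNLOCK(&sc);
    return got_rx;
}

/* EM_LOCK() excludes both paths at once (e.g. for init or ioctl). */
static bool demo_full_lock_blocks_both(void)
{
    struct em_softc_sketch sc;
    bool tx_busy, rx_busy;

    em_sketch_init(&sc);
    EM_LOCK(&sc);
    tx_busy = !spin_try(&sc.tx_lock);
    rx_busy = !spin_try(&sc.rx_lock);
    EM_UNLOCK(&sc);
    return tx_busy && rx_busy;
}
```

Any path that needs both locks has to take them in the same order everywhere
(TX then RX here), otherwise two threads could deadlock against each other.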

When overloading the bridge in interrupt mode, the system becomes completely
unresponsive (can't even get into ddb) until the packet source is removed.  This is
highly undesirable behaviour, but interrupt mode is currently the only way to use
multiple kernel threads to handle the workload.
Extending polling to use multiple threads instead of one should work around this
problem.  This is a bit of a design in itself, and probably worthy of a separate 
discussion.  We are certainly willing to give this a shot (hopefully with some
external input).

The latest generation Xeons (Nocona) have a couple of new features that are
very useful for optimizing code.  One of them is the ability to prefetch a cache line
for which a page is not yet in the TLB.  It should be possible to strategically sprinkle
a few prefetches in the code, and get a big performance boost.  This is probably
pretty platform specific though, so I don't know how to do this in general because
it will only benefit some platforms (don't know about AMD/alpha), and may slightly
hurt some others.
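
For illustration only, here is what a "sprinkled" prefetch could look like in a
receive-style loop.  This is a userspace sketch: struct pkt_hdr,
count_ethertype() and the PREFETCH_READ macro are hypothetical names, and the
only real facility used is the GCC/Clang builtin __builtin_prefetch(), which
compiles to nothing useful on targets without a prefetch instruction.  Whether
it helps on a given CPU would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fall back to a no-op on compilers without the builtin. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_READ(p) ((void)(p))
#endif

/* Hypothetical stand-in for the start of a received frame. */
struct pkt_hdr {
    uint8_t  dst[6];
    uint8_t  src[6];
    uint16_t ethertype;   /* kept in host order in this sketch */
};

/*
 * Walk a ring of header pointers and count the frames of one type.
 * While frame i is examined, a prefetch for frame i + 1 is issued so
 * that its cache line may already be on the way when we get there.
 */
static size_t count_ethertype(struct pkt_hdr *const *ring, size_t n,
    uint16_t type)
{
    size_t hits = 0;

    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            PREFETCH_READ(ring[i + 1]);
        if (ring[i]->ethertype == type)
            hits++;
    }
    return hits;
}

/* Small self-check: two of the four frames carry type 0x0800. */
static size_t demo_count_ipv4(void)
{
    struct pkt_hdr h[4];
    struct pkt_hdr *ring[4];

    memset(h, 0, sizeof(h));
    for (size_t i = 0; i < 4; i++)
        ring[i] = &h[i];
    h[0].ethertype = 0x0800;
    h[1].ethertype = 0x0806;
    h[2].ethertype = 0x0800;
    h[3].ethertype = 0x86dd;
    return count_ethertype(ring, 4, 0x0800);
}
```

Hiding the builtin behind a macro is one way to deal with the platform question:
platforms where the prefetch hurts could define the macro to nothing.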

In terms of cache efficiency, I am not sure that using the UMA mbuf packet zone
is the best way to go.  To be able to put a cluster on a DMA descriptor, you 
currently need to read the mbuf header to get its pointer.  It may be more efficient
for the local cache to hold clusters and mbufs separately.  To allocate a cluster you 
just need to read the bucket array, and can add the cluster to the descriptor without
having anything but the array itself in cache.  Once the packet is filled up, it can
be coupled to an mbuf header.  The other advantage of this is that, because pointers
for both are always easily available in an array, they lend themselves well to s/w 
prefetching.
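
A minimal sketch of such a cluster bucket, in userspace C, might look like the
following.  None of these names exist in FreeBSD: struct cl_bucket,
struct hdr_sketch and the bucket_* and hdr_attach functions are hypothetical,
and the bucket size is arbitrary.  The point is only that taking a cluster reads
nothing but the pointer array, and the header is attached later.

```c
#include <assert.h>
#include <stddef.h>

#define BUCKET_SIZE 16

/* Hypothetical per-device cache of bare clusters. */
struct cl_bucket {
    void  *cl[BUCKET_SIZE];
    size_t count;
};

/* Hypothetical stand-in for an mbuf header. */
struct hdr_sketch {
    void  *data;
    size_t len;
};

/* Take a cluster for a descriptor; only the bucket array is read. */
static void *bucket_get(struct cl_bucket *b)
{
    if (b->count == 0)
        return NULL;   /* a real cache would swap in a full bucket here */
    return b->cl[--b->count];
}

/* Return a cluster to the cache; fails when the bucket is full. */
static int bucket_put(struct cl_bucket *b, void *cl)
{
    if (b->count == BUCKET_SIZE)
        return -1;     /* a real cache would hand the full bucket back */
    b->cl[b->count++] = cl;
    return 0;
}

/* Couple a filled cluster to a header once the packet has arrived. */
static void hdr_attach(struct hdr_sketch *h, void *cl, size_t len)
{
    h->data = cl;
    h->len = len;
}

/* Self-check: cache two clusters, take one, attach it to a header. */
static int demo_bucket(void)
{
    static char storage[2][2048];
    struct cl_bucket b = { { NULL }, 0 };
    struct hdr_sketch h = { NULL, 0 };
    void *cl;

    bucket_put(&b, storage[0]);
    bucket_put(&b, storage[1]);
    cl = bucket_get(&b);          /* LIFO: the most recently cached one */
    hdr_attach(&h, cl, 64);
    return cl == (void *)storage[1] && h.data == cl && h.len == 64 &&
        b.count == 1;
}
```

Because the next few clusters to be handed out are simply the entries just below
count in the array, their addresses are known ahead of time, which is what makes
them easy targets for software prefetching.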

The choice of schedulers, and use of PREEMPTION will probably make a bit of a 
difference for these tests too, but I did not do much experimentation because I 
couldn't even boot with the ULE scheduler & PREEMPTION enabled.  I suspect
that preemption will help quite a bit when there are mutex collisions.

This is all I have for now.  As I mentioned previously, I'd like to generate some 
discussion on some of these points, as well as hear ideas for additional optimizations.
We will definitely implement some of these features ourselves, but would much
rather give back the code and make this a "cooperative effort".
Also, I haven't done any testing on the netgraph side of things yet, but that will
probably be next on the list.
Comments?
Thanks,

Gerrit Nagelhout