FreeBSD IP Forwarding performance (question, and some info) [7-stable, current, em, smp]

Mon Jul 7 15:35:26 UTC 2008

On Mon, 7 Jul 2008, Andre Oppermann wrote:

> Paul,
>
> to get a systematic analysis of the performance please do the following
> tests and put them into a table for easy comparison:
>
> 1. inbound pps w/o loss with interface in monitor mode (ifconfig em0 
> monitor)
>...

I won't be running many of these tests, but found this one interesting --
I didn't know about monitor mode.  It gives the following behaviour:

-monitor ttcp receiving on bge0 at 397 kpps: 35% idle (8.0-CURRENT) 13.6 cm/p
  monitor ttcp receiving on bge0 at 397 kpps: 83% idle (8.0-CURRENT)  5.8 cm/p
-monitor ttcp receiving on em0  at 580 kpps:  5% idle (~5.2)        12.5 cm/p
  monitor ttcp receiving on em0  at 580 kpps: 65% idle (~5.2)         4.8 cm/p

cm/p = cache misses per packet, counted as k8-dc-misses (bge0 system)
cm/p = cache misses per packet, counted as k7-dc-misses (em0 system)

So it seems that the major overheads are not near the driver (as I already
knew), and upper layers are responsible for most of the cache misses.
The packet header is accessed even in monitor mode, so I think most of
the cache misses in upper layers are not related to the packet header.
Maybe they are due mainly to perfect non-locality for mbufs.

Other cm/p numbers:

ttcp sending on bge0 at 640 kpps: (~5.2)                11 cm/p
ttcp sending on bge0 at 580 kpps: (8.0-CURRENT)          9 cm/p
     (-current is 10% slower despite having lower cm/p.  This seems to be
     due to extra instructions executed)
ping -fq -c1000000 localhost at 171 kpps: (8.0-CURRENT) 12-33 cm/p
     (This is certainly CPU-bound.  lo0 is much slower than bge0.
     Latency (rtt) is 2 us.  It is 3 us in ~5.2 and was 4 in -current until
     very recently.)
ping -fq -c1000000 etherhost at  40 kpps: (8.0-CURRENT)    55 cm/p
     (The rate is quite low because flood ping doesn't actually flood.
     It tries to limit the rate to max(100, 1/latency), but it tends to
     go at a rate of ql(t)/latency where ql(t) is the average hardware
     queue length at the current time t.  ql(t) starts at 1 and builds up
     after a minute or 2 to a maximum of about 10 on my hardware.
     Latency is always ~100 us, so the average ql(t) must have been ~4.)
ping -fq -c1000000 etherhost at  20 kpps: (8.0-CURRENT)    45 cm/p
     (Another run to record the average latency (it was 121) showed high
     variance.)
netblast sending on bge0 at 582 kpps: (8.0-CURRENT)      9.8 cm/p
     (Packet blasting benchmarks actually flood, unlike flood ping.
     This is hard to implement, since select() for output-ready doesn't
     work.  netblast has to busy wait, while ttcp guesses how long to
     sleep but cannot sleep for a short enough interval unless queues
     are too large or hz is too small.  My systems are configured with
     HZ = 100 and snd.ifq too large so that sleeping for 1/HZ works for
     ttcp.  netblast still busy-waits.

     This gives an interesting difference for netblast.  It tries to send
     800 k packets in 1 second but only successfully sends 582 k.  9.8
     cm/p is for #misses / 582k.  The 300k unsuccessful sends apparently
     don't cause many cache misses.  But variance is high...)
ttcp sending on bge0 at 577 kpps: (8.0-CURRENT)         15.5 cm/p
     (Another run shows high variance.)
ttcp rates have low variance for a given kernel but high variance for
different kernels (an extra unrelated byte in the text section can
cause a 30% change).
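
To make the netblast accounting above concrete, here is a minimal
userland sketch of a busy-waiting sender.  It is not the netblast
source: blast(), struct blast_stats and misses_per_packet() are names
made up for illustration, and it sends a fixed count of packets instead
of running for a fixed time.  It assumes that a full send queue shows
up as ENOBUFS from sendto(), so that failed attempts can be counted
separately and cm/p can be computed from successful sends only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical counters; attempted == sent + enobufs + other_errors. */
struct blast_stats {
	unsigned long attempted;
	unsigned long sent;
	unsigned long enobufs;		/* assumed to mean "queue full" */
	unsigned long other_errors;
};

/*
 * Busy-wait sender: never sleeps and never select()s for output-ready.
 * A failed send is simply counted and the loop tries again at once.
 */
void
blast(int fd, const struct sockaddr *to, socklen_t tolen,
    const void *payload, size_t len, unsigned long count,
    struct blast_stats *st)
{
	unsigned long i;

	assert(st != NULL);
	st->attempted = st->sent = st->enobufs = st->other_errors = 0;
	for (i = 0; i < count; i++) {
		st->attempted++;
		if (sendto(fd, payload, len, 0, to, tolen) >= 0)
			st->sent++;
		else if (errno == ENOBUFS)
			st->enobufs++;
		else
			st->other_errors++;
	}
}

/* cm/p as used above: misses divided by successful sends only. */
double
misses_per_packet(unsigned long long misses, unsigned long sent)
{
	return (sent == 0 ? 0.0 : (double)misses / (double)sent);
}
```

The miss count itself would come from a hardware counter sampled around
the run; the sketch only shows how the attempted/sent split feeds the
per-packet figure.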

High variance would also be explained by non-locality of mbufs.  Cycling
through lots of mbufs would maximize cache misses but random reuse of
mbufs would give variance.  Or the cycling and variance might be more
in general allocation.  There is silliness in getsockaddr():  sendit()
calls getsockaddr() and getsockaddr() always uses malloc(), but
allocation on the stack works for the call from sendit().  This
malloc() seemed to be responsible for a cache miss or two, but when I
changed it to use the stack the results were inconclusive.
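
The two allocation strategies can be sketched in userland as follows.
This is an analogy only, not the kernel code: copy_addr_heap() and
copy_addr_stack() are hypothetical names, memcpy() stands in for the
copy from user space, and sizeof(struct sockaddr_storage) is assumed as
the length limit.  The point is just that the caller-supplied buffer
variant needs no allocator call per send.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Shared length check; the limits used here are assumptions. */
int
check_addr_len(size_t len)
{
	if (len > sizeof(struct sockaddr_storage))
		return (ENAMETOOLONG);
	if (len < offsetof(struct sockaddr, sa_data))
		return (EINVAL);
	return (0);
}

/* Heap variant: one malloc() per call; the caller must free(*out). */
int
copy_addr_heap(struct sockaddr **out, const void *src, size_t len)
{
	struct sockaddr *sa;
	int error;

	assert(out != NULL && src != NULL);
	if ((error = check_addr_len(len)) != 0)
		return (error);
	if ((sa = malloc(len)) == NULL)
		return (ENOMEM);
	memcpy(sa, src, len);
	*out = sa;
	return (0);
}

/* Stack variant: the caller passes a buffer, e.g. a local variable. */
int
copy_addr_stack(struct sockaddr_storage *buf, const void *src, size_t len)
{
	int error;

	assert(buf != NULL && src != NULL);
	if ((error = check_addr_len(len)) != 0)
		return (error);
	memcpy(buf, src, len);
	return (0);
}
```

A caller such as a send path would declare a struct sockaddr_storage
local and pass its address, which is only valid while that frame is
live.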

Bruce