FreeBSD IP forwarding performance (question, and some info) [7-stable, current, em, smp]

Bruce Evans brde at optusnet.com.au
Thu Jul 3 10:42:02 UTC 2008


On Thu, 3 Jul 2008, Paul wrote:

> Bruce Evans wrote:
>>> No polling:
>>> 843762 25337   52313248          1     0        178     0
>>>   763555     0   47340414          1     0        178     0
>>>   830189     0   51471722          1     0        178     0
>>>   838724     0   52000892          1     0        178     0
>>>   813594   939   50442832          1     0        178     0
>>>   807303   763   50052790          1     0        178     0
>>>   791024     0   49043492          1     0        178     0
>>>   768316  1106   47635596          1     0        178     0
>>> Machine is maxed and is unresponsive..
>> 
>> That's the most interesting one.  Even 1% packet loss would probably
>> destroy performance, so the benchmarks that give 10-50% packet loss
>> are uninteresting.
>> 
> But you realize that it's outputting all of these packets on em3  and I'm 
> watching them coming out
> and they are consistent with the packets received on em0 that netstat shows 
> are 'good' packets.

Well, output is easier.  I don't remember seeing the load on a taskq for
em3.  If there is a memory bottleneck, it might or might not be more related
to running only 1 taskq per interrupt, depending on how independent the
memory system is for different CPUs.  I think Opterons have more independence
here than most x86's.

> I'm using a server Opteron which supposedly has the best memory performance 
> out of any CPU right now.
> Plus Opterons have the biggest L1 cache, but a small L2 cache.  Do you think 
> a larger L2 cache on the Xeon (6 MB for 2 cores) would be better?
> I have a 2222 Opteron coming which is 1 GHz faster, so we will see what happens.

I suspect lower latency memory would help more.  Big memory systems
have inherently higher latency.  My little old A64 workstation and
laptop have main memory latencies about a third of those of freebsd.org's
new Core2 servers according to lmbench2 (42 nsec for the overclocked
DDR PC3200 one and 55 nsec for the DDR2 PC5400 (?) one, vs 145-155 nsec).
If there are a lot of cache misses, then the extra 100 nsec can be
important.  Profiling of sendto() using hwpmc or perfmon shows a
significant number of cache misses per packet (2 or 10?).
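
A back-of-the-envelope sketch of that (the rate, latency gap, and miss
counts are just the figures above; nothing here is a new measurement):

%%%
/*
 * Sketch: how much of the per-packet CPU budget the extra main memory
 * latency can eat.  ~800 kpps leaves ~1250 nsec of CPU per packet, so
 * a few ~100 nsec cache misses per packet are already significant.
 */
#include <stdio.h>

int
main(void)
{
	double pps = 800e3;		/* forwarding rate from above */
	double budget_ns = 1e9 / pps;	/* CPU time available per packet */
	double extra_ns = 100;		/* Core2 server vs A64 latency gap */
	int misses[] = { 2, 10 };	/* cache misses/packet from profiling */
	int i;

	for (i = 0; i < 2; i++)
		printf("%2d misses: %4.0f of %.0f nsec budget (%2.0f%%)\n",
		    misses[i], misses[i] * extra_ns, budget_ns,
		    100 * misses[i] * extra_ns / budget_ns);
	return (0);
}
%%%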

>>> Polling ON:
>>>         input          (em0)           output
>>>  packets  errs      bytes    packets  errs      bytes colls
>>>   784138 179079   48616564          1     0        226     0
>>>   788815 129608   48906530          2     0        356     0
>>> Machine is responsive and has 40% idle CPU.. Why ALWAYS 40%?  I'm really 
>>> mystified by this..
>> 
>> Is this with hz=2000 and 256/256 and no polling in idle?  40% is easy
>> to explain (perhaps incorrectly).  Polling can then read at most 256
>> descriptors every 1/2000 second, giving a max throughput of 512 kpps.
>> Packets < descriptors in general but might be equal here (for small
>> packets).  You seem to actually get 784 kpps, which is too high even
>> in descriptors, but matches exactly if the errors are counted
>> twice (784 - 179 - 505 ~= 512).  CPU is getting short too, but 40%
>> still happens to be left over after giving up at 512 kpps.  Most of
>> the errors are probably handled by the hardware at low cost in CPU by
>> dropping packets.  There are other types of errors but none except
>> dropped packets is likely.
>> 
> Read above, it's actually transmitting 770kpps out of em3 so it can't just be 
> 512kpps.

Transmitting is easier, but with polling it is even harder to exceed
hz * queue_length when sending than when receiving.  This is without
polling in idle.
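
To spell out that ceiling (a sketch using the settings quoted above;
the sysctl name in the comment is my assumption about what the 256/256
setting refers to):

%%%
/*
 * Sketch of the polling ceiling: with no polling in idle, each interface
 * is polled at most hz times per second and each poll handles at most a
 * fixed burst of descriptors (the 256/256 above is presumably the
 * kern.polling.burst_max setting), so receive throughput is capped near
 * hz * burst packets per second.
 */
#include <stdio.h>

int
main(void)
{
	int hz = 2000;		/* kern.hz in the setup above */
	int burst_max = 256;	/* descriptors handled per poll */

	printf("rx ceiling ~= %d pps (%d kpps)\n",
	    hz * burst_max, hz * burst_max / 1000);
	return (0);
}
%%%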

> I was thinking of trying 4 or 5.. but how would that work with this new 
> hardware?

Poorly, except possibly with polling in FreeBSD-4.  FreeBSD-4 generally
has lower overheads and latency, but is missing important improvements
(mainly tcp optimizations in upper layers, better DMA and/or mbuf
handling, and support for newer NICs).  FreeBSD-5 also lacks FreeBSD-4's
overhead+latency advantage.

Here are some benchmarks (ttcp mainly tests sendto(); 4.10 em needed a
2-line change to support a not-so-new PCI em NIC).  Summary:
- my bge NIC can handle about 600 kpps on my faster machine, but only
   achieves 300 in 4.10 unpatched.
- my em NIC can handle about 400 kpps on my slower machine, except in
   later versions it can receive at about 600 kpps.
- only 6.x and later can achieve near wire throughput for 1500-MTU
   packets (81 kpps vs 76 kpps).  This depends on better DMA or mbuf
   handling...  I now remember the details -- it is mainly better mbuf
   handling: old versions split the 1500-MTU packets into 2 mbufs, and
   this costs 2 descriptors per packet, which adds extra software
   overhead and even larger overhead for the hardware (see the sketch
   below).
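
To illustrate the mbuf point, a simplified userland sketch (not the real
em(4) transmit path; the stand-in struct and the split point are made up
for illustration):

%%%
/*
 * Simplified sketch (not the real em(4) code): model an mbuf chain with
 * a stand-in struct and count TX descriptors when the driver uses one
 * descriptor per mbuf segment.  A contiguous 1500-byte packet needs one
 * descriptor; the same packet split into two mbufs needs two, doubling
 * the per-packet descriptor and DMA setup cost.
 */
#include <stdio.h>

struct fake_mbuf {			/* stand-in for struct mbuf */
	struct fake_mbuf *m_next;	/* next segment in the chain */
	int m_len;			/* bytes in this segment */
};

static int
tx_descriptors_needed(const struct fake_mbuf *m)
{
	int nsegs = 0;

	for (; m != NULL; m = m->m_next)
		if (m->m_len > 0)
			nsegs++;
	return (nsegs);
}

int
main(void)
{
	/* Old kernels: header mbuf plus cluster (split point is made up). */
	struct fake_mbuf tail = { NULL, 1500 - 256 };
	struct fake_mbuf split = { &tail, 256 };
	/* Newer kernels: the whole packet in one contiguous mbuf cluster. */
	struct fake_mbuf contig = { NULL, 1500 };

	printf("split packet:      %d descriptors\n",
	    tx_descriptors_needed(&split));
	printf("contiguous packet: %d descriptors\n",
	    tx_descriptors_needed(&contig));
	return (0);
}
%%%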

%%%
Results of benchmarks run on 23 Feb 2007:

my~5.2 bge --> ~4.10 em
                              tx                      rx
                      kpps   load%    ips    kpps    load%    ips
ttcp -l5    -u -t     639     98    1660     398*     77      8k
ttcp -l5       -t     6.0    100    3960     6.0       6    5900
ttcp -l1472 -u -t      76     27     395      76      40      8k
ttcp -l1472    -t      51     40     11k      51      26      8k

(*) Same as sender according to netstat -I, but systat -ip shows that
     almost half aren't delivered to upper layers.

my~5.2 bge --> 4.11 em
                              tx                      rx
                      kpps   load%    ips    kpps    load%    ips
ttcp -l5    -u -t     635     98    1650     399*     74      8k
ttcp -l5       -t     5.8    100    3900     5.8       6    5800
ttcp -l1472 -u -t      76     27     395      76      32      8k
ttcp -l1472    -t      51     40     11k      51      25      8k

(*) Same as sender according to netstat -I, but systat -ip shows that
     almost half aren't delivered to upper layers.

my~5.2 bge --> my~5.2 em
                              tx                      rx
                      kpps   load%    ips    kpps    load%    ips
ttcp -l5    -u -t     638     98    1660     394*    100-     8k
ttcp -l5       -t     5.8    100    3900     5.8       9    6000
ttcp -l1472 -u -t      76     27     395      76      46      8k
ttcp -l1472    -t      51     40     11k      51      35      8k

(*) Same as sender according to netstat -I, but systat -ip shows that
     almost half aren't delivered to upper layers.  With the em rate
     limit on ips changed from 8k to 80k, about 95% are delivered up.

my~5.2 bge --> 6.2 em
                              tx                      rx
                      kpps   load%    ips    kpps    load%    ips
ttcp -l5    -u -t     637     98    1660     637     100-    15k
ttcp -l5       -t     5.8    100    3900     5.8       8     12k
ttcp -l1472 -u -t      76     27     395      76      36     16k
ttcp -l1472    -t      51     40     11k      51      37     16k

my~5.2 bge --> ~current em-fastintr
                              tx                      rx
                      kpps   load%    ips    kpps    load%    ips
ttcp -l5    -u -t     641     98    1670     641      99      8k
ttcp -l5       -t     5.9    100    2670     5.9       7      6k
ttcp -l1472 -u -t      76     27     395      76      35      8k
ttcp -l1472    -t      52     43     11k      52      30      8k

~6.2 bge --> ~current em-fastintr
                              tx                      rx
                      kpps   load%    ips    kpps    load%    ips
ttcp -l5    -u -t     309     62    1600     309      64      8k
ttcp -l5       -t     4.9    100    3000     4.9       6      7k
ttcp -l1472 -u -t      76     27     395      76      34      8k
ttcp -l1472    -t      54     28    6800      54      30      8k

~current bge --> ~current em-fastintr
                              tx                      rx
                      kpps   load%    ips    kpps    load%    ips
ttcp -l5    -u -t     602    100    1570     602      99      8k
ttcp -l5       -t     5.3    100    2660     5.3       5    5300
ttcp -l1472 -u -t      81#    19     212      81#     38      8k
ttcp -l1472    -t      53     34     11k      53      30      8k

(#) Wire speed to within 0.5%.  This is the only kpps figure in this set of
     benchmarks that is close to wire speed.  Older kernels apparently
     lose relative to -current because mbufs for mtu-sized packets are
     not contiguous in older kernels.
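
For reference, the wire-speed arithmetic behind the (#) figure (standard
gigabit Ethernet framing overheads; a sketch, not part of the
measurements):

%%%
/*
 * Each 1500-byte IP packet occupies 1500 + 14 (Ethernet header) + 4 (FCS)
 * + 8 (preamble) + 12 (inter-frame gap) = 1538 bytes of gigabit wire
 * time, i.e. about 81.3 kpps at line rate.
 */
#include <stdio.h>

int
main(void)
{
	double link_bps = 1e9;				/* gigabit Ethernet */
	double wire_bytes = 1500 + 14 + 4 + 8 + 12;	/* per 1500-MTU packet */

	printf("1500-MTU wire rate ~= %.1f kpps\n",
	    link_bps / (wire_bytes * 8) / 1e3);
	return (0);
}
%%%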

Old results:

~4.10 bge --> my~5.2 em
                              tx                      rx
                      kpps   load%    ips    kpps    load%    ips
ttcp -l5    -u -t     n/a    n/a     n/a     346      79      8k
ttcp -l5       -t     n/a    n/a     n/a     5.4      10    6800
ttcp -l1472 -u -t     n/a    n/a     n/a      67      40      8k
ttcp -l1472    -t     n/a    n/a     n/a      51      36      8k

~4.10 kernel, =4 bge --> ~current em
                              tx                      rx
                      kpps   load%    ips    kpps    load%    ips
ttcp -l5    -u -t     n/a    n/a     n/a     347      96     14k
ttcp -l5       -t     n/a    n/a     n/a     5.8      10     14k
ttcp -l1472 -u -t     n/a    n/a     n/a      67      62     14k
ttcp -l1472    -t     n/a    n/a     n/a      52      40     16k

~4.10 kernel, =4+ bge --> ~current em
                              tx                      rx
                      kpps   load%    ips    kpps    load%    ips
ttcp -l5    -u -t     n/a    n/a     n/a     627     100      9k
ttcp -l5       -t     n/a    n/a     n/a     5.6       9     13k
ttcp -l1472 -u -t     n/a    n/a     n/a      68      63     14k
ttcp -l1472    -t     n/a    n/a     n/a      54      44     16k
%%%

%%%
Results of benchmarks run on 28 Dec 2007:

~5.2 epsplex (em) ttcp:
                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
local no sink:        825k    3 206k  229 412k     52.1  45.1   2.8
local with sink:      659k    3 263k  231 131k     66.5  27.3   6.2
tx remote no sink:     35k    3 273k 8237 266k     42.0  52.1   2.3   3.6
tx remote with sink:   26k    3 394k 8224  100     60.0  5.41   3.4  11.2
rx remote no sink:     25k    4   26 8237 373k     20.6  79.4   0.0   0.0
rx remote with sink:   30k    3 203k 8237 398k     36.5  60.7   2.8   0.0

6.3-PR besplex (em) ttcp:
                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
local no sink:        417k    1 208k 418k    2     49.5  48.5   2.0
local with sink:      420k    1 276k 145k    2     70.0  23.6   6.4
tx remote no sink:     19k    2 250k 8144    2     58.5  38.7   2.8   0.0
tx remote with sink:   16k    2 361k 8336    2     72.9  24.0   3.1   4.4
rx remote no sink:     429    3   49  888    2      0.3  99.33  0.0   0.4
rx remote with sink:   13k    2 316k 5385    2     31.7  63.8   3.6   0.8

8.0-C epsplex (em-fast) ttcp:
                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
local no sink:        442k    3 221k  230 442k     47.2  49.6   2.7
local with sink:      394k    3 262k  228 131k     72.1  22.6   5.3
tx remote no sink:     17k    3 226k 7832  100     94.1   0.2   3.0   0.0
tx remote with sink:   17k    3 360k 7962  100     91.7   0.2   3.7   4.4
rx remote no sink:     saturated -- cannot update systat display
rx remote with sink:   15k    6 358k 8224  100     97.0   0.0   2.5   0.5

~4.10 besplex (bge) ttcp:
                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
local no sink:          15    0 425k  228   11     96.3   0.0   3.7
local with sink:        **    0 622k  229   **     94.7   0.3   5.0
tx remote no sink:      29    1 490k 7024   11     47.9  29.8   4.4  17.9
tx remote with sink:    26    1 635k 1883   11     65.7  11.4   5.6  17.3
rx remote no sink:       5    1   68 7025    1      0.0  47.3   0.0  52.7
rx remote with sink:  6679    2 365k 6899   12     19.7  29.2   2.5  48.7

~5.2-C besplex (bge) ttcp:
                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
local no sink:          1M    3 271k  229 543k     50.7  46.8   2.5
local with sink:        1M    3 406k  229 203k     67.4  28.2   4.4
tx remote no sink:     49k    3 474k  11k 167k     52.3  42.7   5.0   0.0
tx remote with sink:  6371    3 641k 1900  100     76.0  16.8   6.2   0.9
rx remote no sink:     34k    3   25  11k 270k      0.8  65.4   0.0  33.8
rx remote with sink:   41k    3 365k  10k 370k     31.5  47.1   2.3  19.0

6.3-PR besplex (bge) ttcp (hz = 1000 else stathz broken):
                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
local no sink:        540k    0 270k 540k    0     50.5  46.0   3.5
local with sink:      628k    0 417k 210k    0     68.8  27.9   3.3
tx remote no sink:     15k    1 222k 7190    1     28.4  29.3   1.7  40.6
tx remote with sink:  5947    1 315k 2825    1     39.9  14.7   2.6  42.8
rx remote no sink:     13k    1   23 6943    0      0.3  49.5   0.2  50.0
rx remote with sink:   20k    1 371k 6819    0     29.5  30.1   3.9  36.5

8.0-C besplex (bge) ttcp:
                        Csw  Trp  Sys  Int  Sof      Sys  Intr  User  Idle
local no sink:        649k    3 324k  100 649k     53.9  42.9   3.2
local with sink:      649k    3 433k  100 216k     75.2  18.8   6.0
tx remote no sink:     24k    3 432k  10k  100     49.7  41.3   2.4   6.6
tx remote with sink:  3199    3 568k 1580  100     64.3  19.6   4.0  12.2
rx remote no sink:     20k    3   27  10k  100      0.0  46.1   0.0  53.9
rx remote with sink:   31k    3 370k  10k  100     30.7  30.9   4.8  33.5
%%%

Bruce

