FreeBSD IP Forwarding performance (question, and some info)
[7-stable, current, em, smp]
Bruce Evans
brde at optusnet.com.au
Thu Jul 3 10:42:02 UTC 2008
On Thu, 3 Jul 2008, Paul wrote:
> Bruce Evans wrote:
>>> No polling:
>>> 843762 25337 52313248 1 0 178 0
>>> 763555 0 47340414 1 0 178 0
>>> 830189 0 51471722 1 0 178 0
>>> 838724 0 52000892 1 0 178 0
>>> 813594 939 50442832 1 0 178 0
>>> 807303 763 50052790 1 0 178 0
>>> 791024 0 49043492 1 0 178 0
>>> 768316 1106 47635596 1 0 178 0
>>> Machine is maxed and is unresponsive..
>>
>> That's the most interesting one. Even 1% packet loss would probably
>> destroy performance, so the benchmarks that give 10-50% packet loss
>> are uninteresting.
>>
> But you realize that it's outputting all of these packets on em3 and I'm
> watching them coming out
> and they are consistent with the packets received on em0 that netstat shows
> are 'good' packets.
Well, output is easier. I don't remember seeing the load on a taskq for
em3. If there is a memory bottleneck, it might or might not be more related
to running only 1 taskq per interrupt, depending on how independent the
memory system is for different CPUs. I think Opterons have more independence
here than most x86's.
> I'm using a server opteron which supposedly has the best memory performance
> out of any CPU right now.
> Plus opterons have the biggest l1 cache, but small l2 cache. Do you think
> larger l2 cache on the Xeon (6mb for 2 core) would be better?
> I have a 2222 opteron coming which is 1ghz faster so we will see what happens
I suspect lower latency memory would help more. Big memory systems
have inherently higher latency. My little old A64 workstation and
laptop have main memory latencies 3 times smaller than freebsd.org's
new Core2 servers according to lmbench2 (42 nsec for the overclocked
DDR PC3200 one and 55 for the DDR2 PC5400 (?) one, vs 145-155 nsec).
If there are a lot of cache misses, then the extra 100 nsec can be
important. Profiling of sendto() using hwpmc or perfmon shows a
significant number of cache misses per packet (2 or 10?).
>>> Polling ON:
>>>             input          (em0)           output
>>>    packets  errs      bytes    packets  errs      bytes colls
>>>     784138 179079   48616564         1     0        226     0
>>>     788815 129608   48906530         2     0        356     0
>>> Machine is responsive and has 40% idle cpu.. Why ALWAYS 40%? I'm really
>>> mystified by this..
>>
>> Is this with hz=2000 and 256/256 and no polling in idle? 40% is easy
>> to explain (perhaps incorrectly). Polling can then read at most 256
>> descriptors every 1/2000 second, giving a max throughput of 512 kpps.
>> Packets < descriptors in general but might be equal here (for small
>> packets). You seem to actually get 784 kpps, which is too high even
>> in descriptors, but matches exactly if the errors are counted
>> twice (784 - 179 - 505 ~= 512). CPU is getting short too, but 40%
>> still happens to be left over after giving up at 512 kpps. Most of
>> the errors are probably handled by the hardware at low cost in CPU by
>> dropping packets. There are other types of errors but none except
>> dropped packets is likely.
>>
> Read above, it's actually transmitting 770kpps out of em3 so it can't just be
> 512kpps.
Transmitting is easier, but with polling it's even harder to send faster
than hz * queue_length than it is to receive. This is without polling in idle.
> I was thinking of trying 4 or 5.. but how would that work with this new
> hardware?
Poorly, except possibly with polling in FreeBSD-4. FreeBSD-4 generally
has lower overheads and latency, but is missing important improvements
(mainly tcp optimizations in upper layers, better DMA and/or mbuf
handling, and support for newer NICs). FreeBSD-5 is also missing the
overhead+latency advantage.
Here are some benchmarks (ttcp mainly tests sendto(); 4.10 em needed a
2-line change to support a not-so-new PCI em NIC). Summary:
- my bge NIC can handle about 600 kpps on my faster machine, but only
achieves 300 in 4.10 unpatched.
- my em NIC can handle about 400 kpps on my slower machine, except in
later versions it can receive at about 600 kpps.
- only 6.x and later can achieve near wire throughput for 1500-MTU
packets (81 kpps vs 76 kpps). This depends on better DMA or mbuf
handling... I now remember the details -- it is mainly better mbuf
handling: old versions split the 1500-MTU packets into 2 mbufs and
this causes 2 descriptors per packet, which causes extra software
overheads and even larger overheads for the hardware.
%%%
Results of benchmarks run on 23 Feb 2007:
my~5.2 bge --> ~4.10 em
                        tx                      rx
                   kpps  load%   ips       kpps  load%   ips
ttcp -l5 -u -t      639     98  1660       398*     77    8k
ttcp -l5 -t         6.0    100  3960        6.0      6  5900
ttcp -l1472 -u -t    76     27   395         76     40    8k
ttcp -l1472 -t       51     40   11k         51     26    8k
(*) Same as sender according to netstat -I, but systat -ip shows that
    almost half aren't delivered to upper layers.
my~5.2 bge --> 4.11 em
                        tx                      rx
                   kpps  load%   ips       kpps  load%   ips
ttcp -l5 -u -t      635     98  1650       399*     74    8k
ttcp -l5 -t         5.8    100  3900        5.8      6  5800
ttcp -l1472 -u -t    76     27   395         76     32    8k
ttcp -l1472 -t       51     40   11k         51     25    8k
(*) Same as sender according to netstat -I, but systat -ip shows that
    almost half aren't delivered to upper layers.
my~5.2 bge --> my~5.2 em
                        tx                      rx
                   kpps  load%   ips       kpps  load%   ips
ttcp -l5 -u -t      638     98  1660       394*   100-    8k
ttcp -l5 -t         5.8    100  3900        5.8      9  6000
ttcp -l1472 -u -t    76     27   395         76     46    8k
ttcp -l1472 -t       51     40   11k         51     35    8k
(*) Same as sender according to netstat -I, but systat -ip shows that
    almost half aren't delivered to upper layers. With the em rate
    limit on ips changed from 8k to 80k, about 95% are delivered up.
my~5.2 bge --> 6.2 em
                        tx                      rx
                   kpps  load%   ips       kpps  load%   ips
ttcp -l5 -u -t      637     98  1660        637   100-   15k
ttcp -l5 -t         5.8    100  3900        5.8      8   12k
ttcp -l1472 -u -t    76     27   395         76     36   16k
ttcp -l1472 -t       51     40   11k         51     37   16k
my~5.2 bge --> ~current em-fastintr
                        tx                      rx
                   kpps  load%   ips       kpps  load%   ips
ttcp -l5 -u -t      641     98  1670        641     99    8k
ttcp -l5 -t         5.9    100  2670        5.9      7    6k
ttcp -l1472 -u -t    76     27   395         76     35    8k
ttcp -l1472 -t       52     43   11k         52     30    8k
~6.2 bge --> ~current em-fastintr
                        tx                      rx
                   kpps  load%   ips       kpps  load%   ips
ttcp -l5 -u -t      309     62  1600        309     64    8k
ttcp -l5 -t         4.9    100  3000        4.9      6    7k
ttcp -l1472 -u -t    76     27   395         76     34    8k
ttcp -l1472 -t       54     28  6800         54     30    8k
~current bge --> ~current em-fastintr
                        tx                      rx
                   kpps  load%   ips       kpps  load%   ips
ttcp -l5 -u -t      602    100  1570        602     99    8k
ttcp -l5 -t         5.3    100  2660        5.3      5  5300
ttcp -l1472 -u -t    81#    19   212        81#     38    8k
ttcp -l1472 -t       53     34   11k         53     30    8k
(#) Wire speed to within 0.5%. This is the only kpps in this set of
    benchmarks that is close to wire speed. Older kernels apparently
    lose relative to -current because mbufs for mtu-sized packets are
    not contiguous in older kernels.
Old results:
~4.10 bge --> my~5.2 em
                        tx                      rx
                   kpps  load%   ips       kpps  load%   ips
ttcp -l5 -u -t      n/a    n/a   n/a        346     79    8k
ttcp -l5 -t         n/a    n/a   n/a        5.4     10  6800
ttcp -l1472 -u -t   n/a    n/a   n/a         67     40    8k
ttcp -l1472 -t      n/a    n/a   n/a         51     36    8k
~4.10 kernel, =4 bge --> ~current em
                        tx                      rx
                   kpps  load%   ips       kpps  load%   ips
ttcp -l5 -u -t      n/a    n/a   n/a        347     96   14k
ttcp -l5 -t         n/a    n/a   n/a        5.8     10   14k
ttcp -l1472 -u -t   n/a    n/a   n/a         67     62   14k
ttcp -l1472 -t      n/a    n/a   n/a         52     40   16k
~4.10 kernel, =4+ bge --> ~current em
                        tx                      rx
                   kpps  load%   ips       kpps  load%   ips
ttcp -l5 -u -t      n/a    n/a   n/a        627    100    9k
ttcp -l5 -t         n/a    n/a   n/a        5.6      9   13k
ttcp -l1472 -u -t   n/a    n/a   n/a         68     63   14k
ttcp -l1472 -t      n/a    n/a   n/a         54     44   16k
%%%
%%%
Results of benchmarks run on 28 Dec 2007:
~5.2 epsplex (em) ttcp:
                       Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:        825k    3  206k   229  412k   52.1  45.1   2.8
local with sink:      659k    3  263k   231  131k   66.5  27.3   6.2
tx remote no sink:     35k    3  273k  8237  266k   42.0  52.1   2.3   3.6
tx remote with sink:   26k    3  394k  8224   100   60.0  5.41   3.4  11.2
rx remote no sink:     25k    4    26  8237  373k   20.6  79.4   0.0   0.0
rx remote with sink:   30k    3  203k  8237  398k   36.5  60.7   2.8   0.0
6.3-PR besplex (em) ttcp:
                       Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:        417k    1  208k  418k     2   49.5  48.5   2.0
local with sink:      420k    1  276k  145k     2   70.0  23.6   6.4
tx remote no sink:     19k    2  250k  8144     2   58.5  38.7   2.8   0.0
tx remote with sink:   16k    2  361k  8336     2   72.9  24.0   3.1   4.4
rx remote no sink:     429    3    49   888     2    0.3 99.33   0.0   0.4
rx remote with sink:   13k    2  316k  5385     2   31.7  63.8   3.6   0.8
8.0-C epsplex (em-fast) ttcp:
                       Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:        442k    3  221k   230  442k   47.2  49.6   2.7
local with sink:      394k    3  262k   228  131k   72.1  22.6   5.3
tx remote no sink:     17k    3  226k  7832   100   94.1   0.2   3.0   0.0
tx remote with sink:   17k    3  360k  7962   100   91.7   0.2   3.7   4.4
rx remote no sink:    saturated -- cannot update systat display
rx remote with sink:   15k    6  358k  8224   100   97.0   0.0   2.5   0.5
~4.10 besplex (bge) ttcp:
                       Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:          15    0  425k   228    11   96.3   0.0   3.7
local with sink:        **    0  622k   229    **   94.7   0.3   5.0
tx remote no sink:      29    1  490k  7024    11   47.9  29.8   4.4  17.9
tx remote with sink:    26    1  635k  1883    11   65.7  11.4   5.6  17.3
rx remote no sink:       5    1    68  7025     1    0.0  47.3   0.0  52.7
rx remote with sink:  6679    2  365k  6899    12   19.7  29.2   2.5  48.7
~5.2-C besplex (bge) ttcp:
                       Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:          1M    3  271k   229  543k   50.7  46.8   2.5
local with sink:        1M    3  406k   229  203k   67.4  28.2   4.4
tx remote no sink:     49k    3  474k   11k  167k   52.3  42.7   5.0   0.0
tx remote with sink:  6371    3  641k  1900   100   76.0  16.8   6.2   0.9
rx remote no sink:     34k    3    25   11k  270k    0.8  65.4   0.0  33.8
rx remote with sink:   41k    3  365k   10k  370k   31.5  47.1   2.3  19.0
6.3-PR besplex (bge) ttcp (hz = 1000 else stathz broken):
                       Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:        540k    0  270k  540k     0   50.5  46.0   3.5
local with sink:      628k    0  417k  210k     0   68.8  27.9   3.3
tx remote no sink:     15k    1  222k  7190     1   28.4  29.3   1.7  40.6
tx remote with sink:  5947    1  315k  2825     1   39.9  14.7   2.6  42.8
rx remote no sink:     13k    1    23  6943     0    0.3  49.5   0.2  50.0
rx remote with sink:   20k    1  371k  6819     0   29.5  30.1   3.9  36.5
8.0-C besplex (bge) ttcp:
                       Csw  Trp   Sys   Int   Sof    Sys  Intr  User  Idle
local no sink:        649k    3  324k   100  649k   53.9  42.9   3.2
local with sink:      649k    3  433k   100  216k   75.2  18.8   6.0
tx remote no sink:     24k    3  432k   10k   100   49.7  41.3   2.4   6.6
tx remote with sink:  3199    3  568k  1580   100   64.3  19.6   4.0  12.2
rx remote no sink:     20k    3    27   10k   100    0.0  46.1   0.0  53.9
rx remote with sink:   31k    3  370k   10k   100   30.7  30.9   4.8  33.5
%%%
Bruce