FreeBSD IP forwarding performance (question, and some info) [7-stable, current, em, smp]

Paul paul at gtcomm.net
Thu Jul 3 07:26:21 UTC 2008


Bruce Evans wrote:
> On Wed, 2 Jul 2008, Paul wrote:
>
>> ...
>> -----------Reboot with 4096/4096........(my guess is that it will be 
>> a lot worse, more errors..)
>> ........
>> Without polling, 4096 is horrible, about 200kpps less ... :/
>> Turning on polling..
>> polling on, 4096 is bad,
>>           input          (em0)           output
>>  packets  errs      bytes    packets  errs      bytes colls
>>   622379 307753   38587506          1     0        178     0
>>   635689 277303   39412718          1     0        178     0
>> ...
>> ------Rebooting with 256/256 descriptors..........
>> ..........
>> No polling:
>> 843762 25337   52313248          1     0        178     0
>>   763555     0   47340414          1     0        178     0
>>   830189     0   51471722          1     0        178     0
>>   838724     0   52000892          1     0        178     0
>>   813594   939   50442832          1     0        178     0
>>   807303   763   50052790          1     0        178     0
>>   791024     0   49043492          1     0        178     0
>>   768316  1106   47635596          1     0        178     0
>> Machine is maxed and is unresponsive..
>
> That's the most interesting one.  Even 1% packet loss would probably
> destroy performance, so the benchmarks that give 10-50% packet loss
> are uninteresting.
>
But realize that it's outputting all of these packets on em3, and I'm
watching them come out; they are consistent with the packets received
on em0 that netstat shows as 'good' packets.
> All indications are that you are running out of CPU and memory (DMA
> and/or cache fills) throughput.  The above apparently hits both limits
> at the same time, while with more descriptors memory throughput runs
> out first.  1 CPU is apparently barely enough for 800 kpps (is this
> all with UP now?), and I think more CPUs could only be slower, as you
> saw with SMP, especially using multiple em taskqs, since memory traffic
> would be higher.  I wouldn't expect this to be fixed soon (except by
> throwing better/different hardware at it).
>
> The CPU/DMA balance can probably be investigated by slowing down the CPU/
> memory system.
>
I'm using a server Opteron, which supposedly has the best memory
performance of any current CPU.  Opterons also have the largest L1
cache, but a small L2 cache.  Do you think the larger L2 cache on the
Xeon (6 MB for 2 cores) would be better?  I have an Opteron 2222
coming, which is 1 GHz faster, so we will see what happens :>
My NIC is PCIe x4, so there's no bottleneck there.
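
Just to sanity-check that claim, a back-of-the-envelope bus-load figure
(assumptions: PCIe 1.x lanes at 2.5 GT/s with 8b/10b encoding, 64-byte
frames, and descriptor/overhead DMA ignored):

    #include <stdio.h>

    int main(void)
    {
        double lane_gbps = 2.5 * 8.0 / 10.0;  /* 8b/10b: ~2.0 Gbit/s usable per lane */
        double link_gbps = 4.0 * lane_gbps;   /* x4 link: ~8 Gbit/s per direction */
        double pps  = 800e3;                  /* observed forwarding rate */
        double bits = 64.0 * 8.0;             /* minimum-size Ethernet frame */
        double load_gbps = pps * bits / 1e9;

        printf("link %.1f Gbit/s, load %.2f Gbit/s (%.1f%%)\n",
            link_gbps, load_gbps, 100.0 * load_gbps / link_gbps);
        return (0);
    }

At roughly 5% of the link, 800 kpps of small packets is nowhere near
PCIe saturation, which is consistent with the bottleneck being
CPU/memory rather than the bus.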
> You may remember my previous mail about getting higher pps on bge.
> Again, all indications are that I'm running out of CPU, memory, and
> bus throughput too since the bus is only PCI 33MHz.  These interact
> in a complicated way which I haven't been able to untangle.  -current
> is fairly consistently slower than my ~5.2 by about 10%, apparently
> due to code bloat (extra CPU and related extra cache misses).  OTOH,
> like you I've seen huge variations for changes that should be null
> (e.g., disturbing the alignment of the text section without changing
> anything else).  My ~5.2 is very consistent since I rarely change it,
> while -current changes a lot and shows more variation, but with no
> sign of getting near the ~5.2 plateau or even its old peaks.
>
>> Polling ON:
>>         input          (em0)           output
>>  packets  errs      bytes    packets  errs      bytes colls
>>   784138 179079   48616564          1     0        226     0
>>   788815 129608   48906530          2     0        356     0
>>   755555 142997   46844426          2     0        468     0
>>   803670 144459   49827544          1     0        178     0
>>   777649 147120   48214242          1     0        178     0
>>   779539 146820   48331422          1     0        178     0
>>   786201 148215   48744478          2     0        356     0
>>   776013 101660   48112810          1     0        178     0
>>   774239 145041   48002834          2     0        356     0
>>   771774 102969   47850004          1     0        178     0
>>
>> Machine is responsive and has 40% idle CPU.  Why ALWAYS 40%?  I'm
>> really mystified by this..
>
> Is this with hz=2000 and 256/256 and no polling in idle?  40% is easy
> to explain (perhaps incorrectly).  Polling can then read at most 256
> descriptors every 1/2000 second, giving a max throughput of 512 kpps.
> Packets < descriptors in general but might be equal here (for small
> packets).  You seem to actually get 784 kpps, which is too high even
> counting descriptors, but matches if the errors are counted twice
> (784 - 179 - 505 ~= 512).  CPU is getting short too, but 40%
> still happens to be left over after giving up at 512 kpps.  Most of
> the errors are probably handled by the hardware at low cost in CPU by
> dropping packets.  There are other types of errors but none except
> dropped packets is likely.
>
Read above: it's actually transmitting 770 kpps out of em3, so it
can't just be 512 kpps.  I suppose multiple packets can fit in one
descriptor?  I am using VERY small TCP packets.
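
For reference, here is Bruce's ceiling spelled out (a minimal sketch;
hz and rxd are the values used in this test, and as far as I know
em/e1000 hardware never packs several packets into one descriptor -- a
large packet may span several descriptors, but minimum-size packets use
one each):

    #include <stdio.h>

    int main(void)
    {
        int hz  = 2000;   /* kern.hz during the polling test */
        int rxd = 256;    /* hw.em.rxd: descriptors in the receive ring */

        /* One poll per tick, at most rxd descriptors drained per poll. */
        printf("poll-mode input ceiling: %d kpps\n", hz * rxd / 1000);
        return (0);
    }

That prints 512 kpps, which is why the 770 kpps coming out of em3 is
hard to square with these settings.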

>> Every time it maxes out and gets errors, top reports:
>> CPU:  0.0% user,  0.0% nice, 10.1% system, 45.3% interrupt, 44.6% idle
>> pretty much the same line every time
>>
>> 256/256 blows away 4096; it probably fits the descriptors into the
>> CPU's cache lines, while 4096 has too many cache misses and causes
>> worse performance.
>
> Quite likely.  Maybe your systems have memory systems that are weak
> relative to other resources, so that they hit this limit sooner than
> expected.
>
> I should look at my "fixes" for bge, one that changes rxd from 256
> to 512, and one that increases the ifq tx length from txd = 512 to
> about 20000.  Both of these might thrash caches.  The former makes
> little difference except for polling at < 4000 Hz, but I don't
> believe in or use polling.  The latter works around select() for
> write descriptors not working on sockets, so that high-frequency
> polling from userland is not needed to determine a good time to
> retry after ENOBUFS errors.  This is probably only important in pps
> benchmarks.  txd = 512 gives good efficiency in my version of bge,
> but might be too high for good throughput and is mostly wasted in
> distribution versions of FreeBSD.
>
I was thinking of trying FreeBSD 4 or 5, but how would that work with
this new hardware?
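
On the 256-vs-4096 point above, a rough footprint comparison (assuming
the 16-byte legacy e1000 descriptor size; the 2 KB cluster each
descriptor points at is not counted):

    #include <stdio.h>

    int main(void)
    {
        int desc_size = 16;            /* bytes per legacy e1000 descriptor */
        int rings[2] = { 256, 4096 };
        int i;

        for (i = 0; i < 2; i++)
            printf("%4d descriptors = %2d KB per ring\n",
                rings[i], rings[i] * desc_size / 1024);
        return (0);
    }

A 256-entry ring is 4 KB and stays cache-hot; a 4096-entry ring is
64 KB, the size of the Opteron's entire L1 data cache, which would
explain why the bigger rings hurt.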

Thanks

Paul

