Polling tuning and performance

Fri Dec 15 23:11:31 PST 2006

On Fri, 15 Dec 2006, Alan Amesbury wrote:

> Bruce Evans wrote:

> ...
> The (extremely busy) interface is exclusively incoming traffic, received
> promiscuously.  Since that's provided enough clues as to what this box
> might actually be doing, I'll give away the secret:  It's running snort.
> :-)
>
>> I don't believe in POLLING or HZ=1000, but recently tested them with
>> bge.  ...

> How are you benchmarking this?

Just by blasting packets, usually with ttcp.

> ...
> Well, I'm not exactly tied to polling.  I just tried it as an
> alternative and, for at least part of the time, it's performed better
> than non-polling.  I'm open to alternatives; I just want as close to
> zero loss as possible.

Polling is not working acceptably for me at all.  I'm testing on the
same network and machine that are serving nfs/udp.  Apparently, with
polling there is an i/o error every few seconds even under light loads,
and of course errors are especially bad for nfs/udp (nfs seems to
recover but takes about 1 minute).

> ...
> [snip - "I've read polling(4) and it says..."]
>> I can (easily) generate only 250 kpps on input and had to increase
>> kern.polling.burst_max to > 250 to avoid huge packet lossage at this
>> rate.  It doesn't seem to work right for output, since I can (easily)
>> generate 340 kpps output and got that with a burst max of only 15, when
>> I should have got only 150 kpps.  Output is faster at the lowest level
>> (but slower at higher levels), so doing larger bursts of output might
>> be intentional.  However, output at 340 kpps gives a system load of
>> 100% on the test machine (which is not very fast or SMP), no matter
>> how it is done (polling just makes it go 2% faster), so polling is not
>> doing its main job very well.  Polling's main job is to prevent
>> network activity from using 100% CPU.  Large values of
>> kern.polling.burst_max are fundamentally incompatible with polling
>> doing this.  On my test system, a burst max of 1000 combined with HZ
>> = 1000 would just ask the driver alone to use 100% of the CPU doing
>> 1000 kpps through a single device.  "Fortunately", the device can't
>> go that fast, so plenty of CPU is left.
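The ceiling arithmetic in the quoted paragraph can be written down as a
small sketch.  It assumes the simple model used there: a device is polled
once per clock tick (HZ times per second) and handles at most
kern.polling.burst_max packets per poll, ignoring idle_poll and any
batching inside the driver.  The function names are made up for
illustration and are not part of any FreeBSD interface.

```python
import math


def polling_ceiling_pps(hz, burst_max):
    """Upper bound on packets/second for one device under the simple model:
    one poll per clock tick, at most burst_max packets per poll."""
    return hz * burst_max


def min_burst_max(target_pps, hz):
    """Smallest burst_max whose ceiling reaches target_pps (same model)."""
    return math.ceil(target_pps / hz)


if __name__ == "__main__":
    # HZ = 1000 with burst_max = 1000 allows up to 1000 kpps per device.
    print(polling_ceiling_pps(1000, 1000))
    # Receiving 250 kpps at HZ = 1000 needs a burst_max of at least 250.
    print(min_burst_max(250_000, 1000))
```

Under this model a burst_max of 150 at HZ = 1000 caps a device at 150 kpps,
which is why an observed 340 kpps on output looks inconsistent with the
configured limit.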
>
> That's for sending, right?  In this case that's not an issue.  I simply
> have incoming traffic with MTUs of up to 9216 bytes that I want to
> *receive*.  Never mind the fact that bge(4) and the underlying hardware
> sucks in that it can't do that (although there's apparently a WinDOS
> driver that can do it on the same hardware?!).  Again, my focus is on
> sucking in packets as fast as possible with minimal loss.

Some bge hardware certainly supports jumbo frames.  Half of mine can, and
the other half is documented not to.

> ...
> If I understand you correctly, it sounds like I'd be better off without
> polling, particularly if there are *any* buffer limitations in the
> Broadcom hardware.  Again, it's not idle; the lowest recorded packet
> receive rate I've seen lately is around 40Kpkt/sec.  The lowest recorded
> rate was around 16Kpkt/sec.

No, you seem to have the fairly specialized but common application where
polling currently works better, except for the problem with packet loss
which we don't completely understand but seems to be related to thread
priorities.

>>>     * With polling on, kern.polling.burst_max=150:
>>>
>>>       - kern.polling.burst holds at 150
>>>       - 'vmstat 5' shows context switches hold around 2600, with
>>>         interrupts holding around 30K
>>
>> I think you mean `systat -vmstat 5'.  The interrupt count here is bogus.
>
> No, I mean 'vmstat 5'.  I just let it dump a line every five seconds and
> watch what happens.  Context switches and interrupts are both shown.
> The 'systat' version, in this case, is harder for me to read; it also
> lacks the scrolling history of 'vmstat'.  Sample output taken while
> writing this (note that the first line is almost always bogus and sorry
> if wrap is borked):

Ah, I forgot that I fixed some interrupt counting only in -current to
get a useful interrupt count in vmstat.  Software interrupts are still
put in the global interrupt count (but not in the software interrupt
count) in RELENG_6.  This makes them show up in vmstat output, and in
many configurations they dominate the global count so this count becomes
unrelated to the actual interrupt load.  In -current they are counted
as software interrupts only.  systat -vmstat reports interrupt counts
in finer detail so it is possible to determine various subcounts by
adding or subtracting the other counts.

>> ...
>>>       - 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
>>>         doesn't increase!), other rates stay the same (looks like
>>>         possible display bugs in 'vmstat -i' here!)
>>
>> Probably just averaging.
>
> See, I'm not sure about that.  I thought that the whole point of polling
> was to avoid interrupts.  Since the total count doesn't increase for
> bge1 in 'vmstat -i' output, I interpreted it as a bug.

It's probably just the bogus software interrupt count.  Apparently, polling
generates 20-30 software interrupts per poll.  I don't know why it
generates so many, but the context switch count shows that most of them
don't generate a context switch, so most of them don't take much time.
Both software interrupts and hardware interrupts are currently counted
when they are requested, not when they are delivered.  This is dubious but
works out OK for hardware interrupts only.  For hardware interrupts,
even requests have a large overhead so requests that will coalesce
should be counted somewhere, but for software interrupts, requests have
a low overhead so the only reason to count requests that will coalesce
is to find and fix callers that make them.  I think that for hardware
interrupts, requests that will coalesce are rare in practice since the
first request blocks subsequent ones.
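A rough cross-check of the "20-30 software interrupts per poll" figure,
using the 'vmstat 5' numbers quoted earlier (about 30K interrupts and 2600
context switches per second).  This is only a sketch: it assumes HZ=1000 on
the reporting machine, one poll per tick, and that nearly all of the
counted interrupts are software interrupt requests caused by polling.  The
helper name is hypothetical.

```python
def per_poll(rate_per_second, hz):
    """Average number of events per poll, assuming one poll per clock tick."""
    return rate_per_second / hz


HZ = 1000                    # assumption: not stated for this vmstat sample
INTERRUPTS_PER_S = 30_000    # "interrupts holding around 30K"
CONTEXT_SWITCHES_PER_S = 2600

if __name__ == "__main__":
    print(per_poll(INTERRUPTS_PER_S, HZ))        # counted requests per poll
    print(per_poll(CONTEXT_SWITCHES_PER_S, HZ))  # context switches per poll
```

On these assumptions there are about ten times as many counted requests as
context switches per poll, which fits the observation that most requests
do not cause a context switch.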

>> I see the following packet loss for polling with HZ=1000, burst_max=300,
>> idle_poll=1:
>>
>> %%%
>>             input         (bge0)           output
>>    packets  errs      bytes    packets  errs      bytes colls
>>     242999     1   14579940          0     0          0     0
>>     235496     0   14129760          0     0          0     0
>>     236930  3261   14215800          0     0          0     0
>>     237816  3400   14268960          0     0          0     0
>>     240418  3211   14425080          0     0          0     0
>> %%%
>
> Well, I guess I'm doing OK, then.  With the same settings as above:
>
> amesbury at scoop % netstat -I bge1 -w 5
>            input         (bge1)           output
>   packets  errs      bytes    packets  errs      bytes colls
>    614710     0  513122698          0     0          0     0
>    662633     0  556662669          0     0          0     0
>    639052     0  530704135          0     0          0     0
>    706713     0  576938553          0     0          0     0
>    690495     0  554269218          0     0          0     0
>    682868     0  560234712          0     0          0     0
>    692268     0  562487939          0     0          0     0
>    680498     0  549782169          0     0          0     0
> ^C

Yes, I used -w 1 so my pps is about twice as much as yours, but I also use
tiny packets so as to get that high rate on low-end hardware, and that gives
a bandwidth that is about 1/8 of yours.
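To make the comparison concrete, the sketch below converts the first input
row of each netstat table above into per-second figures.  It assumes that
each row holds totals for one reporting interval (1 second for the bge0
table, 5 seconds for the bge1 table); the function and key names are
illustrative only.

```python
def rates(packets, errs, nbytes, interval_s):
    """Convert one `netstat -w <interval_s>` input row to per-second figures."""
    return {
        "pps": packets / interval_s,
        "bytes_per_s": nbytes / interval_s,
        "avg_packet_bytes": nbytes / packets,
        "errs_per_packet": errs / packets,
    }


# First input row of each table quoted above.
bge0 = rates(packets=242999, errs=1, nbytes=14579940, interval_s=1)
bge1 = rates(packets=614710, errs=0, nbytes=513122698, interval_s=5)

if __name__ == "__main__":
    print("bge0:", bge0)
    print("bge1:", bge1)
    print("pps ratio bge0/bge1:", bge0["pps"] / bge1["pps"])
    print("bandwidth ratio bge0/bge1:",
          bge0["bytes_per_s"] / bge1["bytes_per_s"])
```

The bge0 rows work out to 60 bytes per packet, while the bge1 rows average
several hundred bytes per packet, which is how a higher packet rate ends up
as a much lower byte rate.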

> Then again, it's after 1830 on a Friday afternoon, so traffic loads have
> dropped a bit, so it's quite possible I'm not seeing anything dropped
> here because of this relatively lighter load.

Problems are certainly more likely with higher pps.  140 kpps is quite
small.  I can almost reach that with tiny packets on a 100Mbps network.

> In spite of the momentary 0% loss, do you think switching to an em(4),
> sk(4), or other card might help?  The bge(4) interfaces are integrated
> PCIe, and I think only PCI-X slots are available.

I believe em is (only slightly?) better but haven't used it.  The bus
matters most unless the card is really stupid.

Bruce