Polling tuning and performance

Alan Amesbury amesbury at umn.edu
Fri Dec 15 17:39:02 PST 2006

Bruce, thanks for taking time to read and reply.  For brevity, I've
removed my own earlier writings, (usually) annotating what's missing.

Bruce Evans wrote:

[snip - PREEMPTION stuff]
> It's needed to prevent packet loss without polling.  It probably makes
> little difference with polling (if the machine is mostly handling
> network traffic and that only by polling).

I should've noted in my original posting that 'vmstat' also reports very
little activity in the various paging columns: faults, pages in/out,
reclaims, freed, and pages scanned usually sit very close to or at zero.
Disk operations as reported by 'vmstat' also sit almost completely at zero.

The (extremely busy) interface carries exclusively incoming traffic,
received promiscuously.  Since that provides enough clues about what this
box might actually be doing, I'll give away the secret: it's running Snort.

> I don't believe in POLLING or HZ=1000, but recently tested them with
> bge.  I am unhappy to report that my fine-tuned interrupt handling
> still loses to polling by a few percent for efficiency.  I am happy
> to report that polling loses to interrupt handling by a lot for
> correctness -- polling gives packet loss.  Polling also loses big for
> latency, except with idle_poll and the system actually idle, when it
> wins a little.

How are you benchmarking this?

> AUTO_EOI_1 has little effect unless the system gets lots of interrupts,
> so with most interrupts avoided by using polling it has little effect.
>> As mentioned above, this host is running FreeBSD/amd64, so there's no
>> need to remove support for I586_CPU, et al; that stuff was never there
>> in the first place.
> AUTO_EOI_1 is also only used in non-apic mode, but non-apic mode is very
> unusual for amd64 so AUTO_EOI_1 probably has no effect for you.

Good to know.  "No effect" is still acceptable.  I just didn't want to
cause "negative effect."  :-)

[snip - Broken FreeBSD RFC1323/PAWS support at high HZ]
> I think there are old PRs about this.  Even 1000 is too large (?).

We noticed it when 'scrub all tcp reassemble' in FreeBSD 6.x's PF
started tossing packets.  The problem (mostly?) went away when we
dropped from HZ=2000 to HZ=1000, so we considered that a marginally
acceptable work-around for this FreeBSD bug.  However, since we a) have
gigabit-connected PF firewalls; b) want to consider following the advice
in NOTES about HZ=2000 for busy firewalls; and c) really prefer to run
off stock FreeBSD source unless absolutely impossible, we're sort of
interested in seeing a fix for RFC1323 get officially applied to FreeBSD.
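For anyone following along, HZ can be changed without rebuilding the kernel
via a loader tunable; a minimal sketch (the values shown are just the ones
discussed above, not a recommendation):

```shell
# Set HZ via a loader tunable; takes effect at the next boot.
echo 'kern.hz="1000"' >> /boot/loader.conf

# Equivalent custom-kernel route:
#   options HZ=1000

# Confirm the running value after reboot:
sysctl -n kern.clockrate
```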

About a year ago I pointed out that a patch had been submitted.  A
committer acknowledged it, but said he wanted to do it differently.
Since I'm not really in a position to pay and don't have a more
acceptable patch of my own to submit, I've not really squawked about it.

>> Since I've not seen word on a correction for this being added to
>> FreeBSD, I've limited HZ to 1000.
> HZ = 100 gives interesting behaviour.  Of course, it doesn't work, since
> polling depends on polling often enough.  Any particular value of HZ can
> only give polling often enough for a very limited range of systems.  1000
> is apparently good for 100Mbps and not too bad for 1Gbps, provided the
> hardware has enough buffering, but with enough buffering polling is
> not really needed.

Well, I'm not exactly tied to polling.  I just tried it as an
alternative and, for at least part of the time, it's performed better
than non-polling.  I'm open to alternatives; I just want as close to
zero loss as possible.
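For reference, switching between the two modes is cheap to experiment with;
a hedged sketch assuming FreeBSD 6.x-style per-interface polling and a
kernel built with DEVICE_POLLING:

```shell
# Enable polling on the busy interface (kernel needs: options DEVICE_POLLING)
ifconfig bge1 polling

# Revert to interrupt-driven operation for an A/B comparison:
ifconfig bge1 -polling

# Dump all the polling tunables discussed in this thread:
sysctl kern.polling
```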

[snip - "I've read polling(4) and it says..."]
> I can (easily) generate only 250 kpps on input and had to increase
> kern.polling.burst_max to > 250 to avoid huge packet lossage at this
> rate.  It doesn't seem to work right for output, since I can (easily)
> generate 340 kpps output and got that with a burst max of only 150;
> it should have got only 150 kpps.  Output is faster at the lowest level
> (but slower at higher levels), so doing larger bursts of output might
> be intentional.  However, output at 340 kpps gives a system load of
> 100% on the test machine (which is not very fast or SMP) no matter
> how it is done (polling just makes it go 2% faster), so polling is not
> doing its main job very well.  Polling's main job is to prevent
> network activity from using 100% CPU.  Large values of
> kern.polling.burst_max are fundamentally incompatible with polling
> doing this.  On my test system, a burst max of 1000 combined with HZ
> = 1000 would just ask the driver alone to use 100% of the CPU doing
> 1000 kpps through a single device.  "Fortunately", the device can't
> go that fast, so plenty of CPU is left.

That's for sending, right?  In this case that's not an issue.  I simply
have incoming traffic with MTUs of up to 9216 bytes that I want to
*receive*.  Never mind the fact that bge(4) and the underlying hardware
suck in that they can't do that (although there's apparently a WinDOS
driver that can do it on the same hardware?!).  Again, my focus is on
sucking in packets as fast as possible with minimal loss.

[snip - watching kern.polling.burst values]
> Is it really dynamic?  I see 1000's too, but for sending at only 340 kpps.
> Almost all bursts should have size 340.   With a max of 150, burst is
> 150 too but 340 kpps are still sent.

I haven't tested sending.  kern.polling.burst tends to hang at whatever
kern.polling.burst_max is set to.
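A quick way to watch that behavior is to sample both values together; a
minimal sketch:

```shell
# Print burst and burst_max side by side once a second; if burst is
# genuinely dynamic it should float below the max as load varies.
while :; do
    sysctl -n kern.polling.burst kern.polling.burst_max | tr '\n' ' '
    echo
    sleep 1
done
```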

[snip - writing kernel patches exceeds my expertise]
> There may be a fix in an old PR.

I'll look again.

[snip - load hovers at 1]
> Polling in idle eats all the CPU.  Polling in idle is very wasteful (mainly
> of power) unless the system can rarely be idle anyway, but then polling
> in idle doesn't help much.

This system is expected to NEVER be idle... except if it loses power.  :-)

[snip - other system stats]
> These are only small interrupt loads.  bge always generates about 6667
> interrupts per second (under all loads except none or tiny) because it
> is programmed to use interrupt moderation with a timeout of 150uS and
> some finer details.  This gives behaviour very similar to polling at a
> frequency of 6667 Hz.  The main differences between this and polling at
> 1000 Hz are:
> - 6667 Hz works better for correctness (lower latency, fewer dropped
>   packets for missed polls)
> - 6667 Hz has higher overheads (only a few percent)
> - interrupts have lower overheads if nothing is happening so you don't
>   actually get them at 6667 Hz
> - the polling given by interrupt moderation is dumb.  It doesn't have
>   any of the burst max controls, etc. (but could easily).  It doesn't
>   interact with other devices (but could uneasily).
> bge can easily be reprogrammed to use interrupt moderation with a
> timeout of 1000uS, so interrupt mode works more like polling at 1000Hz.
> This immediately gives the main disadvantage of polling (latency of
> 1000uS unless polling in idle and the system is actually idle at least
> once every 1000uS).  bge has internal (buffering) limits which have
> similar effects to the burst limit.   The advantages of polling are
> not easily gained in this way (especially for rx).

If I understand you correctly, it sounds like I'd be better off without
polling, particularly if there are *any* buffer limitations in the
Broadcom hardware.  Again, it's not idle; the lowest packet receive rate
I've seen lately is around 40 Kpkt/sec, and the lowest ever recorded was
around 16 Kpkt/sec.

>>     * With polling on, kern.polling.burst_max=150:
>>       - kern.polling.burst holds at 150
>>       - 'vmstat 5' shows context switches hold around 2600, with
>>         interrupts holding around 30K
> I think you mean `systat -vmstat 5'.  The interrupt count here is bogus.

No, I mean 'vmstat 5'.  I just let it dump a line every five seconds and
watch what happens.  Context switches and interrupts are both shown.
The 'systat' version, in this case, is harder for me to read; it also
lacks the scrolling history of 'vmstat'.  Sample output taken while
writing this (note that the first line is almost always bogus and sorry
if wrap is borked):

% vmstat 5
 procs      memory      page                   disk   faults      cpu
 r b w     avm    fre  flt  re  pi  po  fr  sr ad4   in   sy  cs us sy id
 2 0 0 1898784 1256124   13   0   0   0  12   0   0  647  291 552  8 15 78
 1 0 0 1898784 1256124    1   0   0   0   0   0   0 183135   97 2432  9  4 87
 1 0 0 1898784 1256124    0   0   0   0   0   0   0 183370  116 2423 11  5 84
 1 0 0 1898784 1256124    0   0   0   0   0   0   0 183455  100 2454  8  5 87
 1 0 0 1898784 1256124    0   0   0   0   0   0   0 170236  105 2437  8  4 88
 0 1 0 1898784 1256124    0   0   0   0   0   0   0 183183  108 2469 10  5 84


	* Polling enabled on the high traffic interface
	* kern.polling.user_frac=20
	* kern.polling.burst_max=1000

> It is mostly for software interrupts that mostly don't do much because
> they coalesce with old ones.  Only ones that cause context switches are
> relevant, and there is no counter for those.  Most of the context switches
> are to the poll routine (1000 there and 1000 back).
>>       - 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
>>         doesn't increase!), other rates stay the same (looks like
>>         possible display bugs in 'vmstat -i' here!)
> Probably just averaging.

See, I'm not sure about that.  I thought that the whole point of polling
was to avoid interrupts.  Since the total count doesn't increase for
bge1 in 'vmstat -i' output, I interpreted it as a bug.

>>       - CPU load holds at 1, but CPU idle time usually stays >95%
> I saw heavy polling reduce the idle time significantly here.  I think
> the CPU idle time can be very biased here under light loads.  The times
> shown by top(1) are unbiased.

As mentioned before, though, this system is expected to NEVER be idle,
so a fast polling loop shouldn't be a liability.

[snip - more stats; "room for improvement?"]
> Sorry, no ideas about tuning polling parameters (I don't know them well
> since I don't believe in polling :-).  You apparently have everything tuned
> almost as well as possible, and the only possibilities for future
> improvements are avoiding the 5% (?) extra overhead for !polling and
> the packet loss for polling.
> I see the following packet loss for polling with HZ=1000, burst_max=300,
> idle_poll=1:
> %%%
>             input         (bge0)           output
>    packets  errs      bytes    packets  errs      bytes colls
>     242999     1   14579940          0     0          0     0
>     235496     0   14129760          0     0          0     0
>     236930  3261   14215800          0     0          0     0
>     237816  3400   14268960          0     0          0     0
>     240418  3211   14425080          0     0          0     0
> %%%

Well, I guess I'm doing OK, then.  With the same settings as above:

amesbury at scoop % netstat -I bge1 -w 5
            input         (bge1)           output
   packets  errs      bytes    packets  errs      bytes colls
    614710     0  513122698          0     0          0     0
    662633     0  556662669          0     0          0     0
    639052     0  530704135          0     0          0     0
    706713     0  576938553          0     0          0     0
    690495     0  554269218          0     0          0     0
    682868     0  560234712          0     0          0     0
    692268     0  562487939          0     0          0     0
    680498     0  549782169          0     0          0     0

Then again, it's after 1830 on a Friday afternoon and traffic loads have
dropped a bit, so it's quite possible I'm not seeing anything dropped
here simply because of the relatively lighter load.
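To put a number on loss over a longer window, the errs column can be
reduced to a percentage; a sketch assuming the `netstat -I bge1 -w 5`
layout above (two header lines, then packets and errs as the first two
columns):

```shell
# Turn each netstat interval line into an input-loss percentage.
netstat -I bge1 -w 5 | awk '
    NR > 2 && $1 ~ /^[0-9]+$/ {
        pkts = $1; errs = $2
        total = pkts + errs
        pct = (total > 0) ? 100 * errs / total : 0
        printf "%d pkts, %d errs, %.3f%% loss\n", pkts, errs, pct
    }'
```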

> The packet losses of 3+K always occur when I hit Caps Lock.  This also
> happens without polling unless PREEMPTION is configured.  It is caused
> by low-quality code for setting the LED for Caps Lock combined with
> thread priorities and/or their scheduling not working right.  In the
> interrupt-driven case, the thread priorities are correct (bgeintr >
> syscons) and configuring PREEMPTION fixes the scheduling.  In the polling
> case, the thread priorities are apparently incorrect.  Polling probably
> needs to have its own thread running at the same priority as bgeintr
> (> syscons), but I think it mainly uses the network SWI thread (<
> syscons).  With idle_poll=1, it also uses its idlepoll thread, but
> that has very low priority so it cannot help in cases like this.  The
> code for setting LEDs busy-waits for several mS which is several polling
> periods.  It must be about 13mS to lose 3200 packets when packets
> are arriving at 240 kpps.
> With a network server you won't be hitting Caps Lock a lot but have to
> worry about other low-quality interrupt handlers busy-waiting for several
> mS.
> The loss of a single packet in the above happens more often than I can
> explain:
> - with polling, it happens a lot
> - without polling but with PREEMPTION, it happens a lot when I press
>   Caps Lock but not otherwise.
> The problem might not be packet loss.  bge has separate statistics for
> packet loss but the net layer counts all input errors together.

Fortunately this machine doesn't even have a keyboard attached, so
there'll be no Caps games on it.  :-)

In spite of the momentary 0% loss, do you think switching to an em(4),
sk(4), or other card might help?  The bge(4) interfaces are integrated
PCIe, and I think only PCI-X slots are available.
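If it helps, pciconf can confirm exactly what's on the bus before shopping
for a replacement card; a minimal sketch:

```shell
# List PCI devices with vendor/device names; show the network controllers.
pciconf -lv | grep -B3 -i network
```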

Again, thanks for the sanity checking and additional information.

Alan Amesbury
OIT Security and Assurance
University of Minnesota

More information about the freebsd-performance mailing list