Polling tuning and performance

Fri Dec 15 06:18:04 PST 2006

On Thu, 14 Dec 2006, Alan Amesbury wrote:

> ...
> What I'm aiming for, of course, is zero packet loss.  Realizing that's
> probably impossible for this system given its load, I'm trying to do
> what I can to minimize loss.
> ...
> 	* PREEMPTION disabled - /sys/conf/NOTES says this helps with
> 	  interactivity.  I don't care about interactive performance
> 	  on this host.

It's needed to prevent packet loss without polling.  It probably makes
little difference with polling (if the machine is mostly handling
network traffic and that only by polling).

> 	* Most importantly, HZ=1000, and DEVICE_POLLING and
> 	  AUTO_EOI_1 are included.  (AUTO_EOI_1 was added because
> 	  /sys/amd64/conf/NOTES says this can save a few microseconds
> 	  on some interrupts.  I'm not worried about suspend/resume, but
> 	  definitely want speed, so it got added.

I don't believe in POLLING or HZ=1000, but recently tested them with
bge.  I am unhappy to report that my fine-tuned interrupt handling
still loses to polling by a few percent for efficiency.  I am happy
to report that polling loses to interrupt handling by a lot for
correctness -- polling gives packet loss.  Polling also loses big for
latency, except with idle_poll and the system actually idle, when it
wins a little.

AUTO_EOI_1 has little effect unless the system gets lots of interrupts,
so with most interrupts avoided by using polling it has little effect.

> As mentioned above, this host is running FreeBSD/amd64, so there's no
> need to remove support for I586_CPU, et al; that stuff was never there
> in the first place.

AUTO_EOI_1 is also only used in non-apic mode, but non-apic mode is very
unusual for amd64 so AUTO_EOI_1 probably has no effect for you.

> As mentioned above, I've got HZ set to 1000.  Per /sys/amd64/conf/NOTES,
> I'd considered setting it to 2000, but have discovered previously that
> FreeBSD's RFC1323 support breaks.  I documented this on -hackers last year:
>
> http://lists.freebsd.org/pipermail/freebsd-hackers/2005-December/014829.html

I think there are old PRs about this.  Even 1000 is too large (?).

> Since I've not seen word on a correction for this being added to
> FreeBSD, I've limited HZ to 1000.

HZ = 100 gives interesting behaviour.  Of course, it doesn't work, since
polling depends on polling often enough.  Any particular value of HZ can
only give polling often enough for a very limited range of systems.  1000
is apparently good for 100Mbps and not too bad for 1Gbps, provided the
hardware has enough buffering, but with enough buffering polling is
not really needed.
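
To put rough numbers on "often enough", here is a sketch of how many
packets can pile up between two polls.  It assumes worst-case
minimum-size Ethernet frames (84 bytes on the wire, counting preamble
and inter-frame gap).  The function names are made up for illustration,
and Python is used only as a calculator; nothing here comes from
kern_poll.c.

```python
# Back-of-envelope only: worst case assumes minimum-size Ethernet frames.
# 64-byte frame + 8 bytes preamble/SFD + 12 bytes inter-frame gap.
WIRE_BYTES_MIN_FRAME = 64 + 8 + 12


def max_pps(link_bps, wire_bytes=WIRE_BYTES_MIN_FRAME):
    """Highest packet rate the link can carry at the given frame size."""
    return link_bps / (wire_bytes * 8)


def packets_per_poll(link_bps, hz):
    """Packets that can arrive between two polls; the hardware must
    buffer this many if none are to be dropped."""
    return max_pps(link_bps) / hz


if __name__ == "__main__":
    for name, bps in (("100Mbps", 100e6), ("1Gbps", 1e9)):
        for hz in (100, 1000, 2000):
            n = packets_per_poll(bps, hz)
            print(f"{name} HZ={hz}: ~{n:.0f} packets per poll")
```

Under these assumptions HZ=1000 gives about 149 packets per poll at
100Mbps, which lines up with the default burst max of 150, and about
1488 at 1Gbps.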

> After reading polling(4) a couple times, I set kern.polling.burst_max to
> 1000.  The manpage says that "each interface can receive at most (HZ *
> burst_max) packets per second", and the default setting is 150, which is
> described as "adequate for 100Mbit network and HZ=1000."  I figured,
> "Hey, gigabit, how about ten times the default?" but that's prevented by
> "#define MAX_POLL_BURST_MAX 1000" in /sys/kern/kern_poll.c.

I can (easily) generate only 250 kpps on input and had to increase
kern.polling.burst_max to > 250 to avoid huge packet lossage at this
rate.  It doesn't seem to work right for output, since I can (easily)
generate 340 kpps output and got that with a burst max of only 150, when
I should have got only 150 kpps.  Output is faster at the lowest level
(but slower at higher levels), so doing larger bursts of output might
be intentional.  However, output at 340 kpps gives a system load of
100% on the test machine (which is not very fast or SMP) no matter
how it is done (polling just makes it go 2% faster), so polling is not
doing its main job very well.  Polling's main job is to prevent
network activity from using 100% CPU.  Large values of
kern.polling.burst_max are fundamentally incompatible with polling
doing this.  On my test system, a burst max of 1000 combined with HZ
= 1000 would just ask the driver alone to use 100% of the CPU doing
1000 kpps through a single device.  "Fortunately", the device can't
go that fast, so plenty of CPU is left.
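
The ceiling that polling(4) documents is plain arithmetic.  A minimal
sketch of it follows, using only the HZ * burst_max formula quoted from
the manpage.  The helper names are hypothetical, not kernel symbols.

```python
import math


def rx_ceiling_pps(hz, burst_max):
    """polling(4): each interface can receive at most HZ * burst_max
    packets per second."""
    return hz * burst_max


def min_burst_max(target_pps, hz):
    """Smallest burst_max whose ceiling reaches target_pps
    (hypothetical helper, for illustration only)."""
    return math.ceil(target_pps / hz)


if __name__ == "__main__":
    hz = 1000
    for burst_max in (150, 250, 1000):
        print(f"burst_max={burst_max}: ceiling {rx_ceiling_pps(hz, burst_max)} pps")
    for pps in (250_000, 340_000):
        print(f"{pps} pps needs burst_max >= {min_burst_max(pps, hz)}")
```

With HZ=1000 the default of 150 caps input at 150 kpps, 250 kpps needs
a burst max of at least 250, and a burst max of 1000 permits 1000 kpps.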

> In theory that might've been good enough, but polling(4) says that
> kern.polling.burst is "[the] [m]aximum number of packets grabbed from
> each network interface in each timer tick.  This number is dynamically
> adjusted by the kernel, according to the programmed user_frac,
> burst_max, CPU speed, and system load."  I keep seeing
> kern.polling.burst hit a thousand, which leads me to believe that
> kern.polling.burst_max needs to be higher.
>
> For example:
>
> 	secs since
> 	  epoch	      kern.polling.burst
> 	----------    ------------------
> 	1166133997       1000
> ...

Is it really dynamic?  I see 1000's too, but for sending at only 340 kpps.
Almost all bursts should have size 340.   With a max of 150, burst is
150 too but 340 kpps are still sent.

> Unfortunately, that appears to be only possible through a) patching
> /sys/kern/kern_poll.c to allow larger values; or b) setting HZ to 2000,
> as indicated in one of the NOTES, which will effectively hose certain
> TCP connectivity because of the RFC1323 breakage.  Looked at another
> way, both essentially require changes to source code, the former being
> fairly obvious, and the latter requiring fixes to the RFC1323 support.
> Either way, I think that's a bit beyond my abilities; I have NO
> illusions about my kernel h4cking sk1llz.

There may be a fix in an old PR.

> Other possibly relevant data points:
>
> 	* System load hovers right around 1.

Polling in idle eats all the CPU.  Polling in idle is very wasteful (mainly
of power) unless the system can rarely be idle anyway, but then polling
in idle doesn't help much.

> 	* The system has almost zero disk activity.
>
> 	* With polling off:
>
> 	  - 'vmstat 5' consistently shows about 13K context switches
> 	    and ~6800 interrupts
> 	  - 'vmstat -i' shows 2K interrupts per CPU, consistently 6286
> 	    for bge1, and near zero for everything else
> 	  - CPU load drops to 0.4-0.8, but CPU idle time sits around 80%

These are only small interrupt loads.  bge always generates about 6667
interrupts per second (under all loads except none or tiny) because it
is programmed to use interrupt moderation with a timeout of 150uS and
some finer details.  This gives behaviour very similar to polling at a
frequency of 6667 Hz.  The main differences between this and polling at
1000 Hz are:
- 6667 Hz works better for correctness (lower latency, fewer dropped
   packets for missed polls)
- 6667 Hz has higher overheads (only a few percent)
- interrupts have lower overheads if nothing is happening so you don't
   actually get them at 6667 Hz
- the polling given by interrupt moderation is dumb.  It doesn't have
   any of the burst max controls, etc. (but could easily).  It doesn't
   interact with other devices (but could uneasily).

bge can easily be reprogrammed to use interrupt moderation with a
timeout of 1000uS, so interrupt mode works more like polling at 1000Hz.
This immediately gives the main disadvantage of polling (latency of
1000uS unless polling in idle and the system is actually idle at least
once every 1000uS).  bge has internal (buffering) limits which have
similar effects to the burst limit.   The advantages of polling are
not easily gained in this way (especially for rx).
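
The equivalence between a moderation timeout and a polling frequency
is just a reciprocal.  A small sketch, with made-up function names, and
assuming the device interrupts once per timeout under steady load:

```python
def moderated_rate_hz(timeout_us):
    """Interrupts per second when the device fires once per timeout."""
    return 1_000_000 / timeout_us


def poll_period_us(hz):
    """Time between polls, which also bounds the added latency when
    nothing polls in idle."""
    return 1_000_000 / hz


if __name__ == "__main__":
    for timeout_us in (150, 1000):
        print(f"timeout {timeout_us}uS -> ~{moderated_rate_hz(timeout_us):.0f} Hz")
    print(f"HZ=1000 -> poll every {poll_period_us(1000):.0f}uS")
```

A 150uS timeout works out to about 6667 Hz, and a 1000uS timeout to
1000 Hz, the same period as polling with HZ=1000.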

> 	* With polling on, kern.polling.burst_max=150:
>
> 	  - kern.polling.burst holds at 150
> 	  - 'vmstat 5' shows context switches hold around 2600, with
> 	    interrupts holding around 30K

I think you mean `systat -vmstat 5'.  The interrupt count here is bogus.
It is mostly for software interrupts that mostly don't do much because
they coalesce with old ones.  Only ones that cause context switches are
relevant, and there is no counter for those.  Most of the context switches
are to the poll routine (1000 there and 1000 back).

> 	  - 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
> 	    doesn't increase!), other rates stay the same (looks like
> 	    possible display bugs in 'vmstat -i' here!)

Probably just averaging.

> 	  - CPU load holds at 1, but CPU idle time usually stays >95%

I saw heavy polling reduce the idle time significantly here.  I think
the CPU idle time can be very biased here under light loads.  The times
shown by top(1) are unbiased.

> 	* With polling on, kern.polling.burst_max=1000:
>
> 	  - kern.polling.burst is frequently 1000 and almost always >850
> 	  - 'vmstat 5' shows context switches unchanged, but interrupts
> 	    are 150K-190K
> 	  - 'vmstat -i' unchanged from burst_max=150
> 	  - CPU load and CPU idle time very similar to burst_max=150
>
> So, with all that in mind.....  Any ideas for improvement?  Apologies in
> advance for missing the obvious.  'dmesg' and kernel config are attached.

Sorry, no ideas about tuning polling parameters (I don't know them well
since I don't believe in polling :-).  You apparently have everything tuned
almost as well as possible, and the only possibilities for future
improvements are avoiding the 5% (?) extra overhead for !polling and
the packet loss for polling.

I see the following packet loss for polling with HZ=1000, burst_max=300,
idle_poll=1:

%%%
             input         (bge0)           output
    packets  errs      bytes    packets  errs      bytes colls
     242999     1   14579940          0     0          0     0
     235496     0   14129760          0     0          0     0
     236930  3261   14215800          0     0          0     0
     237816  3400   14268960          0     0          0     0
     240418  3211   14425080          0     0          0     0
%%%

The packet losses of 3+K always occur when I hit Caps Lock.  This also
happens without polling unless PREEMPTION is configured.  It is caused
by low-quality code for setting the LED for Caps Lock combined with
thread priorities and/or their scheduling not working right.  In the
interrupt-driven case, the thread priorities are correct (bgeintr >
syscons) and configuring PREEMPTION fixes the scheduling.  In the polling
case, the thread priorities are apparently incorrect.  Polling probably
needs to have its own thread running at the same priority as bgeintr
(> syscons), but I think it mainly uses the network SWI thread (<
syscons).  With idle_poll=1, it also uses its idlepoll thread, but
that has very low priority so it cannot help in cases like this.  The
code for setting LEDs busy-waits for several mS which is several polling
periods.  It must be about 13mS to lose 3200 packets when packets
are arriving at 240 kpps.
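
The 13mS estimate can be reproduced as follows.  This is a rough
sketch that assumes every packet arriving during the stall is lost
(i.e., it ignores whatever the hardware manages to buffer); the
function names are made up for illustration.

```python
def stall_ms(lost_packets, arrival_pps):
    """Length of a stall, in milliseconds, that would lose this many
    packets at the given arrival rate, assuming no buffering."""
    return 1000.0 * lost_packets / arrival_pps


def polls_missed(lost_packets, arrival_pps, hz):
    """The same stall expressed in polling periods."""
    return stall_ms(lost_packets, arrival_pps) * hz / 1000.0


if __name__ == "__main__":
    ms = stall_ms(3200, 240_000)
    n = polls_missed(3200, 240_000, 1000)
    print(f"stall ~{ms:.1f} mS, ~{n:.1f} polling periods at HZ=1000")
```

For 3200 packets at 240 kpps this gives about 13.3 mS, or roughly 13
polling periods at HZ=1000.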

With a network server you won't be hitting Caps Lock a lot but have to
worry about other low-quality interrupt handlers busy-waiting for several
mS.

The loss of a single packet in the above happens more often than I can
explain:
- with polling, it happens a lot
- without polling but with PREEMPTION, it happens a lot when I press
   Caps Lock but not otherwise.
The problem might not be packet loss.  bge has separate statistics for
packet loss but the net layer counts all input errors together.

Bruce