Interrupts + Polling mode (similar to Linux's NAPI)

Thu Apr 23 19:12:18 UTC 2009

On Fri, Mar 27, 2009 at 11:05:00AM +0000, Andrew Brampton wrote:

> 2009/3/27 Luigi Rizzo <rizzo at iet.unipi.it>:
> > The load of polling is pretty low (within 1% or so) even with
> > polling. The advantage of having interrupts is faster response
> > to incoming traffic, not CPU load.
> 
> oh, I was under the impression that polling spun in a tight loop, thus
> using 100% of the processor. After a quick test I see this is not the
> case. I assume it will get to 100% CPU load if I saturate my network.

Yes, polling has a limit on the maximum CPU time it will use, and it
uses less than that limit when there is little or no traffic.

There are a number of sysctls under kern.polling that control its
behaviour:

* kern.polling.user_frac: Desired user fraction of cpu time

This attempts to reserve at least the specified percentage of available
CPU time for user processes; polling tries to limit its own CPU use to
100 minus this value (in percent).

* kern.polling.burst: Current polling burst size
* kern.polling.burst_max: Max Polling burst size
* kern.polling.each_burst: Max size of each burst

These three control the number of packets that polling processes per
call / tick.  Packets are processed in batches of each_burst, up to
burst packets total per tick.  The value of burst is capped at
burst_max.
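
To make the relationship between the three values concrete, here is a
rough sketch in plain userland C.  It is not the kernel code; the
function names next_batch() and passes_per_tick() are made up for
illustration, and the only thing it models is "batches of each_burst,
up to burst packets per tick, with burst capped at burst_max".

```c
#include <assert.h>

/*
 * Hypothetical helper: size of the next batch, given how many packets
 * of this tick's budget are still unused.
 */
int
next_batch(int remaining, int each_burst)
{
	assert(each_burst > 0);
	return (remaining < each_burst ? remaining : each_burst);
}

/*
 * Hypothetical helper: how many polling passes one tick needs to work
 * through a budget of 'burst' packets in batches of 'each_burst'.
 */
int
passes_per_tick(int burst, int burst_max, int each_burst)
{
	int remaining, passes;

	assert(each_burst > 0);

	/* burst is never allowed to exceed burst_max. */
	remaining = (burst > burst_max) ? burst_max : burst;

	passes = 0;
	while (remaining > 0) {
		remaining -= next_batch(remaining, each_burst);
		passes++;
	}
	return (passes);
}
```

With the example values burst = 150 and each_burst = 5 (picked for the
arithmetic, not quoted from any particular system), this works out to
30 passes in a single tick.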

In order to keep the user_frac CPU percentage available for non-polling
work, a feedback loop controls the value of burst.  Each time a batch of
packets is processed, burst is incremented or decremented by 1,
depending on how much CPU time polling actually used.  In addition, if
polling extends beyond the next tick, burst is scaled back to 7/8ths of
its current value.
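
As a sketch of that feedback loop (again userland C, not the kernel
source): adjust_burst() and its parameters are hypothetical names, the
load is assumed to be available as a percentage, and applying the 7/8
scaling after the +/-1 step is my assumption about ordering.

```c
#include <assert.h>

/*
 * Hypothetical model of the burst feedback step.
 *
 *   poll_load_pct  CPU percentage polling actually used (assumed input)
 *   user_frac      percentage to leave for user processes
 *   overran_tick   nonzero if polling ran past the next tick
 */
int
adjust_burst(int burst, int burst_max, int poll_load_pct, int user_frac,
    int overran_tick)
{
	assert(burst_max >= 1);
	assert(user_frac >= 0 && user_frac <= 100);

	/* Step by one, depending on whether polling stayed in budget. */
	if (poll_load_pct > 100 - user_frac) {
		if (burst > 1)
			burst--;
	} else if (burst < burst_max) {
		burst++;
	}

	/* Overrun: drop to roughly 7/8 (integer arithmetic). */
	if (overran_tick)
		burst -= burst / 8;

	/* Keep the result within [1, burst_max]. */
	if (burst < 1)
		burst = 1;
	if (burst > burst_max)
		burst = burst_max;
	return (burst);
}
```

Note how slowly this converges: from a burst of 80, one overrun takes it
to about 70, but recovering those packets takes one tick per increment.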

Polling was originally implemented as a livelock-avoidance technique
for the single-core case -- the primary goal is to guarantee the
availability of CPU time specified in user_frac.  The current algorithm
does not behave that well if user_frac is set low.  Setting it low is
reasonable if the workload is largely in-kernel (i.e., bridging or
routing), or when running SMP.

Another downside of the current implementation is that interfaces will
be polled multiple times per tick (burst / each_burst times), even if
there are no packets to process.

At work we've developed a replacement polling algorithm that keeps track
of the actual amount of time spent per packet, and uses that as the
feedback to control the number of packets in each batch.

This work requires a change to the polling KPI: the polling handlers
have to return the count of packets actually handled.  My hope is to get
the KPI change committed in time for 8.0, even if we don't switch the
algorithm yet.  Attilio (on CC:) and I will make the patch set for the
KPI change available shortly for comment.
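
To illustrate why the handlers need to report a packet count, here is
one way time-per-packet feedback could look.  This is only a sketch
under my own assumptions, not the patch itself: struct poll_stats,
poll_stats_update(), next_batch_size(), the nanosecond units and the
7/8 smoothing are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-interface state: smoothed cost of one packet. */
struct poll_stats {
	uint64_t ns_per_pkt;	/* 0 means "no measurement yet" */
};

/*
 * Fold one measured pass into the estimate.  'handled' is the count a
 * handler would return under the changed KPI.
 */
void
poll_stats_update(struct poll_stats *st, uint64_t elapsed_ns, int handled)
{
	uint64_t sample;

	assert(st != NULL);
	if (handled <= 0)
		return;		/* nothing processed, nothing to learn */

	sample = elapsed_ns / (uint64_t)handled;
	if (st->ns_per_pkt == 0)
		st->ns_per_pkt = sample;
	else	/* assumed smoothing: 7/8 old value, 1/8 new sample */
		st->ns_per_pkt = (st->ns_per_pkt * 7 + sample) / 8;
}

/* Pick a batch size that should fit in the given time budget. */
int
next_batch_size(const struct poll_stats *st, uint64_t budget_ns,
    int max_batch)
{
	uint64_t n;

	assert(st != NULL && max_batch >= 1);
	if (st->ns_per_pkt == 0)
		return (1);	/* no data yet: start small */

	n = budget_ns / st->ns_per_pkt;
	if (n < 1)
		n = 1;
	if (n > (uint64_t)max_batch)
		n = (uint64_t)max_batch;
	return ((int)n);
}
```

Without the returned count there is nothing to divide the elapsed time
by, which is the reason the KPI change has to come first.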

-Ed