cvs commit: src/sys/dev/em if_em.c if_em.h

Thu Jan 12 13:07:01 PST 2006

Andrew Gallatin wrote:
> Scott Long [scottl at FreeBSD.org] wrote:
> 
>>scottl      2006-01-11 00:30:25 UTC
>>
>>  FreeBSD src repository
>>
>>  Modified files:
>>    sys/dev/em           if_em.c if_em.h 
>>  Log:
>>  Significant performance improvements for the if_em driver:
> 
> 
> Very cool.
> 
> 
>>  - If possible, use a fast interrupt handler instead of an ithread handler.  Use
>>    the interrupt handler to check and squelch the interrupt, then schedule a
>>    taskqueue to do the actual work.  This has three benefits:
>>    - Eliminates the 'interrupt aliasing' problem found in many chipsets by
>>      allowing the driver to mask the interrupt in the NIC instead of the
>>      OS masking the interrupt in the APIC.
> 
> 
> Neat.  Just like Windows..
> 
> <....>
> 
> 
>>    - Don't hold the driver lock in the RX handler.  The handler and all data
>>      associated is effectively serialized already.  This eliminates the cost of
>>      dropping and reacquiring the lock for every received packet.  The result
>>      is much lower contention for the driver lock, resulting in lower CPU usage
>>      and lower latency for interactive workloads.
> 
> 
> This seems orthogonal to using a fastintr/taskqueue, or am I missing 
> something?
> 
> Assuming a system where interrupt aliasing is not a problem, how much
> does using a fastintr/taskqueue change interrupt latency as compared
> to using an ithread?  I would (naively) assume that using an ithread
> would be faster & cheaper.  Or is disabling/enabling interrupts in the
> apic really expensive?
> 

Touching the APIC is tricky.  First, you have to pay the cost of a 
spinlock.  Then you have to pay the cost of at least one read and write 
across the FSB.  Even though the APIC registers are memory mapped, they 
are still uncached.  It's not terribly expensive, but it does add up.
Bypassing this and using a fast interrupt means that you pay the cost of
1 PCI read, which you would have to do anyways with either method, and 1 
PCI write, which will be posted at the host-pci bridge and thus only as 
expensive as an FSB write.  Overall, I don't think that the cost 
difference is a whole lot, but when you are talking about thousands of
interrupts per second, especially if multiple interfaces are running 
under load, it might be important.  And the 750x and 752x chipsets are
so common that it is worthwhile to deal with them (and there are reports
that the aliasing problem is happening on more chipsets than just these 
now).
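
To make the check-and-squelch flow concrete, here is a minimal user-space
sketch.  It is not the if_em code: struct fake_nic, fake_intr_fast() and
fake_task() are hypothetical names, and plain struct fields stand in for
the PCI register accesses.  It assumes a read-to-clear interrupt cause
register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a NIC's memory-mapped registers. */
struct fake_nic {
    uint32_t icr;          /* interrupt cause; reading it clears it */
    bool     intr_enabled; /* models the NIC's own interrupt mask */
    bool     task_pending; /* models a task sitting on a taskqueue */
    unsigned rx_ready;     /* packets waiting in the RX ring */
};

enum { FAST_STRAY, FAST_HANDLED };

/* Fast handler: no driver lock, no sleeping, just check and squelch. */
static int
fake_intr_fast(struct fake_nic *nic)
{
    assert(nic != NULL);
    uint32_t cause = nic->icr;   /* the one "PCI read" */
    nic->icr = 0;                /* read-to-clear semantics (assumed) */
    if (cause == 0)
        return FAST_STRAY;       /* not ours, e.g. a shared vector */
    nic->intr_enabled = false;   /* the one posted "PCI write" */
    nic->task_pending = true;    /* a real driver would enqueue a task */
    return FAST_HANDLED;
}

/* Deferred task: do the real work, then unmask at the NIC. */
static unsigned
fake_task(struct fake_nic *nic)
{
    unsigned done = nic->rx_ready;
    nic->rx_ready = 0;
    nic->task_pending = false;
    nic->intr_enabled = true;    /* the APIC is never touched */
    return done;
}
```

The point of the sketch is that the interrupt is masked and unmasked at
the device, so the APIC mask/unmask cost described above never appears.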

As for latency, the taskqueue runs at the same PI_NET priority as the
ithread would.  I thought that there was an optimization on some 
platforms to encourage quick preemption for ithreads when they are 
scheduled, but I can't find it now.  So, the taskqueue shouldn't be all
that different from an ithread, and it even means that there won't be
any sharing between instances even if the interrupt vector is shared.

Another advantage is that you get adaptive polling for free.  Interface
polling only works well when you have a consistently high workload.  For
spiky workloads, you do get higher latency at the leading edge of the
spike since the polling thread is asleep while waiting for the next
tick.  Trying to estimate workload and latency in the polling loop is a
pain, while letting the hardware trigger you directly is a whole lot
easier.
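
A rough sketch of what "adaptive polling for free" can look like, under
my own assumptions: the task drains up to a budget, asks to be run again
while a backlog remains (polling mode, interrupt still masked), and
unmasks the interrupt only once the ring is empty.  struct poll_state,
poll_task() and the budget parameter are hypothetical, not taken from
the driver.

```c
#include <assert.h>
#include <stdbool.h>

struct poll_state {
    unsigned backlog;       /* packets waiting in the ring */
    bool     intr_enabled;  /* NIC interrupt mask state */
    bool     requeued;      /* task asked to be run again */
};

/* One run of the deferred task; returns the number of packets handled. */
static unsigned
poll_task(struct poll_state *s, unsigned budget)
{
    assert(budget > 0);
    unsigned n = s->backlog < budget ? s->backlog : budget;
    s->backlog -= n;
    if (s->backlog > 0) {
        /* Still busy: keep polling with the interrupt masked. */
        s->requeued = true;
        s->intr_enabled = false;
    } else {
        /* Idle: let the hardware wake us up for the next packet. */
        s->requeued = false;
        s->intr_enabled = true;
    }
    return n;
}
```

Under load this behaves like polling, and at the leading edge of a spike
the first packet still arrives via an interrupt instead of waiting for a
tick.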

However, taskqueues are really just a proof of concept for what I really
want, which is to allow drivers to register both a fast handler and an
ithread handler.  For drivers doing this, the ithread would be private
to the driver and would only be activated if the fast handler signals
it.  Drivers without fast handlers would still get ithreads that would
still act the way they do now.  If an interrupt vector is shared with
multiple handlers, the fast handlers would all get run, but the only
ithreads that would run would be for drivers without a fast handler and
for drivers that signaled for it to run from the fast handler.  Anyways,
John and I have discussed this quite a bit over the last year; we just
need time to implement it.
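
The dispatch rules described above can be sketched in user space as
follows.  Nothing here is an existing kernel interface: struct handler,
dispatch_vector(), the fast_result values and the demo_* handlers are all
hypothetical, and ithreads are modeled as direct function calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SHARED 16   /* arbitrary limit for this sketch */

enum fast_result { FAST_NOT_MINE, FAST_DONE, FAST_WAKE_ITHREAD };

typedef enum fast_result (*fast_fn)(void *arg);
typedef void (*ithread_fn)(void *arg);

struct handler {
    fast_fn    fast;     /* NULL for drivers without a fast handler */
    ithread_fn ithread;  /* NULL for fast-only drivers */
    void      *arg;
};

/* Run one shared vector; returns how many ithread handlers ran. */
static size_t
dispatch_vector(struct handler *h, size_t n)
{
    bool wake[MAX_SHARED] = { false };
    size_t ran = 0;

    assert(n <= MAX_SHARED);
    /* Pass 1: every fast handler on the vector gets run. */
    for (size_t i = 0; i < n; i++) {
        if (h[i].fast != NULL)
            wake[i] = (h[i].fast(h[i].arg) == FAST_WAKE_ITHREAD);
        else
            wake[i] = true;  /* no fast handler: behaves as today */
    }
    /* Pass 2: only the ithreads that were signaled, or have no fast
     * handler in front of them, are run. */
    for (size_t i = 0; i < n; i++) {
        if (wake[i] && h[i].ithread != NULL) {
            h[i].ithread(h[i].arg);
            ran++;
        }
    }
    return ran;
}

/* Demo handlers, for illustration only. */
static enum fast_result demo_fast_idle(void *arg) { (void)arg; return FAST_NOT_MINE; }
static enum fast_result demo_fast_busy(void *arg) { (void)arg; return FAST_WAKE_ITHREAD; }
static void demo_ithread(void *arg) { ++*(int *)arg; }
```

With three handlers on one vector (an idle fast driver, a busy fast
driver, and a driver with no fast handler), only the last two ithreads
run.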

> Do you have a feel for how much of the increase was due to the other
> changes (rx lock, avoiding register reads)?

Both of those do make a difference, but I didn't introduce them into
testing until Andre had already done some tests that showed that the
taskqueue helped.  I don't recall what the difference was, but I think
it was in the low 10% range.  Another thing that I want to do is to get the
tx-complete path to run without a lock.  For if_em, this means killing
the shortcut in em_encap of calling into it to clean up the tx ring.
It also means being careful with updating and checking the tx ring
counters between the two sides of the driver.  But if it can be made to
work then almost all top/bottom contention in the driver can be
eliminated.
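
One way to be "careful with the counters" is a single-producer,
single-consumer ring where each side writes only its own index.  The
sketch below uses C11 atomics in user space and is only an illustration
of that idea, not a design for if_em: struct tx_ring, tx_enqueue() and
tx_clean() are hypothetical names, and the transmit side deliberately
never cleans the ring itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TX_RING_SIZE 8u   /* must be a power of two */

struct tx_ring {
    _Atomic uint32_t prod;  /* written only by the transmit side */
    _Atomic uint32_t cons;  /* written only by the tx-complete side */
    int slots[TX_RING_SIZE];
};

static void
tx_ring_init(struct tx_ring *r)
{
    assert((TX_RING_SIZE & (TX_RING_SIZE - 1)) == 0);
    atomic_init(&r->prod, 0);
    atomic_init(&r->cons, 0);
}

/* Transmit side: queue one packet, or fail if the ring is full. */
static bool
tx_enqueue(struct tx_ring *r, int pkt)
{
    uint32_t prod = atomic_load_explicit(&r->prod, memory_order_relaxed);
    uint32_t cons = atomic_load_explicit(&r->cons, memory_order_acquire);
    if (prod - cons == TX_RING_SIZE)
        return false;   /* full; do not clean from this side */
    r->slots[prod % TX_RING_SIZE] = pkt;
    atomic_store_explicit(&r->prod, prod + 1, memory_order_release);
    return true;
}

/* Completion side: retire up to 'done' packets; returns how many. */
static unsigned
tx_clean(struct tx_ring *r, unsigned done)
{
    uint32_t cons = atomic_load_explicit(&r->cons, memory_order_relaxed);
    uint32_t prod = atomic_load_explicit(&r->prod, memory_order_acquire);
    unsigned avail = prod - cons;   /* unsigned math handles wraparound */
    unsigned n = done < avail ? done : avail;
    atomic_store_explicit(&r->cons, cons + n, memory_order_release);
    return n;
}
```

Because neither index is ever written from both sides, no lock is needed
between the top and bottom halves in this model; whether the real
hardware descriptor ring allows the same split is a separate question.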

Scott