cvs commit: src/sys/dev/bge if_bge.c

Bruce Evans bde at zeta.org.au
Sat Dec 23 17:39:43 PST 2006


On Sat, 23 Dec 2006, Robert Watson wrote:

> On Sat, 23 Dec 2006, John Polstra wrote:
>
>>> That said, dropping and regrabbing the driver lock in the rxeof routine of 
>>> any driver is bad.  It may be safe to do, but it incurs horrible 
>>> performance penalties.  It essentially allows the time-critical, high 
>>> priority RX path to be constantly preempted by the lower priority if_start 
>>> or if_ioctl paths.  Even without this preemption and priority inversion, 
>>> you're doing an excessive number of expensive lock ops in the fast path.

It's not very time-critical or high priority for bge or any other device
that has a reasonably large rx ring.  With a ring size of 512 and an rx
interrupt occurring not too near the end (say halfway), you have 256
packet times in which to finish processing the interrupt.  For normal
1518-byte packets at 1 Gbps, 256 packet times is about 3 ms.  bge's rx
ring size is actually larger than 512 for most hardware.
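
(For reference: a 1518-byte frame is 12144 bits, or about 12.1 us on the
wire at 1 Gbps; counting the 8-byte preamble and 12-byte inter-frame gap
it is closer to 12.3 us per packet, and 256 * 12.3 us is roughly 3.1 ms.)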

>> We currently make this a lot worse than it needs to be by handing off the 
>> received packets one at a time, unlocking and relocking for every packet. 
>> It would be better if the driver's receive interrupt handler would harvest 
>> all of the incoming packets and queue them locally. Then, at the end, hand 
>> off the linked list of packets to the network stack wholesale, unlocking 
>> and relocking only once.  (Actually, the list could probably be handed off 
>> at the very end of the interrupt service routine, after the driver has 
>> already dropped its lock.)  We wouldn't even need a new primitive, if 
>> ether_input() and the other if_input() functions were enhanced to deal with 
>> a possible list of packets instead of just a single one.
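
Roughly, such an rxeof loop might look like this (a simplified sketch:
bge_rx_next() is a made-up helper, and since none of the if_input()
routines take a packet list yet, the final hand-off below is still one
call per packet, just without any locking):

        struct mbuf *m, *head = NULL, **tail = &head;

        BGE_LOCK(sc);
        /* Harvest everything the NIC has completed without unlocking. */
        while ((m = bge_rx_next(sc)) != NULL) {
                *tail = m;
                tail = &m->m_nextpkt;
        }
        BGE_UNLOCK(sc);

        /* Hand the whole chain to the stack after the lock is dropped. */
        while ((m = head) != NULL) {
                head = m->m_nextpkt;
                m->m_nextpkt = NULL;
                (*ifp->if_input)(ifp, m);
        }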

Do a bit more than that and you have reinvented fast interrupt handling
:-).  However, with large buffers the complications of fast interrupt
handling aren't really needed.  A fast interrupt handler would queue
all the packets (taking care not to be blocked by normal spinlocks etc.,
unlike the "fast" interrupt handlers in -current) and then schedule a
low[er] priority thread to finish the handling.  With large buffers, the
lower priority thread can just be scheduled immediately.
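
A rough sketch of that split (purely illustrative: bge_fast_intr(),
bge_rx_task() and the sc->bge_tq / sc->bge_rx_task fields are made up,
not the driver's actual code):

        static void
        bge_fast_intr(void *xsc)
        {
                struct bge_softc *sc = xsc;

                /* Ack/mask the interrupt with a register write only;
                 * take no ordinary locks here. */
                CSR_WRITE_4(sc, BGE_MBX_IRQ0_LO, 1);

                /* Defer all packet processing to a lower priority context. */
                taskqueue_enqueue(sc->bge_tq, &sc->bge_rx_task);
        }

        static void
        bge_rx_task(void *xsc, int pending)
        {
                struct bge_softc *sc = xsc;

                BGE_LOCK(sc);
                bge_rxeof(sc);          /* drain the (large) rx ring */
                BGE_UNLOCK(sc);

                /* Re-enable interrupts. */
                CSR_WRITE_4(sc, BGE_MBX_IRQ0_LO, 0);
        }

The point is that the fast handler touches only device registers and takes
no normal locks, so it can't be blocked by if_start or if_ioctl; all the
mbuf work happens in the deferred, lower priority context.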

> I try this experiment every few years, and generally don't measure much
> improvement.  I'll try it again with 10 Gbps early next year once back in
> the office again.  The more interesting transition is between the link
> layer and the network layer, which is high on my list of topics to look
> into in the next few weeks.  In particular, reworking the ifqueue handoff.
> The tricky bit is balancing latency, overhead, and concurrency...

These are very unbalanced now, so you don't have to worry about breaking
the balance :-).  I normally unbalance to optimize latency (45-60 us ping
latency).

Bruce
