cvs commit: src/sys/dev/bge if_bge.c

Sat Dec 23 15:19:30 PST 2006

On Sun, 24 Dec 2006, Oleg Bulyzhin wrote:

>>> We currently make this a lot worse than it needs to be by handing off the 
>>> received packets one at a time, unlocking and relocking for every packet. 
>>> It would be better if the driver's receive interrupt handler would harvest 
>>> all of the incoming packets and queue them locally. Then, at the end, hand 
>>> off the linked list of packets to the network stack wholesale, unlocking 
>>> and relocking only once.  (Actually, the list could probably be handed off 
>>> at the very end of the interrupt service routine, after the driver has 
>>> already dropped its lock.)  We wouldn't even need a new primitive, if 
>>> ether_input() and the other if_input() functions were enhanced to deal 
>>> with a possible list of packets instead of just a single one.
>>
>> I try this experiment every few years, and generally don't measure much 
>> improvement.  I'll try it again with 10gbps early next year once back in 
>> the office again.  The more interesting transition is between the link 
>> layer and the network layer, which is high on my list of topics to look 
>> into in the next few weeks.  In particular, reworking the ifqueue handoff. 
>> The tricky bit is balancing latency, overhead, and concurrency...
>>
>> FYI, there are several sets of patches floating around to modify if_em to 
>> hand off queues of packets to the link layer, etc.  They probably need 
>> updating, of course, since if_em has changed quite a bit in the last year. 
>> In my implementation, I add a new input routine that accepts mbuf packet 
>> queues.
>
> I'm just curious, do you remember the average length of the mbuf queue in 
> your tests? While experimenting with the bge(4) driver (taskqueue, interrupt 
> moderation, converted bge_rxeof() to the above scheme), I've found it's quite 
> easy to exhaust available mbuf clusters under load (trying to queue 
> hundreds of received packets). So I had to limit the rx queue to a rather 
> low length.

Off-hand, I don't remember.  I do remember it being very important to maintain 
bounds on the size of in-flight packet sets at all levels in the stack -- for 
the same reason the netisr dispatch queue is bounded.  Otherwise if the device 
is able to keep the device driver entirely busy, you'll effectively live-lock 
since you never dispatch to the next layer, exhaust available memory, etc, 
etc.  One of the ideas I've been futzing with is "back-pressure" across the 
netisr and a "checkout" model in which the total length of the queue spanning 
device driver and dispatch through to the protocol has a total bound with 
reservations taken by components as they process sets of packets.  In this 
way, the ithread would know the netisr was already in execution and could skip 
the wakeup (and the trip through the scheduler), avoid excessive 
memory consumption, etc.  Ed Maste has also suggested changing our notion of 
mbuf packet queues, as our current queue model requires following linked 
lists, which make inefficient use of CPU caches, and instead using arrays 
of mbuf pointers.  I've done a bit of experimentation along these lines, but 
not enough to investigate the properties well.
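
To make the "checkout" and array-of-pointers ideas concrete, here is a rough 
userspace sketch, not code from any tree.  Every name in it (struct pkt as a 
stand-in for struct mbuf, pkt_budget, pkt_batch, PKT_BATCH_MAX, rx_harvest) is 
made up for illustration, and it ignores locking entirely; a real version would 
need atomics or a lock around the shared budget.

```c
#include <assert.h>
#include <stddef.h>

#define PKT_BATCH_MAX 32        /* hypothetical bound on one harvest pass */

struct pkt;                     /* opaque stand-in for struct mbuf */

/* One total bound spanning driver, dispatch and protocol. */
struct pkt_budget {
	size_t limit;           /* most packets that may be checked out */
	size_t in_flight;       /* packets currently checked out */
};

/* A batch is an array of packet pointers, not a linked list. */
struct pkt_batch {
	struct pkt *pkts[PKT_BATCH_MAX];
	size_t count;
};

/* Reserve up to `want` slots; returns how many were granted. */
static size_t
budget_reserve(struct pkt_budget *b, size_t want)
{
	size_t avail = b->limit - b->in_flight;
	size_t got = want < avail ? want : avail;

	b->in_flight += got;
	return (got);
}

/* Return `n` slots once the packets have left the stack. */
static void
budget_release(struct pkt_budget *b, size_t n)
{
	assert(n <= b->in_flight);
	b->in_flight -= n;
}

/*
 * Driver side: take at most as many packets from the ring as the
 * budget allows, so the in-flight set stays bounded under load.
 * Packets left in the ring are picked up on a later pass.
 */
static size_t
rx_harvest(struct pkt_budget *b, struct pkt **ring, size_t ring_len,
    struct pkt_batch *out)
{
	size_t want = ring_len < PKT_BATCH_MAX ? ring_len : PKT_BATCH_MAX;
	size_t got = budget_reserve(b, want);

	for (size_t i = 0; i < got; i++)
		out->pkts[i] = ring[i];
	out->count = got;
	return (got);
}

/* Consumer side: the batch is done, give its slots back. */
static void
batch_consumed(struct pkt_budget *b, struct pkt_batch *batch)
{
	budget_release(b, batch->count);
	batch->count = 0;
}
```

The point of the sketch is only that the driver asks before it harvests: when 
the budget is spent, rx_harvest() returns 0 and the packets stay in the ring 
rather than piling up in an unbounded list of clusters.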

Robert N M Watson
Computer Laboratory
University of Cambridge