em network issues

Scott Long scottl at samsco.org
Wed Oct 25 19:14:42 UTC 2006


Doug Ambrisko wrote:
> John Polstra writes:
> | On 19-Oct-2006 Scott Long wrote:
> | > The performance measurements that Andre and I did early this year showed
> | > that the INTR_FAST handler provided a very large benefit.
> | 
> | I'm trying to understand why that's the case.  Is it because an
> | INTR_FAST interrupt doesn't have to be masked and unmasked in the
> | APIC?  I can't see any other reason for much of a performance
> | difference in that driver.  With or without INTR_FAST, you've got
> | the bulk of the work being done in a background thread -- either the
> | ithread or the taskqueue thread.  It's not clear to me that it's any
> | cheaper to run a task than it is to run an ithread.
> | 
> | A difference might show up if you had two or more em devices sharing
> | the same IRQ.  Then they'd share one ithread, but would each get their
> | own taskqueue thread.  But sharing an IRQ among multiple gigabit NICs
> | would be avoided by anyone who cared about performance, so it's not a
> | very interesting case.  Besides, when you first committed this
> | stuff, INTR_FAST interrupts were not sharable.
> | 
> | Another change you made in the same commit (if_em.c revision 1.98)
> | greatly reduced the number of PCI writes made to the RX ring consumer
> | pointer register.  That would yield a significant performance
> | improvement.  Did you see gains from INTR_FAST even without this
> | independent change?
> 
> Something that we've fixed locally in at least one version is:
>      1)	Limit the loop in em_intr to 3 iterations.
>      2)	Pass a valid value to em_process_receive_interrupts/em_rxeof,
> 	a good value like 100 instead of -1, since this is the count
> 	of how many times to iterate over the rx ring.  Seems this
> 	got lost in some change of APIs.
>      3)	In em_process_receive_interrupts/em_rxeof always decrement
> 	the count on every run through the loop.  If you notice,
> 	count is an int that starts at the passed-in value
> 	of -1.  It then does count-- until count == 0.  Doing -1, -2, -3
> 	takes a while until the int rolls over to 0.  Passing 100
> 	limits it more :-)  So this can run 3 * 100 versus
> 	infinite * int rollover, assuming we don't skip a count--.
> Doing these changes made our multiple em-based machines a lot happier
> when slammed with traffic, without starving other things that shared
> interrupts, like other em cards (especially in 4.X).  Interrupt handlers
> should have limits on how long they can run before letting
> someone else go.  We use this in 6.X as well and haven't had any problems
> with our configs that use this.  We haven't tested much without these
> changes, since we needed to fix other issues and this is now a non-issue for us.
> 
> I haven't pushed this more since, when I first found issue 1, the concept
> was rejected ... my machine hung in the interrupt spin loop :-(
> 
> If someone wants to examine/play with it more then that's great.
> These issues (I think they are bugs) have been in there a while.
> 
> That's my 2 cents.
> 
> Doug A.
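
For reference, here is a rough stand-alone model of the three changes
described above.  It is not the real if_em.c code: the struct, the
"_model" functions, and the constants are made up purely to illustrate
the bounded interrupt loop and the finite rx count.

#include <stdio.h>

#define MAX_INTR_LOOPS	3	/* item 1: cap the em_intr loop */
#define RX_BUDGET	100	/* item 2: a real count instead of -1 */

struct adapter_model {
	int	rx_backlog;	/* frames waiting in the rx ring */
};

/*
 * Item 3: decrement the budget on every pass so the loop always
 * terminates, instead of starting at -1 and counting down until the
 * int wraps around to 0.  Returns the number of frames handled.
 */
static int
em_rxeof_model(struct adapter_model *sc, int count)
{
	int done = 0;

	while (count-- > 0 && sc->rx_backlog > 0) {
		sc->rx_backlog--;	/* "process" one frame */
		done++;
	}
	return (done);
}

static void
em_intr_model(struct adapter_model *sc)
{
	int loops;

	/* At most MAX_INTR_LOOPS * RX_BUDGET frames per interrupt. */
	for (loops = 0; loops < MAX_INTR_LOOPS; loops++) {
		if (em_rxeof_model(sc, RX_BUDGET) == 0)
			break;		/* ring drained, stop early */
	}
}

int
main(void)
{
	struct adapter_model sc = { .rx_backlog = 1000 };

	em_intr_model(&sc);
	printf("backlog left for the next interrupt: %d\n", sc.rx_backlog);
	return (0);
}

The point is simply that each interrupt handles at most
MAX_INTR_LOOPS * RX_BUDGET frames and then returns, leaving the rest
for the next interrupt.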

When I was first developing and testing the INTR_FAST patches, I did a
similar thing with limiting the loop.  I can't recall why I dropped that
(or if it was even me that dropped it).  I think it's generally a good
idea to have.  One concern that I've had with the whole
INTR_FAST/taskqueue scheme is that having the rx loop be unbounded could
cause a livelock on UP.  In fact, I'm pretty sure that the performance
measurements done with the SmartBits included having the loop be bounded.
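
To make that concern concrete, here is a similar stand-alone model,
again with made-up names rather than the actual driver code.  A handler
that does a fixed budget of work per run and asks to be run again when
the ring still has traffic behaves like the bounded loop; an unbounded
drain-the-ring loop in its place would never give a UP box back to the
other threads.

#include <stdbool.h>
#include <stdio.h>

#define RX_BUDGET	100

static int rx_backlog = 1000;	/* frames waiting in the rx ring */

/* Bounded handler: returns true if it should be scheduled again. */
static bool
rxtx_task_model(void)
{
	int count = RX_BUDGET;

	while (count-- > 0 && rx_backlog > 0)
		rx_backlog--;		/* "process" one frame */
	return (rx_backlog > 0);	/* more work left: reschedule */
}

static void
other_work_model(void)
{
	printf("other thread ran, rx backlog now %d\n", rx_backlog);
}

int
main(void)
{
	/*
	 * Crude round-robin stand-in for the scheduler: because the rx
	 * task gives the CPU back after each budget, the other work
	 * still gets to run while traffic keeps arriving.
	 */
	bool again = true;

	while (again) {
		again = rxtx_task_model();
		other_work_model();
	}
	return (0);
}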

Scott


