em network issues

Bruce Evans bde at zeta.org.au
Thu Oct 19 06:20:58 UTC 2006


On Wed, 18 Oct 2006, Scott Long wrote:

[too much quoted; much deleted]

> Bruce Evans wrote:
>> On Wed, 18 Oct 2006, Kris Kennaway wrote:
>> 
>>> I have been working with someone's system that has em shared with fxp,
>>> and a simple fetch over the em (e.g. of a 10 GB file of zeroes) is
>>> enough to produce watchdog timeouts after a few seconds.
>>> 
>>> As previously mentioned, changing the INTR_FAST to INTR_MPSAFE in the
>>> driver avoids this problem.  However, others are seeing sporadic
>>> watchdog timeouts at higher system load on non-shared em systems too.
>> 
>> em_intr_fast() has no locking whatsoever.  I would be very surprised
>> if it even seemed to work for SMP.  For UP, masking of CPU interrupts
>> (as is automatic in fast interrupt handlers) might provide sufficient
>> locking, ...

I barely noticed the point about it being shared.  With sharing, and
probably especially with fast and normal interrupt handlers sharing an
IRQ, locking is needed even more.  There are many possibilities for
races.  One likely one is:
- em interrupt task running.  Device interrupts are disabled, so the
   task thinks it won't be interfered with by the em interrupt handler.
- shared fxp interrupt.  The em interrupt handler is called.  Without
   any explicit synchronization, bad things may happen and apparently do.
   In the UP case, there is some implicit synchronization which may help
   but is hard to understand.

>> This is safe, as above.
>> 
>> %     /*
>> %      * Mask interrupts until the taskqueue is finished running.  This is
>> %      * cheap, just assume that it is needed.  This also works around the
>> %      * MSI message reordering errata on certain systems.
>> %      */
>> %     em_disable_intr(adapter);
>> 
>> Now that we start doing things, we have various races.
>> 
>> The above races with other entries to this interrupt handler to disable
>> interrupts, and may also race with other parts of the driver.
>> 
>> After we disable driver interrupts, there should be no more races with
>> other entries to this handler.  However, reg_icr may be stale at this
>> point even if we handled the race correctly.  The other entries may have
>> partly or completely handled the interrupt by the time we get back here.
>> We should have locked just before here, and then, if the lock blocked
>> waiting for the other entries (which can only happen in the SMP case),
>> we should reread the status register to see if we still have anything
>> to do, or more importantly to see what we have to do now (extra
>> scheduling of the SWI handler would just waste time, but missed
>> scheduling would break things).
>> 
>> %     taskqueue_enqueue(adapter->tq, &adapter->rxtx_task);
>> %
>> 
>> Safe provided the API is correctly implemented.  (AFAIK, the API only
>> has huge design errors.)
>> 
>> %     /* Link status change */
>> %     if (reg_icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))
>> %         taskqueue_enqueue(taskqueue_fast, &adapter->link_task);
>> %
>> 
>> As above, plus might miss this call if the status changed underneath us.
>> 
>> %     if (reg_icr & E1000_ICR_RXO)
>> %         adapter->rx_overruns++;
>> 
>> Race updating the counter.
>> 
>> Generally, fast interrupt handlers should avoid book-keeping like this,
>> since correct locking for it would poison large parts of the driver
>> with the locking required for the fast interrupt handler.
>
> This should be an atomic add.  It doesn't require any extra
> synchronization beyond that.  It is, however, only an informational
> counter, and one that I forgot to fix.

It should be under the same locking that you've already paid for.
Mutex spinlocking is (relatively) cheap.  Not cheap enough for me
though -- part of my gripe about the API above is that its
implementation requires lots of locking, so it cannot be implemented
very efficiently.

>> Perhaps
>> similarly for important things.  It's safe to read the status register
>> provided reading it doesn't change it.  Then it is safe to schedule
>> tasks based on the contents of the register provided we don't do
>> anything else and schedule enough tasks.  But don't disable interrupts
>> -- leave that to the task and make the task do nothing if it handled
>> everything for a previous scheduling. 
>
> There is no other way to quiet the interrupt line from the device.  Some

Oops, the interrupt must be quieted somewhere before the task is reached.
One other way is to not use tasks but to turn the interrupt handler into
a non-fast one if it decides that it has more to do than the fast part
of it can do.  This can be implemented more efficiently than the above,
but has the disadvantage that shared IRQs remain disabled for all other
devices.

> devices will quiet their interrupt if you read the ISR or write back to
> the ISR.  The e1000 chip doesn't seem to be like this, and instead
> relies on waiting for you to clear the rx and tx rings.  i.e. it won't
> clear until you do real work, and you can't do real work in the
> INTR_FAST handler.  So, if you remove the em_disable_intr, you'll get an
> interrupt storm in between this handler running and the taskqueue
> completing.  That obviously won't work.
>
> If there is a better way to quiet the interrupt here, please let me
> know.  My conversation with Intel about this 9 months ago lead me to
> believe that my method here is correct (modulo what I'll say below).
> Note that the other INTR_FAST drivers I've done, namely aac and uhci,
> have hardware that does allow you to silence the interrupt from the
> handler without having to do expensive work.

I think there would be little difference if the hardware disabled
interrupts automatically on a read of the status.  The read and its
side effect should be atomic, but the status may be changed underneath
you immediately by the em task running concurrently, and later the
problem of reenabling the interrupt is exactly the same.

>> This would result in the task
>> usually being scheduled when the interrupt is for us but not if it is
>> for another device.  The above doesn't try to do much more than this.
>> However, a fast interrupt handler needs to handle the usual case to
>> be worth having except on systems where there are lots of shared
>> interrupts.
>
> The performance measurements that Andre and I did early this year showed
> that the INTR_FAST handler provided a very large benefit.  However,
> there are 2 things that I need to fix here:

I see how the taskqueue might win by often handling multiple interrupts
for a single invocation of the taskqueue function.  I'm surprised that
this happens for em.  If this is what is happening, then using a normal
interrupt handler kicking a task queue should give almost the same
benefits -- a normal interrupt handler takes a few microseconds longer,
but with enough coalescing to give large benefits there won't be many
interrupts.

> ...
> 2) Protect top-half threads that want to call em_intr_disable/enable from the 
> interrupt handler and taskqueue.  IIRC, linux does this with
> a simple semaphore-like count.  That means extra atomic ops and probably
> spinlocks, neither of which would be very good for this driver in FreeBSD. 
> My plan here instead is to have the top-half callers set a 'paused' flag and 
> drain the taskqueue before calling em_disable_intr.
> That way they know that there are no tasks scheduled or waiting on a lock 
> release or in progress, and thus no chance that em_enable_intr will
> be called behind their back.  I know that there is still a tricky race
> here, so I haven't implemented it yet.  In any case, quiescing the
> interrupt and taskqueue is expensive, but it only needs to be done for
> fairly expensive and infrequent conditions (ioctls, unload/detach).

You put the task scheduling after the interrupt disabling.  That should be
enough to ensure that the task runs later and reenables the interrupt.
Apparently, it isn't.  Now I can't see any problems except this not
working.

Bruce

