6.2 SHOWSTOPPER - em completely unusable on 6.2

Scott Long scottl at samsco.org
Wed Sep 27 09:33:18 PDT 2006


Oliver Brandmueller wrote:
> Hi,
> 
> On Wed, Sep 27, 2006 at 08:55:53AM -0700, Jeremy Chadwick wrote:
> 
>>>The SMBus Interface is not used at all (it's not even really usable). 
>>>Anyway, as soon as I unload the ichsmb module I cannot trigger the 
>>>problem anymore. If I load it again, the problem can again be triggered 
>>>by a buildworld. Statistical relevance: I did 4 buildworlds, alternating 
>>>the load/unload of ichsmb - both times with ichsmb loaded I saw 3 
>>>watchdog timeouts while the buildworld was running, whereas with ichsmb 
>>>not loaded I did not see a single watchdog timeout. The use of the 
>>>interface was around the same during all the time (constant NFS traffic 
>>>of around 1-2 MBit/s).
>>
>>Interesting find.  For what it's worth -- I too load the appropriate
>>smbus drivers on the system with the "em0 problem" (loading smbus and
>>ichsmb).  That system is a single processor / single core system, with
>>HT disabled in the BIOS (which doesn't matter since FreeBSD disables
>>it anyways).  Kernel is non-SMP.  Only reason I mention this is:
>>
>>
>>>The UP/SMP idea seems to be only of interest, because on an UP machine 
>>>it's more likely to share interrupts than on SMP machines, it has 
>>>nothing to do with the fact of UP or SMP itself.
> 
> 
> I don't think it has to do especially with ichsmb here, but only with the 
> fact that ichsmb is for me exactly the thing that shares the interrupt 
> with the em interface that shows the problems.
> 
> - Oliver
> 

My theory here is that something in the kernel, likely VM/VFS, is
holding the Giant lock for an inordinate amount of time.  During this
time, an interrupt fires on the shared em/ichsmb interrupt.  The em
interrupt handler runs and schedules a task to handle the event.  Then
the system blocks the interrupt at the PIC and schedules the ichsmb
ithread.  However, as soon as this ithread tries to run, it gets blocked
on the Giant lock that is held elsewhere.  While it is blocked, the
interrupt stays masked at the PIC, blocking out both ichsmb and em
device interrupts.  Normally the interrupt would be unmasked after the
ithread has run, but until the ithread unblocks, this cannot happen.
This goes on long enough that pending transactions on the em interface
trigger a timeout.
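
To make the sequence concrete, here is a toy model of it in Python.  It
is only a sketch of the theory above, not a measurement: the function
name simulate, the tick values, and the event strings are all made up
for illustration, and the real em watchdog interval is not modeled.

```python
def simulate(giant_release_tick, watchdog_ticks, ichsmb_loaded, total_ticks=100):
    """Toy timeline of the shared em/ichsmb interrupt line.

    All numbers are hypothetical ticks, not real kernel timings.
    Returns a list of (tick, event) tuples.
    """
    line_masked = False
    pending_tx_since = None  # tick at which em started waiting on a completion
    events = []

    for tick in range(total_ticks):
        giant_free = tick >= giant_release_tick

        if tick == 0:
            # The shared interrupt fires.  em's fast handler runs at once.
            events.append((tick, "em fast handler queues task"))
            if ichsmb_loaded:
                # An ithread handler shares the line, so the line is masked
                # until that ithread has run.
                line_masked = True
                events.append((tick, "line masked, ichsmb ithread scheduled"))

        if line_masked and giant_free:
            # The ichsmb ithread finally gets Giant, runs, and the line is
            # unmasked again.
            line_masked = False
            events.append((tick, "ichsmb ithread ran, line unmasked"))

        if tick == 1:
            # em has a transmit outstanding and expects a completion interrupt.
            pending_tx_since = tick

        if pending_tx_since is not None:
            if not line_masked:
                events.append((tick, "em TX completion delivered"))
                pending_tx_since = None
            elif tick - pending_tx_since >= watchdog_ticks:
                events.append((tick, "em watchdog timeout"))
                return events

    return events
```

With Giant held for 50 ticks and a 10 tick watchdog, the model only
times out when ichsmb shares the line; with ichsmb unloaded, or with
Giant released quickly, the completion is delivered and no timeout
occurs.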

Assuming this analysis is correct, there are a couple of questions.
First would be, why is the ithread being blocked for so long?  Is the
Giant lock actually being held continuously for that long, or is it being
dropped and relocked often but the scheduler isn't giving the ithread a
chance to grab it and run?  Second is, why is this only being noticed
now?  Whether the em driver uses an INTR_FAST handler, like it does now,
or an ithread handler, like it used to in 6.1, doesn't affect the ichsmb
driver and its interaction with the Giant lock.  Maybe there isn't a
direct correlation here, and it's just a coincidence that something else
in the system changed at the same time as the driver changing.
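
The first question could in principle be answered from a trace of Giant
acquire/release times.  The sketch below assumes such a trace exists as
a list of (acquire, release) pairs for the other threads; that format,
the function name classify_stall, and the returned strings are
hypothetical and only illustrate the distinction being drawn.

```python
def classify_stall(giant_holds, blocked_at, ran_at):
    """Classify why an ithread stayed blocked on Giant.

    giant_holds: list of (acquire, release) times for other threads
                 holding Giant (hypothetical trace format).
    blocked_at:  time the ithread first blocked on Giant.
    ran_at:      time the ithread finally acquired Giant and ran.
    """
    # Every release strictly inside the stall window is a chance the
    # ithread could have taken Giant but did not.
    missed = [rel for _acq, rel in giant_holds if blocked_at < rel < ran_at]
    if not missed:
        return "continuous hold"
    return "dropped and relocked %d time(s); ithread starved" % len(missed)
```

A single hold spanning the whole window points at a misbehaving
subsystem; repeated releases inside the window point at the scheduler
not letting the ithread in.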

I have a few ideas on tracking down the root cause, but they are pretty
painful and slow.  The root cause does need to be found and
fixed, as it's either a very bad scheduler bug, or a very badly
misbehaving subsystem.  Both have implications for other possible
problems in FreeBSD.  Also, the usb driver has the same potential for
blocking as the ichsmb driver, as do other drivers.  But in the
meantime, something needs to be done for 6.2.  The options are:

1. Revert the em driver to its 6.1 form, ask people to test if the
problem persists.  If it doesn't, leave it at that for now.

2. Add INTR_FAST shims to the usb and ichsmb drivers so that neither
uses an ithread.  Without an ithread, no PIC masking will happen, and
these drivers can block all they want without interfering with the
em driver.  This is a bit of risky work, though, and may not be possible
if the devices don't support certain functionality.  Also, it doesn't
address the root problem.  But, getting more interrupt handlers away
from needing Giant is a good thing, even if this is only a band-aid.

3. Spend the time tracking down and fixing the root problem for 6.2.
This is ideal, but it is also an unbounded problem.  Thus, it is
absolutely not conducive to a timely and successful 6.2 release.

4. Do nothing for now and tell people to disable usb, ichsmb, etc, as
needed.  This, of course, is not a good option.

Option 1 is the quickest and likely most risk-free fix for the 6.2
release.  If someone could test doing a revert and report back, I would
appreciate it.  Any volunteers?

Scott



More information about the freebsd-stable mailing list