6.2 SHOWSTOPPER - em completely unusable on 6.2

Wed Sep 27 06:17:45 PDT 2006

Scott Long wrote:
> Oliver Brandmueller wrote:
> 
>> Hi,
>>
>> On Wed, Sep 27, 2006 at 08:00:21AM +0200, Martin Nilsson wrote:
>>
>>> I get tons of these:
>>> em0: watchdog timeout -- resetting
>>> em0: link state changed to DOWN
>>> em0: link state changed to UP
>>>
>>> mailbox# pciconf -lv
>>> em0@pci13:0:0:  class=0x020000 card=0x108c15d9 chip=0x108c8086 rev=0x03 hdr=0x00
>>>    vendor   = 'Intel Corporation'
>>>    device   = 'PRO/1000 PM'
>>>    class    = network
>>>    subclass = ethernet
>>> em1@pci14:0:0:  class=0x020000 card=0x109a15d9 chip=0x109a8086 rev=0x00 hdr=0x00
>>>    vendor   = 'Intel Corporation'
>>>    class    = network
>>>    subclass = ethernet
>>>
>>
>> [...]
>>
>>> I have only seen them on em0. Yesterday I tried sysutils/cpuburn on 
>>> similar boxes that are netbooted with NFS mounted drives and 
>>> every time I loaded the two CPU cores the network went down.
>>
>>
>>
>> I see the same.
>>
>> It is most pronounced on this one, a UP machine, where I work around
>> the problem by using polling.
>>
>> FreeBSD nessie 6.2-PRERELEASE FreeBSD 6.2-PRERELEASE #3: Fri Sep 15 09:48:36 CEST 2006     root@nessie:/usr/obj/usr/src/sys/NESSIE  i386
>>
>> em0@pci2:1:0:   class=0x020000 card=0x10198086 chip=0x10198086 rev=0x00 hdr=0x00
>>     vendor   = 'Intel Corporation'
>>     device   = '82547EI Gigabit Ethernet Controller (LOM)'
>>     class    = network
>>     subclass = ethernet
>>
>> irq18: em0 uhci2                    3319          0
>>
>>
>> Another machine, also UP, but with two interfaces. The problem is not 
>> as apparent as on the first machine, but it's there. This machine is 
>> not as loaded usually (CPU wise) as the first machine. The problem is 
>> ONLY on em1:
>>
>> FreeBSD hudson 6.2-PRERELEASE FreeBSD 6.2-PRERELEASE #48: Thu Sep 14 10:19:46 CEST 2006     root@hudson:/usr/obj/usr/src/sys/NFS-32-FBSD6  i386
>>
>> em0@pci1:1:0:   class=0x020000 card=0x10758086 chip=0x10758086 rev=0x00 hdr=0x00
>>     vendor   = 'Intel Corporation'
>>     device   = '82547EI Gigabit Ethernet Controller'
>>     class    = network
>>     subclass = ethernet
>>
>> em1@pci3:2:0:   class=0x020000 card=0x10768086 chip=0x10768086 rev=0x00 hdr=0x00
>>     vendor   = 'Intel Corporation'
>>     device   = '82547EI Gigabit Ethernet Controller'
>>     class    = network
>>     subclass = ethernet
>>
>> irq17: em1 ichsmb0             950121879        855
>> irq18: em0                      71437344         64
>>
>>
>> The problem appeared after the em driver updates of the last few weeks
>> and was not observed before them. em is always loaded as a 
>> module in my kernels. The problem seems to occur more often if the 
>> machine's CPU is busy.
>>
>>
>> I have several SMP machines with the following em interfaces, which 
>> DON'T show the problem, but they also have a different chipset on the em 
>> interface. Most of the kernels were built between Sep 7 and Sep 19.
>>
>> 3 times this:
>> em0@pci4:5:0:   class=0x020000 card=0x34248086 chip=0x10108086 rev=0x01 hdr=0x00
>> em1@pci4:5:1:   class=0x020000 card=0x34248086 chip=0x10108086 rev=0x01 hdr=0x00
>> irq23: em0                     970303432        750
>>
>>
>>
>> 3 times this:
>> em0@pci4:5:0:   class=0x020000 card=0x34258086 chip=0x100e8086 rev=0x02 hdr=0x00
>> irq23: em0                     292477376        435
>>
>>
>> So I can observe at least 3 interesting differences:
>>
>> - the interface showing the problems shares the interrupt
>> - for me it happens on UP machines only
>> - the chips are different
>>
>> What I can't do: move the interfaces between machines; these are 
>>                  onboard interfaces.
>>
>> What I could do: I could try to unload the USB driver or the ichsmb 
>> driver on the machines where the interrupts are shared. Anyway, the 
>> USB is not used currently (I have it enabled to be prepared to hook up 
>> a USB Mass Storage device, which has never happened since the problem 
>> occurred). The ichsmb also is usually not queried.
>>
>> Any suggestions on how I could help?
>>
>> - Olli
>>
>>
> 
> Well, the best I can say at the moment is, "Wow."  =-(  I guess the 
> thing to do here is to figure out if the problem lies with the em 
> interrupt handler not getting run, or the taskqueue not getting run.
> Since you've stated that it seems to be related to shared interrupts,
> the first possibility is more likely.  However, I'm not sure why the
> symptom would only be showing up now.  The Intel docs say that the
> 82547EI are a bit interesting, and I wonder if assumptions that we
> make about PCI ordering aren't true (or if there are bugs that make
> our assumptions invalid).
> 
> Does this happen after there has been a lot of disk activity, like a
> large tar extraction?  Are you using the SMBus interface at all, or is
> it sitting completely idle?

I have experienced this problem also.  It happens when the system is 
definitely not idle: I am simultaneously doing large internet 
transfers (via em), using the graphics card with OpenGL, and building 
the kde port.  I have actually had this problem for a month or so, so if 
it is a software fault it was introduced into the OS quite recently.  (I 
tend to rebuild RELENG_6 about twice a month.)

Stephen
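
For anyone wanting to check the shared-interrupt observation quickly on
their own boxes, here is a rough sketch in Python.  It assumes the
"irqNN: ..." lines quoted above come from `vmstat -i` (the thread does
not say so explicitly) and that they have the form
"irqN: dev [dev ...] total rate".  The function name shared_irqs and the
SAMPLE text are only illustrative; the sample lines are copied from the
messages above and mix several machines.

```python
import re

# Interrupt lines copied from the thread (they come from different machines).
SAMPLE = """\
irq18: em0 uhci2                    3319          0
irq17: em1 ichsmb0             950121879        855
irq18: em0                      71437344         64
irq23: em0                     970303432        750
"""

def shared_irqs(text, driver="em"):
    """Return (irq, devices) pairs where a `driver` device shares its IRQ.

    Assumes each interrupt line looks like 'irqN: dev [dev ...] total rate'.
    """
    result = []
    for line in text.splitlines():
        # Drop mail quote markers and surrounding whitespace, if any.
        line = line.lstrip("> ").strip()
        m = re.match(r"(irq\d+):\s+(.*?)\s+\d+\s+\d+$", line)
        if not m:
            continue
        irq, devices = m.group(1), m.group(2).split()
        uses_driver = any(re.fullmatch(driver + r"\d+", d) for d in devices)
        if uses_driver and len(devices) > 1:
            result.append((irq, devices))
    return result

if __name__ == "__main__":
    for irq, devices in shared_irqs(SAMPLE):
        print(irq, "shared by", ", ".join(devices))
```

On the sample lines this flags only the two interfaces reported as
misbehaving (em0 with uhci2, em1 with ichsmb0), which matches the
observation quoted above but of course does not prove the cause.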