6.2 SHOWSTOPPER - em completely unusable on 6.2
Stephen Montgomery-Smith
stephen at math.missouri.edu
Wed Sep 27 06:17:45 PDT 2006
Scott Long wrote:
> Oliver Brandmueller wrote:
>
>> Hi,
>>
>> On Wed, Sep 27, 2006 at 08:00:21AM +0200, Martin Nilsson wrote:
>>
>>> I get tons of these:
>>> em0: watchdog timeout -- resetting
>>> em0: link state changed to DOWN
>>> em0: link state changed to UP
>>>
>>> mailbox# pciconf -lv
>>> em0 at pci13:0:0: class=0x020000 card=0x108c15d9 chip=0x108c8086
>>> rev=0x03 hdr=0x00
>>> vendor = 'Intel Corporation'
>>> device = 'PRO/1000 PM'
>>> class = network
>>> subclass = ethernet
>>> em1 at pci14:0:0: class=0x020000 card=0x109a15d9 chip=0x109a8086
>>> rev=0x00 hdr=0x00
>>> vendor = 'Intel Corporation'
>>> class = network
>>> subclass = ethernet
>>>
>>
>> [...]
>>
>>> I have only seen them on em0. Yesterday I tried sysutils/cpuburn on
>>> similar boxes that are netbooted with NFS mounted drives and
>>> every time I loaded the two CPU cores the network went down.
>>
>>
>>
>> I see the same.
>>
>> Very much on this one, where I work around the problem by using
>> polling; it's a UP machine.
>>
>> FreeBSD nessie 6.2-PRERELEASE FreeBSD 6.2-PRERELEASE #3: Fri Sep 15
>> 09:48:36 CEST 2006 root at nessie:/usr/obj/usr/src/sys/NESSIE i386
>>
>> em0 at pci2:1:0: class=0x020000 card=0x10198086 chip=0x10198086
>> rev=0x00 hdr=0x00
>> vendor = 'Intel Corporation'
>> device = '82547EI Gigabit Ethernet Controller (LOM)'
>> class = network
>> subclass = ethernet
>>
>> irq18: em0 uhci2 3319 0
>>
>>
>> Another machine, also UP, but with two interfaces. The problem is not
>> as apparent as on the first machine, but it's there. This machine is
>> not as loaded usually (CPU wise) as the first machine. The problem is
>> ONLY on em1:
>>
>> FreeBSD hudson 6.2-PRERELEASE FreeBSD 6.2-PRERELEASE #48: Thu Sep 14
>> 10:19:46 CEST 2006 root at hudson:/usr/obj/usr/src/sys/NFS-32-FBSD6
>> i386
>>
>> em0 at pci1:1:0: class=0x020000 card=0x10758086 chip=0x10758086
>> rev=0x00 hdr=0x00
>> vendor = 'Intel Corporation'
>> device = '82547EI Gigabit Ethernet Controller'
>> class = network
>> subclass = ethernet
>>
>> em1 at pci3:2:0: class=0x020000 card=0x10768086 chip=0x10768086
>> rev=0x00 hdr=0x00
>> vendor = 'Intel Corporation'
>> device = '82547EI Gigabit Ethernet Controller'
>> class = network
>> subclass = ethernet
>>
>> irq17: em1 ichsmb0 950121879 855
>> irq18: em0 71437344 64
>>
>>
>> The problem appeared after the em updates during the last weeks in the
>> kernel and has not been observed before this. em is always loaded as a
>> module in my kernels. The problem seems to occur more often if the
>> machine's CPU is busy.
>>
>>
>> I have several SMP machines with the following em interfaces, which
>> DON'T show the problem, but they also have a different chipset on the
>> em interface. Most of the kernels were built between Sep 7 and Sep 19.
>>
>> 3 times this:
>> em0 at pci4:5:0: class=0x020000 card=0x34248086 chip=0x10108086
>> rev=0x01 hdr=0x00
>> em1 at pci4:5:1: class=0x020000 card=0x34248086 chip=0x10108086
>> rev=0x01 hdr=0x00
>> irq23: em0 970303432 750
>>
>>
>>
>> 3 times this:
>> em0 at pci4:5:0: class=0x020000 card=0x34258086 chip=0x100e8086
>> rev=0x02 hdr=0x00
>> irq23: em0 292477376 435
>>
>>
>> So I can observe at least 3 interesting differences:
>>
>> - the interface showing the problems shares the interrupt
>> - for me it happens on UP machines only
>> - the chips are different
>>
>> What I can't do: move the interfaces between machines; these are
>> onboard interfaces.
>>
>> What I could do: I could try to unload the USB driver or the ichsmb
>> driver on the machines where the interrupts are shared. Anyway, the
>> USB is not used currently (I have it enabled to be prepared to hook up
>> a USB Mass Storage device, which has never happened since the problem
>> occurred). The ichsmb also is usually not queried.
>>
>> Any suggestions on how I could help?
>>
>> - Olli
>>
>>
>
> Well, the best I can say at the moment is, "Wow." =-( I guess the
> thing to do here is to figure out if the problem lies with the em
> interrupt handler not getting run, or the taskqueue not getting run.
> Since you've stated that it seems to be related to shared interrupts,
> the first possibility is more likely. However, I'm not sure why the
> symptom would only be showing up now. The Intel docs say that the
> 82547EI are a bit interesting, and I wonder if assumptions that we
> make about PCI ordering aren't true (or if there are bugs that make
> our assumptions invalid).
>
> Does this happen after there has been a lot of disk activity, like a
> large tar extraction? Are you using the SMBus interface at all, or is
> it sitting completely idle?
I have experienced this problem also. It happens when the system is
definitely not idle: for example, when I am simultaneously doing large
internet transfers (via em), using the graphics card with OpenGL, and
building the kde port. I have actually had this problem for a month or
so, so if it is a software fault it was introduced into the OS quite
recently. (I tend to rebuild RELENG_6 about twice a month.)
Stephen
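
For anyone comparing machines, the shared-interrupt observation in the
quoted message can be checked mechanically. The following is only a
minimal Python sketch, assuming interrupt lines laid out like the ones
quoted in this thread (irqN: device names, then total and rate); the
helper name shared_irqs is hypothetical and not part of any FreeBSD
tool.

```python
import re

# Matches lines such as "irq17: em1 ichsmb0 950121879 855":
# the IRQ label, one or more device names, then total and rate.
IRQ_LINE = re.compile(r"^(irq\d+):\s+(.+?)\s+(\d+)\s+(\d+)$")


def shared_irqs(text):
    """Return {irq: [devices]} for every IRQ listed with more than one device."""
    shared = {}
    for raw in text.splitlines():
        # Tolerate mail quoting (">> ") and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        match = IRQ_LINE.match(line)
        if not match:
            continue
        devices = match.group(2).split()
        if len(devices) > 1:
            shared[match.group(1)] = devices
    return shared


if __name__ == "__main__":
    # Interrupt lines as quoted for the machine "hudson" in this thread.
    sample = """\
>> irq17: em1 ichsmb0 950121879 855
>> irq18: em0 71437344 64
"""
    for irq, devices in shared_irqs(sample).items():
        print(irq, "is shared by", ", ".join(devices))
```

Feeding it the saved interrupt listing from each box would show at a
glance whether the failing em interface is the one sharing its IRQ.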