[Bug 199174] em tx and rx hang

Sun Apr 5 13:28:24 UTC 2015

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199174

            Bug ID: 199174
           Summary: em tx and rx hang
           Product: Base System
           Version: 10.1-STABLE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs at FreeBSD.org
          Reporter: david.keller at litchis.fr

Hi,

While sending moderate NFS traffic (< 2 MB/s), the interface suddenly stopped
transmitting and receiving.

However, the interface seemed fine:
$ ifconfig
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
    options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
    ether 00:25:90:34:5d:44
    inet YYYY netmask 0xffffff00 broadcast YYY.255 
    inet6 fe80::225:90ff:fe34:5d44%em0 prefixlen 64 scopeid 0x1 
    inet6 XXXX prefixlen 64 autoconf 
    nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active

Pinging the gateway didn't work:
$ ping ZZZZ
PING ZZZZ (ZZZZ): 56 data bytes
ping: sendto: Host is down
ping: sendto: Host is down

But the driver seemed happy with the card, as no particular message was printed.

# tcpdump -ni em0
-> No rx traffic, only tx.

Printing the em driver's internal variables was more interesting:
$ sysctl dev.em.0.debug=1
Interface is RUNNING and ACTIVE
em0: hw tdh = 325, hw tdt = 166
em0: hw rdh = 688, hw rdt = 687
em0: Tx Queue Status = 1
em0: TX descriptors avail = 150
em0: Tx Descriptors avail failure = 0
em0: RX discarded packets = 0
em0: RX Next to Check = 688
em0: RX Next to Refresh = 687

Each ping request incremented hw tdt as expected. Wondering what would
happen once it reached the tdh value, I ping-flooded the gateway.
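
To put numbers on how far the tail was from the head, here is a minimal
sketch of the ring-index arithmetic. It assumes the usual head/tail
convention (hardware consumes at tdh, the driver produces at tdt) and a ring
of 1024 descriptors; the ring size is an assumption taken from the
"TX descriptors avail = 1024" line reported after the reset further down.
The function names are made up for illustration.

```python
# Illustrative ring arithmetic only; not taken from the em driver source.
RING_SIZE = 1024  # assumption: "TX descriptors avail = 1024" after the reset


def in_flight(tdh, tdt, ring_size=RING_SIZE):
    """Descriptors handed to the hardware and not yet consumed (head up to tail)."""
    return (tdt - tdh) % ring_size


def gap_to_head(tdh, tdt, ring_size=RING_SIZE):
    """Slots the tail can still advance before it wraps around onto the head."""
    return (tdh - tdt) % ring_size


# Values from the first debug dump: hw tdh = 325, hw tdt = 166.
print(in_flight(325, 166))    # 865
print(gap_to_head(325, 166))  # 159
```

Under these assumptions the tail was only 159 slots behind a head that was
no longer moving, so a ping flood could be expected to exhaust the ring
quickly.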

The driver figured out something was going wrong and reset the card:
# ping -f ZZZZ
em0: Watchdog timeout -- resetting
em0: Queue(0) tdh = 325, hw tdt = 285
em0: TX(0) desc avail = 31,Next TX to Clean = 316
em0: link state changed to DOWN
em0: link state changed to UP
Interface is RUNNING and ACTIVE
em0: hw tdh = 113, hw tdt = 113
em0: hw rdh = 36, hw rdt = 35
em0: Tx Queue Status = 0
em0: TX descriptors avail = 1024
em0: Tx Descriptors avail failure = 0
em0: RX discarded packets = 0
em0: RX Next to Check = 36
em0: RX Next to Refresh = 35

From here, the interface was working as usual.
$ ping ZZZZ
PING ZZZZ (ZZZZ): 56 data bytes
64 bytes from ZZZZ: icmp_seq=0 ttl=255 time=0.241 ms
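
Comparing the dump taken before the flood with the watchdog message, tdh
stayed at 325 while tdt moved from 166 to 285. The sketch below extracts the
two indices from those lines and checks for that pattern. The helper names
and the regular expression are illustrative, the ring size of 1024 is
assumed from the post-reset dump, and the free-slot formula (distance from
the tail to "Next TX to Clean") is an assumption that happens to match the
reported counters, not something read from the driver source.

```python
import re

RING_SIZE = 1024  # assumption: "TX descriptors avail = 1024" after the reset
TX_RE = re.compile(r"tdh = (\d+), hw tdt = (\d+)")


def tx_indices(line):
    """Return (tdh, tdt) parsed from one debug or watchdog line."""
    match = TX_RE.search(line)
    if match is None:
        raise ValueError("no tdh/tdt pair in: " + line)
    return int(match.group(1)), int(match.group(2))


def head_stalled(before, after):
    """True if the tail advanced between two samples while the head did not."""
    return before[0] == after[0] and before[1] != after[1]


before = tx_indices("em0: hw tdh = 325, hw tdt = 166")
after = tx_indices("em0: Queue(0) tdh = 325, hw tdt = 285")
print(head_stalled(before, after))  # True

# Assumed free-slot count: distance from the tail to Next TX to Clean (316).
next_to_clean = 316
print((next_to_clean - after[1]) % RING_SIZE)   # 31, as in "desc avail = 31"
# Assuming the clean index was already 316 at the time of the first dump:
print((next_to_clean - before[1]) % RING_SIZE)  # 150, as in "avail = 150"
```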

$ dmesg
FreeBSD 10.1-RELEASE-p6 #0: Tue Feb 24 19:00:21 UTC 2015
[...]
em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xdc00-0xdc1f mem 0xfe9e0000-0xfe9fffff,0xfe9dc000-0xfe9dffff irq 16 at device 0.0 on pci2
em0: Using MSIX interrupts with 3 vectors
em0: Ethernet address: 00:25:90:34:5d:44
pcib3: <ACPI PCI-PCI bridge> irq 16 at device 28.5 on pci0
pci3: <ACPI PCI bus> on pcib3
em1: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xec00-0xec1f mem 0xfeae0000-0xfeafffff,0xfeadc000-0xfeadffff irq 17 at device 0.0 on pci3
em1: Using MSIX interrupts with 3 vectors
em1: Ethernet address: 00:25:90:34:5d:45

$ pciconf -elv
[...]
em0 at pci0:2:0:0:    class=0x020000 card=0x060a15d9 chip=0x10d38086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82574L Gigabit Network Connection'
    class      = network
    subclass   = ethernet
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Receiver Error
                 Bad TLP
                 Bad DLLP
                 REPLAY_NUM Rollover
                 Replay Timer Timeout
                 Advisory Non-Fatal Error
em1 at pci0:3:0:0:    class=0x020000 card=0x060a15d9 chip=0x10d38086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82574L Gigabit Network Connection'
    class      = network
    subclass   = ethernet
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Receiver Error
                 Bad TLP
                 Bad DLLP
                 Replay Timer Timeout
                 Advisory Non-Fatal Error

The port is connected to a GS108 switch. The link was up the whole time and no
transmit error was detected.

The motherboard is a Supermicro X7SPA-HF with the latest BIOS.
On this board, there is a BMC sharing the em0 port. The BMC was not responding
either.

Hence my guess is that the fault lies with the card rather than the driver,
since the BMC suffered too.

This is also happening on an OpenBSD em0 with the same motherboard (but not
connected to the same switch).
