em(4) receive part wedging randomly at moderate load

Mon Sep 26 07:29:12 PDT 2005

  Colleagues,

  for the last month we have been experiencing a nasty problem with the
em(4) driver. Several times a day the receive path of the driver wedges
for a minute or two. During the wedge the transmit path works with
no problems. The latter fact makes this problem very nasty, because
the problematic router can't be backed up with the help of CARP: it
keeps transmitting CARP advertisements, so the backup never takes over.

Some details: during the wedge all incoming packets are lost and
counted as "Missed packets". I've checked this using
`sysctl dev.em.0.stats=1`. The `dmesg` output is the following:

em0: Excessive collisions = 0
em0: Symbol errors = 0
em0: Sequence errors = 0
em0: Defer count = 0
em0: Missed Packets = 1266
em0: Receive No Buffers = 220
em0: Receive length errors = 0
em0: Receive errors = 0
em0: Crc errors = 0
em0: Alignment errors = 0
em0: Carrier extension errors = 0
em0: XON Rcvd = 0
em0: XON Xmtd = 0
em0: XOFF Rcvd = 0
em0: XOFF Xmtd = 0
em0: Good Packets Rcvd = 28347789
em0: Good Packets Xmtd = 30911959

There is clear evidence that the command `sysctl dev.em.0.stats=1` itself
can trigger the wedge. It is important that the stats are printed
to a 9600 baud serial console, which takes about a second. I
suspect that the wedge happens when the kernel doesn't service NIC
interrupts for some period of time. Yes, some packets should be lost in
this case, but the wedge must not continue for minutes!
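
The "about a second" figure is consistent with simple serial-line
arithmetic. A minimal sketch, assuming 8N1 framing (10 bits on the wire
per character) and ignoring any kernel overhead; the function name is
made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Milliseconds needed to push nbytes characters through a serial
 * console at the given baud rate, assuming 8N1 framing:
 * 1 start bit + 8 data bits + 1 stop bit = 10 bits per character.
 */
uint64_t
console_print_ms(uint64_t nbytes, uint64_t baud)
{
    assert(baud > 0);
    return (nbytes * 10 * 1000 / baud);
}
```

The 17 lines quoted above are roughly 450 characters, so
console_print_ms(450, 9600) gives 468 ms, and a dump of about 960
characters would take a full second. Whether the NIC goes unserviced
for that whole time depends on how the console write is done, which
this sketch does not model.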

The box is serving 8 - 15 kpps, 70 - 100 MBps. It runs a stateful pf(4)
firewall with 50k - 80k states. IP fastforwarding is enabled. The
average state insert/removal rate is 300 states per second, but
sometimes several thousand states are removed in one pass. The
state removal locks the network code for quite a long time, so I guess
that the wedge happens exactly when a lot of states are removed: the NIC
interrupts aren't serviced for some time and the receive path wedges.
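
To put numbers on this guess: at the packet rates above, a receive ring
that is not being refilled runs out of descriptors within tens of
milliseconds, after which the NIC has to drop packets. A rough sketch
that ignores the on-chip FIFO; the 256-descriptor ring size is an
assumption, not a value read from this box, and the function name is
made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Microseconds until a receive ring of ndesc free descriptors is
 * exhausted when packets arrive at pps packets per second and the
 * driver does not refill the ring in the meantime.
 */
uint64_t
ring_fill_us(uint64_t ndesc, uint64_t pps)
{
    assert(pps > 0);
    return (ndesc * 1000000 / pps);
}
```

ring_fill_us(256, 8000) is 32000 us and ring_fill_us(256, 15000) is
17066 us, so under these assumptions a stall of even 50 ms would
already show up in the "Missed Packets" and "Receive No Buffers"
counters. That accounts for some loss, but not for a wedge that lasts
minutes.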

The hardware is a Supermicro server with two onboard NICs:

dev.em.0.%pnpinfo: vendor=0x8086 device=0x1075 subvendor=0x8086 subdevice=0x1075 class=0x020000
dev.em.1.%pnpinfo: vendor=0x8086 device=0x1076 subvendor=0x8086 subdevice=0x1076 class=0x020000

The NIC is plugged into a gigabit Ethernet port on a Cisco Catalyst
6509. No errors are counted on the switch port.

To work around the problem, I have made the following patch:

@@ -1650,12 +1651,18 @@
        struct ifnet   *ifp;
        struct adapter * adapter = arg;
        ifp = adapter->ifp;
+       uint64_t        ompc;
 
        EM_LOCK(adapter);
 
        em_check_for_link(&adapter->hw);
        em_print_link_status(adapter);
-       em_update_stats_counters(adapter);   
+       ompc = adapter->stats.mpc;
+       em_update_stats_counters(adapter);
+       if (adapter->stats.mpc > ompc) {
+               printf("em watchdog: mpc %ju->%ju\n", (uintmax_t)ompc, (uintmax_t)adapter->stats.mpc);
+               em_init_locked(adapter);
+       }
        if (em_display_debug_stats && ifp->if_drv_flags & IFF_DRV_RUNNING) {
                em_print_hw_stats(adapter);
        }

It helps to reduce the downtime from a few minutes to 2 seconds, but
this is a very dirty approach to the problem. Sample output at runtime
with the patch applied:

em watchdog: mpc 1767->2739
em watchdog: mpc 2739->4724
em watchdog: mpc 4724->7794
em watchdog: mpc 7794->10729

Every time this is printed, the network wedges for 2 seconds and then
it revives.
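
The check added by the patch can be modelled outside the kernel. Note
that it reinitializes the NIC on any growth of the missed-packet
counter, including a short burst of loss from which the driver would
recover by itself. The sketch below puts that rule next to a stricter
variant that fires only when packets were missed and none were received
in the same interval. The stricter rule is an untested assumption, and
struct rx_sample is a hypothetical stand-in for the driver's statistics
(the field names follow the hardware counters MPC and GPRC):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of two cumulative hardware counters. */
struct rx_sample {
    uint64_t mpc;   /* missed packets */
    uint64_t gprc;  /* good packets received */
};

/* Rule used by the patch: any new missed packet triggers a reinit. */
bool
reset_on_any_miss(const struct rx_sample *prev, const struct rx_sample *cur)
{
    assert(prev != NULL && cur != NULL);
    return (cur->mpc > prev->mpc);
}

/*
 * Stricter variant (assumption): reinit only if packets were missed
 * and nothing at all was received since the previous sample.
 */
bool
reset_on_stall(const struct rx_sample *prev, const struct rx_sample *cur)
{
    assert(prev != NULL && cur != NULL);
    return (cur->mpc > prev->mpc && cur->gprc == prev->gprc);
}
```

With the stricter rule, an interval in which mpc grows but gprc also
grows is treated as ordinary overload and left alone, which would avoid
the 2 second outage in that case. Whether gprc really stays flat during
the wedge would have to be confirmed on the affected box.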

I am asking the developers who work at Intel to pay attention to this
problem. For my part, I can offer any help with testing and debugging.

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE