kern/96391: Device timeouts on nve(4) [PATCH]

Nathan Whitehorn nathanw at uchicago.edu
Thu Apr 27 03:50:16 UTC 2006


>Number:         96391
>Category:       kern
>Synopsis:       Device timeouts on nve(4) [PATCH]
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Apr 27 03:50:14 GMT 2006
>Closed-Date:
>Last-Modified:
>Originator:     Nathan Whitehorn
>Release:        6.1-RC
>Organization:
University of Chicago
>Environment:
FreeBSD munuc.uchicago.edu 6.1-RC FreeBSD 6.1-RC #9: Wed Apr 26 22:02:06 CDT 2006     root at munuc.uchicago.edu:/usr/obj/usr/src/sys/MUNUC  amd64
>Description:
On some systems with nVidia NICs, especially nForce4, nve(4) reports frequent device timeouts (every 5-10 minutes) under low load. According to a note in the forcedeth source, this appears to happen because the nve MAC occasionally fails to send a tx acknowledgement interrupt. Under load, tx interrupts from other packets or rx interrupts cause the interrupt routine to run anyway, and it registers the missed transmit notification. Under low load, the watchdog timer expires before that happens, causing a device timeout and a MAC reset, which also briefly hangs the machine.
>How-To-Repeat:
Place an affected nve controller on a low-traffic network and watch the errors come rolling in.
>Fix:
We can fix the problem by calling the nVidia HAL's interrupt service routine from nve_watchdog(), in effect simulating the interrupt when we are expecting one and it has not arrived. If the pending transmits counter is still non-zero afterwards, we conclude, as before, that the NIC has crashed and reset it; if it has dropped to zero, the problem is resolved and we simply return.

--- if_nve_original.c   Wed Apr 26 22:23:14 2006
+++ if_nve.c    Wed Apr 26 21:52:34 2006
@@ -1270,6 +1270,18 @@
 nve_watchdog(struct ifnet *ifp)
 {
        struct nve_softc *sc = ifp->if_softc;
+
+       NVE_LOCK(sc);
+       /* Check for lost interrupts -- happens on nForce4 */
+       sc->hwapi->pfnDisableInterrupts(sc->hwapi->pADCX);
+       sc->hwapi->pfnHandleInterrupt(sc->hwapi->pADCX);
+       sc->hwapi->pfnEnableInterrupts(sc->hwapi->pADCX);
+
+       if (sc->pending_txs == 0) {
+               NVE_UNLOCK(sc);
+               return; /* Problem went away */
+       }
+       NVE_UNLOCK(sc);

        device_printf(sc->dev, "device timeout (%d)\n", sc->pending_txs);
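
For readers without the driver source at hand, the decision the patched watchdog makes can be modelled in user space. This is only a sketch under stated assumptions: struct fake_softc, fake_handle_interrupt() and fake_watchdog() are hypothetical stand-ins, not part of if_nve.c or the nVidia HAL, and the reset is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the driver state; not the real struct nve_softc. */
struct fake_softc {
	int pending_txs;      /* packets handed to the NIC, not yet acknowledged */
	int completed_unseen; /* packets the hardware finished but never signalled */
	int resets;           /* how many times the watchdog reset the MAC */
};

/* Models the interrupt service routine: reap completions already done in hardware. */
static void
fake_handle_interrupt(struct fake_softc *sc)
{
	assert(sc != NULL);
	sc->pending_txs -= sc->completed_unseen;
	sc->completed_unseen = 0;
}

/*
 * Models the patched watchdog: run the service routine by hand first,
 * and reset only if transmits are still outstanding afterwards.
 * Returns true when a reset was performed.
 */
static bool
fake_watchdog(struct fake_softc *sc)
{
	fake_handle_interrupt(sc);
	if (sc->pending_txs == 0)
		return (false);	/* lost interrupt only; problem went away */

	printf("device timeout (%d)\n", sc->pending_txs);
	/* Assumption for this model: a reset discards everything in flight. */
	sc->pending_txs = 0;
	sc->completed_unseen = 0;
	sc->resets++;
	return (true);
}
```

With pending_txs = 2 and completed_unseen = 2 (a lost interrupt) the model returns without resetting; with pending_txs = 3 and completed_unseen = 1 (a genuinely stuck transmit) it still reports a timeout and resets, as the unpatched code did in both cases.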
>Release-Note:
>Audit-Trail:
>Unformatted:

