Possible fxp(4) problem in -CURRENT

Wed Oct 19 06:59:21 PDT 2005

Dan Bilik wrote:
> On Tue, 18 Oct 2005 20:30:12 +0000 (GMT)
> wpaul at FreeBSD.ORG (Bill Paul) wrote:
> 
> >> Today one of the problem machines got stuck again. I was able to
> >> log on through the second, functional interface and watch it more
> >> closely. Sending packets from the box worked (its ARP requests were
> >> appearing on other boxes in the subnet) but it could not receive
> >> any packets. One more thing: it seems that running tcpdump (i.e.
> >> entering and leaving promiscuous mode) on the interface resolved
> >> the problem and made the machine reappear on the network.
> >> It has been running with no problems since then.
> > ...
> > - The chip has experienced an RX overrun, where all of the descriptors
> >   in its RX DMA ring have been filled by the chip before the driver
> >   has had a chance to drain them. When this happens, the chip may
> >   require the RX unit to be resumed.
> > - For some reason, the RX handler code in the driver has fallen out
> >   of sync with the chip, i.e. the current descriptor index has gotten
> >   clobbered, or maybe the chip was restarted and the index wasn't
> >   properly reset.
> >
> > RX overruns are obviously the result of a very busy network (or a very
> > busy host processor that can't service the NIC frequently enough to
> > drain the RX ring). If the network is busy, it would be with a lot of
> > small packets.
> 
> Yes, it's exactly that case. The box is running boa to serve http
> requests for static content (mostly small to medium size images). There
> are around 1k established short-lived connections and 50-70% CPU usage
> for most of the day. We have also tried polling(4) on the problem
> machines but it didn't help (though CPU usage went down).
> 
> The same hardware serving the same purpose but running 4.9-RELEASE has
> never jammed that way. It runs for months without a problem.
> 
> > You should run vmstat -i or something to monitor the interrupt rate
> > on the failing interface and see if it peaks right before it goes
> > deaf.
> 
> OK, I'm going to periodically collect this information on the problem
> boxes. Thanks.

I just checked the code and it appears that fxp(4) indeed doesn't handle
DMA overrun errors.  I've generated a small patch that adds a printf()
call if that happens.  It won't solve the problem, since printing a
message is all it does, but if you can confirm that the message is
printed when the interface goes deaf, I'll write a real patch later to
fix the issue (I'm at work at the moment and kinda busy).
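
As a rough illustration of what the patch tests, here is a minimal
standalone sketch (not driver code).  The struct, the sketch_* names and
the two bit values are assumptions made for this example; they only
mirror FXP_RFA_STATUS_OVERRUN and FXP_RFA_STATUS_C, whose real
definitions live in the driver's headers.  The status word is decoded by
hand from little-endian bytes, standing in for le16toh().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit values, for illustration only (see the driver headers). */
#define SKETCH_RFA_STATUS_OVERRUN	0x0100	/* DMA overrun reported */
#define SKETCH_RFA_STATUS_C		0x8000	/* descriptor completed */

/* Hypothetical, stripped-down receive descriptor: status word only. */
struct sketch_rfa {
	uint8_t rfa_status[2];	/* little-endian, as written by the chip */
};

/* Portable stand-in for le16toh(): decode two little-endian bytes. */
static uint16_t
sketch_le16(const uint8_t b[2])
{
	return ((uint16_t)(b[0] | (b[1] << 8)));
}

/*
 * Walk the ring in the same order as the patched loop: test the
 * overrun bit first, then stop at the first descriptor that the
 * chip has not completed.  Returns the number of overruns seen.
 */
static unsigned
sketch_count_overruns(const struct sketch_rfa *ring, size_t n)
{
	unsigned overruns = 0;
	size_t i;

	assert(ring != NULL);
	for (i = 0; i < n; i++) {
		uint16_t status = sketch_le16(ring[i].rfa_status);

		if (status & SKETCH_RFA_STATUS_OVERRUN)
			overruns++;
		if ((status & SKETCH_RFA_STATUS_C) == 0)
			break;
	}
	return (overruns);
}
```

In the real driver the patch only prints a message at the point where
this sketch increments its counter; the counter is just a convenient way
to observe the same condition outside the kernel.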

Cheers,
Maxime
-------------- next part --------------

--- if_fxp.c.orig	Wed Oct 19 15:49:58 2005
+++ if_fxp.c	Wed Oct 19 15:54:48 2005
@@ -1641,6 +1641,9 @@
 		}
 #endif /* DEVICE_POLLING */
 
+		if (le16toh(rfa->rfa_status) & FXP_RFA_STATUS_OVERRUN)
+			device_printf(sc->dev, "DMA overrun\n");
+
 		if ((le16toh(rfa->rfa_status) & FXP_RFA_STATUS_C) == 0)
 			break;