Watchdog timeouts and dead network on bge - 6.1-RC1

Chris chrcoluk at gmail.com
Sun Apr 23 23:38:22 UTC 2006


On 23/04/06, Robert Watson <rwatson at freebsd.org> wrote:
> On Sun, 23 Apr 2006, Lars Erik Gullerud wrote:
>
> > We recently upgraded one of our 4.11 servers to 6.1-RC1. The server is a
> > Dell PE2650, dual Xeons, and has two onboard Broadcom BCM5701 cards, using
> > the bge driver.
> >
> > Some older threads on -net and -current led me to believe that most issues
> > with bge driver in FreeBSD >4 had been sorted. However, after our upgrade,
> > we are seeing errors like this:
>
> There's a Dell 2650 in the FreeBSD netperf cluster.  When working with 5.x on
> the box quite a long time ago, I saw similar problems, in which the network
> interface stalled and required kicking to reset.  Unfortunately, this is not
> an issue I have time to work on currently, but if it would help a FreeBSD
> developer track down and debug this problem, I can provide remote access to a
> box that has had the problem in the past, along with serial console, remote
> power, and network booting.  I'll run some tests on it today and see if that
> box still has the same problem or not.  I've never been entirely convinced it
> was actually a bge problem as opposed to an interrupt delivery problem,
> however.  Dmesg fragment below.
>
> Robert N M Watson
>
> Copyright (c) 1992-2005 The FreeBSD Project.
> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
>         The Regents of the University of California. All rights reserved.
> FreeBSD 6.0-CURRENT #1: Sat Jan 29 21:32:42 EST 2005
>     rwatson at zoo.freebsd.org:/usr/obj/zoo/rwatson/netperf/src/sys/GENERIC
> WARNING: WITNESS option enabled, expect reduced performance.
> Timecounter "i8254" frequency 1193182 Hz quality 0
> CPU: Intel(R) XEON(TM) CPU 2.20GHz (2192.90-MHz 686-class CPU)
>   Origin = "GenuineIntel"  Id = 0xf24  Stepping = 4
>   Features=0x3febfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM>
> real memory  = 2147352576 (2047 MB)
> avail memory = 2096799744 (1999 MB)
> ACPI APIC Table: <DELL   PE2650  >
> FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
>  cpu0 (BSP): APIC ID:  0
>  cpu1 (AP): APIC ID:  6
> ioapic0: Changing APIC ID to 8
> ioapic1: Changing APIC ID to 9
> ioapic2: Changing APIC ID to 10
> MADT: Forcing active-low polarity and level trigger for SCI
> ioapic0 <Version 1.1> irqs 0-15 on motherboard
> ioapic1 <Version 1.1> irqs 16-31 on motherboard
> ioapic2 <Version 1.1> irqs 32-47 on motherboard
> ...
> ACPI APIC Table: <DELL   PE2650  >
> acpi0: <DELL PE2650> on motherboard
> aac0: <Dell PERC 3/Di> mem 0xf0000000-0xf7ffffff irq 30 at device 8.1 on pci4
> ...
> bge0: <Broadcom BCM5701 Gigabit Ethernet, ASIC rev. 0x105> mem 0xfcd10000-0xfcd1ffff irq 28 at device 6.0 on pci3
> miibus0: <MII bus> on bge0
> bge0: Ethernet address: 00:06:5b:8e:b9:8d
> bge1: <Broadcom BCM5701 Gigabit Ethernet, ASIC rev. 0x105> mem 0xfcd00000-0xfcd0ffff irq 29 at device 8.0 on pci3
> miibus1: <MII bus> on bge1
> bge1: Ethernet address: 00:06:5b:8e:b9:8e
>
>
> >
> > Apr 22 18:44:01 nebula kernel: bge0: watchdog timeout -- resetting
> > Apr 22 18:44:01 nebula kernel: bge0: link state changed to DOWN
> > Apr 22 18:44:03 nebula kernel: bge0: link state changed to UP
> >
> > ...and more importantly - when this happens, the network connection does NOT
> > in fact come back up. Logging into the box locally (or via a different
> > network interface) and manually issuing "ifconfig bge0 down ; ifconfig bge0
> > up" DOES get the interface going again, however.
> >
> > We have only seen this on very high network loads - the particular message
> > included above occurred while transferring some 120GB of data from a 4.11
> > NFS-server to this 6.1-RC1 box.
> >
> > Is this a known issue in bge? If so, is anyone working on it? Can we provide
> > some useful information to whoever this might be?
> >
> > We have never had any issues with bge in 4.x, but we really need to get this
> > server up to 5.x/6.x at this point in time, any other suggestions on knobs or
> > workarounds that can give us bge stability?
> >
> > Thanks in advance,
> >
> > /leg
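
For what it's worth, the manual reset quoted above could be scripted
along these lines. This is only a rough sketch: watchdog_iface, bounce
and the DRY_RUN switch are names made up for illustration, and it only
prints the commands unless DRY_RUN is set to 0. It works around the
stall; it does not fix whatever causes it.

```sh
#!/bin/sh
# Sketch: pull the interface name out of a watchdog log line and
# print (or run) the down/up cycle described in the quoted message.
# watchdog_iface, bounce and DRY_RUN are illustrative names.
DRY_RUN="${DRY_RUN:-1}"

# Prints the interface name if the line reports a watchdog timeout,
# prints nothing otherwise.
watchdog_iface() {
    printf '%s\n' "$1" |
        sed -n 's/^.*kernel: \([a-z][a-z]*[0-9][0-9]*\): watchdog timeout -- resetting.*$/\1/p'
}

# Runs "ifconfig IF down" then "ifconfig IF up", or only prints the
# commands when DRY_RUN=1 (the default here).
bounce() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "ifconfig $1 down ; ifconfig $1 up"
    else
        ifconfig "$1" down && ifconfig "$1" up
    fi
}

# Example with the log line from the original report.
line='Apr 22 18:44:01 nebula kernel: bge0: watchdog timeout -- resetting'
ifc=$(watchdog_iface "$line")
[ -n "$ifc" ] && bounce "$ifc"
```

Feeding it lines from the system log (for example via tail -F) would
be the obvious next step, but I have not tried that here.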

I had this problem on a 6.0-RELEASE server but got it to stop. The
interface is bge0, and the problem only occurs when the card has
negotiated 10mbit; when we switched back to 100mbit full duplex the
problem went away. We also found that adding aliases with netmask
255.255.255.255 caused bge0 to switch between DOWN and UP. I got no
watchdog errors, but we had the DOWN/UP problem. The box needed a
reboot to come back online again.
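
One way to pin the media instead of relying on autonegotiation is
ifconfig's media and mediaopt keywords. A minimal sketch, assuming the
switch port is also set to 100mbit full duplex; force_media and DRY_RUN
are illustrative names, and by default the script only prints the
command rather than running it:

```sh
#!/bin/sh
# Sketch: pin an interface to 100 Mbit full duplex rather than letting
# it autonegotiate. force_media and DRY_RUN are illustrative names.
DRY_RUN="${DRY_RUN:-1}"

force_media() {
    cmd="ifconfig $1 media 100baseTX mediaopt full-duplex"
    if [ "$DRY_RUN" = "1" ]; then
        # Only show what would be run.
        echo "$cmd"
    else
        # Unquoted on purpose so the string splits into arguments.
        $cmd
    fi
}

force_media bge0
```

Running plain "ifconfig bge0" afterwards shows the active media on its
"media:" line, which is a quick way to check what the card ended up
with.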

Unfortunately this isn't my server; I just admin it for a client, so I
am not in a position to give access. I will speak to him, though, and
if he is ok with it I am happy to test patches etc. that may solve this
problem.

Chris

