FreeBSD -STABLE servers repeatedly crashing.

Blaz Zupan blaz at si.FreeBSD.org
Wed Jul 6 07:40:27 GMT 2005


On Fri, 1 Jul 2005, Kris Kennaway wrote:
>> On Tue, Jun 28, 2005 at 11:26:06AM -0400, Matt Juszczak wrote:
>>> After CPUID: 1, the machine locks cold and nothing else is printed to
>>> the screen.
>>
>> Try two things:
>>
>> 1) adding 'options KDB_STOP_NMI' to your kernel config.
>
> I just learned that you also need to set the
> debug.kdb.stop_cpus_with_nmi=1 sysctl (e.g. in sysctl.conf).

I'm experiencing the same crashes as Matt, but on 5.4-RELEASE-p3. The machine
is an HP DL380 G3 under heavy load: it is a postfix mail server running
amavisd-new with antivirus and antispam, so both I/O and CPU are busy. On 5.4
it does not survive more than a couple of hours, while it is rock stable
running 4.11. We have four machines like this; three of them are back on 4.11
and we left the fourth one at 5.4. We have two other DL380 servers working on
our outbound mail queue, but they are not SMP and they are rock stable on 5.4.

Without KDB_STOP_NMI, the machine simply hung after a crash and never reached
the debugger.
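
For reference, the two settings from Kris's advice quoted above amount to the
following. This is only a sketch: the config file location is an assumption
(DL380 is the kernel config name shown in the dmesg, and sys/i386/conf is the
usual place for an i386 config), and the kernel must be rebuilt and installed
after adding the option.

```
# Kernel configuration file (assumed: /usr/src/sys/i386/conf/DL380)
options 	KDB_STOP_NMI

# /etc/sysctl.conf
debug.kdb.stop_cpus_with_nmi=1
```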

Now I've finally landed in the kernel debugger: I have a trace from DDB and
was also able to generate a crash dump with "call doadump".
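
For anyone collecting the same data, the steps can be sketched as below. The
two DDB commands are the ones mentioned above; the kgdb line is an
illustration only. It assumes a dump device is configured, that savecore(8)
has written the dump to the default /var/crash directory as vmcore.0, and that
kernel.debug sits in the build directory shown in the dmesg.

```
db> trace
db> call doadump

# after the reboot, once savecore(8) has saved the dump (paths assumed):
kgdb /usr/obj/usr/src5/sys/DL380/kernel.debug /var/crash/vmcore.0
```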

If a developer is willing to investigate, I have:
- the vmcore file from the crash (its size is 1GB)
- the corresponding kernel, compiled with debug symbols
- a GIF of the console showing the backtrace at the time of the crash
- a dmesg from the box (see below)
- the kernel config file

Please contact me if you want to investigate this further.

Just in case, here is a dmesg from the box:

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
 	The Regents of the University of California. All rights reserved.
FreeBSD 5.4-RELEASE-p3 #0: Tue Jul  5 18:37:15 CEST 2005
     blaz at bigbrother.amis.net:/usr/obj/usr/src5/sys/DL380
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(TM) CPU 3.06GHz (3049.93-MHz 686-class CPU)
   Origin = "GenuineIntel"  Id = 0xf29  Stepping = 9
   Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
   Hyperthreading: 2 logical CPUs
real memory  = 1073717248 (1023 MB)
avail memory = 1045372928 (996 MB)
ACPI APIC Table: <COMPAQ 00000083>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
  cpu0 (BSP): APIC ID:  0
  cpu1 (AP): APIC ID:  1
  cpu2 (AP): APIC ID:  6
  cpu3 (AP): APIC ID:  7
MADT: Forcing active-low polarity and level trigger for SCI
ioapic0 <Version 1.1> irqs 0-15 on motherboard
ioapic1 <Version 1.1> irqs 16-31 on motherboard
ioapic2 <Version 1.1> irqs 32-47 on motherboard
ioapic3 <Version 1.1> irqs 48-63 on motherboard
npx0: <math processor> on motherboard
npx0: INT 16 interface
acpi0: <COMPAQ P29> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-safe" frequency 3579545 Hz quality 1000
acpi_timer0: <32-bit timer at 3.579545MHz> port 0x920-0x923 on acpi0
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
pcib0: <ACPI Host-PCI bridge> on acpi0
pci0: <ACPI PCI bus> on pcib0
pci0: <display, VGA> at device 3.0 (no driver attached)
pci0: <base peripheral> at device 4.0 (no driver attached)
pci0: <base peripheral> at device 4.2 (no driver attached)
isab0: <PCI-ISA bridge> at device 15.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <ServerWorks CSB5 UDMA100 controller> port 0x2000-0x200f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 15.1 on pci0
ata0: channel #0 on atapci0
ata1: channel #1 on atapci0
ohci0: <OHCI (generic) USB controller> mem 0xf5ef0000-0xf5ef0fff irq 7 at device 15.2 on pci0
usb0: OHCI version 1.0, legacy support
usb0: SMM does not respond, resetting
usb0: <OHCI (generic) USB controller> on ohci0
usb0: USB revision 1.0
uhub0: (0x1166) OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 4 ports with 4 removable, self powered
pcib1: <ACPI Host-PCI bridge> on acpi0
pci1: <ACPI PCI bus> on pcib1
ciss0: <Compaq Smart Array 5i> port 0x3000-0x30ff mem 0xf7cf0000-0xf7cf3fff,0xf7dc0000-0xf7dfffff irq 30 at device 3.0 on pci1
pcib2: <ACPI Host-PCI bridge> on acpi0
pci2: <ACPI PCI bus> on pcib2
bge0: <Broadcom BCM5703 Gigabit Ethernet, ASIC rev. 0x1002> mem 0xf7ef0000-0xf7efffff irq 29 at device 1.0 on pci2
miibus0: <MII bus> on bge0
brgphy0: <BCM5703 10/100/1000baseTX PHY> on miibus0
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge0: Ethernet address: 00:0e:7f:20:22:91
bge1: <Broadcom BCM5703 Gigabit Ethernet, ASIC rev. 0x1002> mem 0xf7ee0000-0xf7eeffff irq 31 at device 2.0 on pci2
miibus1: <MII bus> on bge1
brgphy1: <BCM5703 10/100/1000baseTX PHY> on miibus1
brgphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge1: Ethernet address: 00:0e:7f:20:22:90
pcib3: <ACPI Host-PCI bridge> on acpi0
pci3: <ACPI PCI bus> on pcib3
pcib4: <ACPI Host-PCI bridge> on acpi0
pci6: <ACPI PCI bus> on pcib4
pci6: <base peripheral, PCI hot-plug controller> at device 30.0 (no driver attached)
acpi_tz0: <Thermal Zone> on acpi0
atkbdc0: <Keyboard controller (i8042)> port 0x64,0x60 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: model Generic PS/2 mouse, device ID 0
sio0: <Standard PC COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
fdc0: <floppy drive controller (FDE)> port 0x3f2-0x3f5 irq 6 drq 2 on acpi0
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
orm0: <ISA Option ROMs> at iomem 0xee000-0xeffff,0xcc000-0xcd7ff,0xc8000-0xcbfff,0xc0000-0xc7fff on isa0
pmtimer0 on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 10.000 msec
IP Filter: v3.4.35 initialized.  Default = pass all, Logging = enabled
acd0: CDROM <COMPAQ CD-ROM SN-124/N104> at ata0-master PIO4
SMP: AP CPU #3 Launched!
SMP: AP CPU #1 Launched!
SMP: AP CPU #2 Launched!
da0 at ciss0 bus 0 target 0 lun 0
da0: <COMPAQ RAID 5  VOLUME OK> Fixed Direct Access SCSI-0 device 
da0: 135.168MB/s transfers
da0: 69455MB (142245120 512 byte sectors: 255H 32S/T 17432C)
Mounting root from ufs:/dev/da0s1a
WARNING: / was not properly dismounted
WARNING: /usr was not properly dismounted
WARNING: /var was not properly dismounted
WARNING: /spool was not properly dismounted
/spool: mount pending error: blocks 5484 files 14

