Lock up problems with 5.3-STABLE (was: Cannot build kernel with
options WITNESS)
Artem Kuchin
matrix at itlegion.ru
Mon Jan 24 04:31:03 PST 2005
> On Sun, 23 Jan 2005, Artem Kuchin wrote:
>
>> > On Sat, 22 Jan 2005, Artem Kuchin wrote:
>> >
>> >> I cvsupped 5.3-STABLE just an hour ago and cannot build a kernel with
>> >> WITNESS. It complains:
>> >
>> > This occurs when building WITNESS without DDB in the kernel, which was not
>> > a tested build case when I added "show alllocks", and apparently is a
>> > relatively uncommon configuration as you're the first person to bump into
>> > it. I've just committed the fix as subr_witness.c:1.187 in HEAD, and
>> > subr_witness.c:1.178.2.4 in RELENG_5. Please let me know if this doesn't
>> > fix the problem for you.
>>
>> It fixed the problem. I am actually struggling with unpredictable weird
>> lock-ups, when the host can be pinged but I cannot connect via
>> ssh/telnet or httpd or anything else. It happens without any visible reason.
>> I am running several jails with mysql and apache in each and cannot make
>> the whole system stable yet.
>
> This is typically a sign of one of two problems:
>
> - The system is live locked due to very high load, so the ithread,
> netisrs, etc, in the kernel run fine, but user processes don't get a
> chance to run.
>
> - The system is dead locked due to user space processes getting wedged on
> common locks, but the kernel ithreads and netisrs can keep on
> responding.
>
> I generally assume that it's a deadlock as opposed to a live lock. I'd
> compile a kernel with DDB, KDB, WITNESS, and BREAK_TO_DEBUGGER. When the
> system appears to wedge, break into the debugger using a console or serial
> break (FYI: serial break is more reliable, and you get the benefit of
> being able to easily copy and paste debugging output using the serial
> console for DDB). Use "show alllocks" and "show lockedvnods" to examine
> most of the system's locking state. Chances are, all the
> interesting processes are stacked up on either VFS or VM locks, since those kinds
> of deadlocks produce the exact symptoms you describe: ping works fine
> because it only hits the netisr, but when you open TCP connections, the
> sshd (etc) block on VM or VFS locks attempting to fork new children or
> access a file in the file system name space. At first, the TCP
> connections will establish but there will be no application data; after a
> bit, they will not even return a SYN/ACK because the listen queue for the
> listen socket has filled.
>
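The TCP symptom stages described above (connections that establish but carry no data, then connections that no longer establish at all) could be told apart from a remote host with a small probe. The following Python sketch is only an illustration and is not from this thread: the function name `probe`, the three result strings and the default timeout are assumptions. It only makes sense for services that send a greeting on connect (ssh, SMTP, POP3); an HTTP server would need a request sent first.

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify a TCP service by how far a connection attempt gets.

    Returns one of (hypothetical labels chosen for this sketch):
      "no-connect"        - connect refused, unreachable or timed out
      "connected-no-data" - handshake completed but no greeting arrived
      "responding"        - the service sent at least one byte
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refusal and timeouts, e.g. a full listen queue.
        return "no-connect"
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(128)
        except socket.timeout:
            # Kernel accepted the connection, the application never spoke.
            return "connected-no-data"
        except OSError:
            return "no-connect"
    # An empty read means the peer closed without sending anything.
    return "responding" if data else "connected-no-data"
```

Run periodically against ports 22, 25 and 110, a change from "responding" to "connected-no-data" while ping still works would match the user-space deadlock case rather than a hard lock.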
Well, I cvsupped and recompiled the kernel with WITNESS and INVARIANTS, turned off
adaptive Giant, and got a lock-up today at 7 am. Since the server is managed remotely
and I cannot connect a serial console to it, I took my digital camera and went to the server.
I expected to see some special message about something going wrong, then to break
into the debugger (CTRL+ALT+ESC) and take some pictures of the console dumps.
But I saw nothing. The last message on the screen was about an ssh login the previous evening,
and the last message in /var/log/all.log was about an entropy save from cron.
I could not break into the debugger using CTRL+ALT+ESC; it did nothing. So
it looked like a hard lock.

At this point I would like to tell the whole story.

We bought this server in May 2004 and decided to test the hardware extensively
while there was no 5.3 yet; we actually expected it around August. So we installed
5-CURRENT and ran high-load tests (CPU, memory, disk storage, network) from
/usr/ports/benchmarks, both all at the same time and one by one, for several weeks. There was
not a single glitch. After that we turned it off and waited for the RELEASE. The RELEASE
came and we began to set up the server for its real work. As the server's
primary mission is to host a bunch of sites, we decided to set up a jail for each site.
We did so in December and put the server in the provider's co-location several
kilometers away from the office. The next day the server locked up. We were surprised
but just rebooted it. It locked up again the next day. We cvsupped and rebuilt the
system and the jails. The server locked up the next day. During the New Year break
I figured out that if more than one jail is running, the server locks up within
24 hours with very high probability and within 48 hours with 100% probability. I
wrote to freebsd-stable about it. You asked for debugger output (pcpu, list of
locks, etc.). I could not provide it at that time, so I did not reply and just cvsupped at
the beginning of January and rebuilt the system and the jails again. Magically, after
that I could run 5 jails (I did not try more) for over a week, and I had already decided that
the bug was fixed and I could host the sites. Alas, the next glitch did not wait too long.

After a few more days I saw a strange situation: I could not connect to the server using
SSH. SSH complained about an auth key or something like that. I rebooted the system and
ssh worked fine. I still have no idea what that was, but I set up IPFIREWALL and a telnet
server accepting connections from only one IP address, so that if ssh failed I could use telnet.
After that I moved a real site with Perl scripts, a 1 GB database and mail accounts (using qmail+vpopmail)
into one of the jails, and the next day got the next problem: I could ping the server, but could not
connect using ssh, www, or telnet (ports 110, 25, 23). I tried to recompile the kernel with INVARIANTS and
WITNESS and with adaptive Giant disabled. I could not, so I wrote to you about it. You fixed
the source, I recompiled again, and today I got a lock-up again with all those
options enabled, and this time I could not even ping the server.

I could think that there is something wrong with the hardware, but it passed
many days of testing. Anyway, my current ideas are:
1) Something wrong with the jail code
2) Something wrong with the SMP code
3) Something wrong with the HYPERTHREADING code
4) Something wrong with the memory disk code (md device, which I use)
5) Something wrong with the hardware

So today I went into the BIOS and turned off hyperthreading, fast string operations and
all other 'more advanced' features. I also turned off the IDE controller on the motherboard.
This rules out a HYPERTHREADING code problem and, to some extent, a hardware problem.
I turned off MD usage (no more memory disk, although I actually need it very badly),
so that rules out the md code problem.

Now I will run a web access test (a simulation of browsing) for a week. If the
server does not lock up, I will consider that I have found a workaround for some
hidden bug, and that the bug is somewhere in the md code, the HT code or the hardware.
If it locks up again, then I will give up jails and try for one more week. If it does not
lock up, the jail code is the problem.
If it locks up without jails, then I will turn off SMP and try again.
If it locks up with everything turned off, then the hardware is faulty and I will have the further
choice of hanging myself or shooting myself in the head.
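
The elimination plan above amounts to a short decision table. As a rough sketch only (the names `STAGES` and `diagnose` are made up for illustration, and the "SMP code" conclusion for a clean run with SMP off is an inference that the plan does not state outright), it could be written down like this:

```python
# Each stage disables more features than the one before it.
# (what is turned off in this stage, suspect if the week passes cleanly)
STAGES = [
    ("HT, BIOS extras, onboard IDE and md off", "md code, HT code or hardware"),
    ("jails off as well", "jail code"),
    ("SMP off as well", "SMP code"),  # assumption: implied, not stated
]


def diagnose(locked_up):
    """Map test results to a suspect.

    locked_up holds one boolean per completed stage, in order:
    True if the server locked up during that stage's test week.
    """
    for (_, suspect), locked in zip(STAGES, locked_up):
        if not locked:
            # First clean week: blame what was disabled for that stage.
            return suspect
    if len(locked_up) < len(STAGES):
        # Every completed stage locked up, but stages remain.
        return "undetermined, next stage: " + STAGES[len(locked_up)][0]
    # Locked up even with everything turned off.
    return "hardware"
```

For example, `diagnose([True, False])` means the first week locked up and the week without jails did not, which points at the jail code.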

I would like to see your and others' comments on the story, and I have one
more question: what does options _KPOSIX_PRIORITY_SCHEDULING
do? Could it be somehow related to the problem?
The hardware is:
MB dual xeon Supermicro X5DPE-G2
CPU P4 XEON 2,667Ghz 512Kb cache 533mhz socket 604
2 Gb 266Mhz, DDR, ECC, Reg, 1GB dimm
4 HDDs 120Gb (Seagate Barracuda 7200.7)
3Ware Escalade 8506-4LP
Case Supermicro SC822T-550LP
Slim DVD/CD-RW Toshiba SD-R2412B IDE (OEM)
Today's kernel CONFIG, which locked up:
machine i386
cpu I486_CPU
cpu I586_CPU
cpu I686_CPU
ident OMNI2
options SMP
options QUOTA
options SCHED_4BSD # 4BSD scheduler
options INET # InterNETworking
options INET6 # IPv6 communications protocols
options FFS # Berkeley Fast Filesystem
options SOFTUPDATES # Enable FFS soft updates support
options UFS_ACL # Support for access control lists
options UFS_DIRHASH # Improve performance on big directories
#options MD_ROOT # MD is a potential root device
#options NFSCLIENT # Network Filesystem Client
#options NFSSERVER # Network Filesystem Server
#options NFS_ROOT # NFS usable as /, requires NFSCLIENT
options MSDOSFS # MSDOS Filesystem
options CD9660 # ISO 9660 Filesystem
options PROCFS # Process filesystem (requires PSEUDOFS)
options PSEUDOFS # Pseudo-filesystem framework
options GEOM_GPT # GUID Partition Tables.
options COMPAT_43 # Compatible with BSD 4.3 [KEEP THIS!]
options COMPAT_FREEBSD4 # Compatible with FreeBSD4
#options SCSI_DELAY=15000 # Delay (in ms) before probing SCSI
options KTRACE # ktrace(1) support
options SYSVSHM # SYSV-style shared memory
options SYSVMSG # SYSV-style message queues
options SYSVSEM # SYSV-style semaphores
options _KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions
#options KBD_INSTALL_CDEV # install a CDEV entry in /dev
device apic # I/O APIC
# Bus support. Do not remove isa, even if you have no isa slots
device isa
device pci
# Floppy drives
device fdc
# ATA and ATAPI devices
device ata
device atadisk # ATA disk drives
device ataraid # ATA RAID drives
device atapicd # ATAPI CDROM drives
#device atapifd # ATAPI floppy drives
#device atapist # ATAPI tape drives
options ATA_STATIC_ID # Static device numbering
# SCSI peripherals
device scbus # SCSI bus (required for SCSI)
device da # Direct Access (disks)
device pass # Passthrough device (direct SCSI access)
device twe # 3ware ATA RAID
# atkbdc0 controls both the keyboard and the PS/2 mouse
device atkbdc # AT keyboard controller
device atkbd # AT keyboard
device psm # PS/2 mouse
device vga # VGA video card driver
device splash # Splash screen and screen saver support
# syscons is the default console driver, resembling an SCO console
device sc
device agp # support several AGP chipsets
# Floating point support - do not disable.
device npx
# Power management support (see NOTES for more options)
#device apm
# Add suspend/resume support for the i8254.
#device pmtimer
# Serial (COM) ports
device sio # 8250, 16[45]50 based serial ports
# Parallel port
device ppc
device ppbus # Parallel port bus (required)
device lpt # Printer
device ppi # Parallel port interface device
#device vpo # Requires scbus and da
device miibus # MII bus support
device fxp # Intel EtherExpress PRO/100B (82557, 82558)
device em
device loop # Network loopback
device mem # Memory and kernel memory devices
device io # I/O device
device random # Entropy device
device ether # Ethernet support
#device sl # Kernel SLIP
#device ppp # Kernel PPP
device tun # Packet tunnel.
device pty # Pseudo-ttys (telnet etc)
device md # Memory "disks"
#device gif # IPv6 and IPv4 tunneling
#device faith # IPv6-to-IPv4 relaying (translation)
device bpf # Berkeley packet filter
# USB support
device uhci # UHCI PCI->USB interface
device ohci # OHCI PCI->USB interface
device usb # USB Bus (required)
#device udbp # USB Double Bulk Pipe devices
device ugen # Generic
device uhid # "Human Interface Devices"
device ulpt # Printer
device umass # Disks/Mass storage - Requires scbus and da
# FireWire support
device firewire # FireWire bus code
#device sbp # SCSI over FireWire (Requires scbus and da)
#device fwe # Ethernet over FireWire (non-standard!)
options IPFIREWALL
options IPFIREWALL_VERBOSE
options IPFIREWALL_VERBOSE_LIMIT=10000
options IPFIREWALL_DEFAULT_TO_ACCEPT
device snp
device speaker
options DDB
options KDB
options BREAK_TO_DEBUGGER
options INVARIANT_SUPPORT
options INVARIANTS
options WITNESS
options WITNESS_KDB
options WITNESS_SKIPSPIN
#options ADAPTIVE_GIANT # Giant mutex is adaptive.
DMESG (the config which got locked):
Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.3-STABLE #3: Sun Jan 23 01:04:00 MSK 2005
matrix at omni2.itlegion.ru:/usr/obj/usr/src/sys/OMNI2
WARNING: WITNESS option enabled, expect reduced performance.
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(TM) CPU 2.66GHz (2665.93-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0xf25 Stepping = 5
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Hyperthreading: 2 logical CPUs
real memory = 4160225280 (3967 MB)
avail memory = 4077486080 (3888 MB)
ACPI APIC Table: <PTLTD APIC >
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
cpu2 (AP): APIC ID: 6
cpu3 (AP): APIC ID: 7
ioapic0 <Version 2.0> irqs 0-23 on motherboard
ioapic1 <Version 2.0> irqs 24-47 on motherboard
ioapic2 <Version 2.0> irqs 48-71 on motherboard
ioapic3 <Version 2.0> irqs 72-95 on motherboard
ioapic4 <Version 2.0> irqs 96-119 on motherboard
npx0: [FAST]
npx0: <math processor> on motherboard
npx0: INT 16 interface
acpi0: <PTLTD RSDT> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0
cpu0: <ACPI CPU (2 Cx states)> on acpi0
cpu1: <ACPI CPU (2 Cx states)> on acpi0
cpu2: <ACPI CPU (2 Cx states)> on acpi0
cpu3: <ACPI CPU (2 Cx states)> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pci0: <unknown> at device 0.1 (no driver attached)
pcib1: <ACPI PCI-PCI bridge> at device 2.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pci1: <base peripheral, interrupt controller> at device 28.0 (no driver attached)
pcib2: <ACPI PCI-PCI bridge> at device 29.0 on pci1
pci2: <ACPI PCI bus> on pcib2
pci1: <base peripheral, interrupt controller> at device 30.0 (no driver attached)
pcib3: <ACPI PCI-PCI bridge> at device 31.0 on pci1
pci3: <ACPI PCI bus> on pcib3
em0: <Intel(R) PRO/1000 Network Connection, Version - 1.7.35> port 0x3000-0x303f mem 0xfc200000-0xfc21ffff irq 28 at device 2.0 on pci3
em0: Ethernet address: 00:30:48:2a:2d:bc
em0: Speed:N/A Duplex:N/A
em1: <Intel(R) PRO/1000 Network Connection, Version - 1.7.35> port 0x3040-0x307f mem 0xfc220000-0xfc23ffff irq 29 at device 2.1 on pci3
em1: Ethernet address: 00:30:48:2a:2d:bd
em1: Speed:N/A Duplex:N/A
pcib4: <ACPI PCI-PCI bridge> at device 3.0 on pci0
pci4: <ACPI PCI bus> on pcib4
pci4: <base peripheral, interrupt controller> at device 28.0 (no driver attached)
pcib5: <ACPI PCI-PCI bridge> at device 29.0 on pci4
pci5: <ACPI PCI bus> on pcib5
pci4: <base peripheral, interrupt controller> at device 30.0 (no driver attached)
pcib6: <ACPI PCI-PCI bridge> at device 31.0 on pci4
pci6: <ACPI PCI bus> on pcib6
twe0: <3ware Storage Controller. Driver version 1.50.01.002> port 0x4000-0x400f mem 0xfc800000-0xfcffffff irq 72 at device 1.0 on pci6
twe0: [GIANT-LOCKED]
twe0: 4 ports, Firmware FE7S 1.05.00.063, BIOS BE7X 1.08.00.048
uhci0: <Intel 82801CA/CAM (ICH3) USB controller USB-A> port 0x2000-0x201f irq 16 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
usb0: <Intel 82801CA/CAM (ICH3) USB controller USB-A> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <Intel 82801CA/CAM (ICH3) USB controller USB-B> port 0x2020-0x203f irq 19 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
usb1: <Intel 82801CA/CAM (ICH3) USB controller USB-B> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: <Intel 82801CA/CAM (ICH3) USB controller USB-C> port 0x2040-0x205f irq 18 at device 29.2 on pci0
uhci2: [GIANT-LOCKED]
usb2: <Intel 82801CA/CAM (ICH3) USB controller USB-C> on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
pcib7: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci7: <ACPI PCI bus> on pcib7
pci7: <display, VGA> at device 1.0 (no driver attached)
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel ICH3 UDMA100 controller> port 0x2060-0x206f,0x3f6,0x1f0-0x1f7 at device 31.1 on pci0
ata0: channel #0 on atapci0
ata2: channel #1 on atapci0
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
acpi_button0: <Power Button> on acpi0
speaker0: <PC speaker> port 0x61 on acpi0
atkbdc0: <Keyboard controller (i8042)> port 0x64,0x60 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
atkbd0: [GIANT-LOCKED]
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
fdc0: <floppy drive controller> port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on acpi0
fdc0: [FAST]
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
orm0: <ISA Option ROMs> at iomem 0xe0000-0xe3fff,0xc9000-0xc9fff,0xc8000-0xc8fff,0xc0000-0xc7fff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 10.000 msec
ipfw2 initialized, divert disabled, rule-based forwarding disabled, default to accept, logging limited to 10000 packets/entry by default
acd0: CDRW <TOSHIBA DVD-ROM SD-R2412/1015> at ata0-slave UDMA33
twed0: <Unit 0, RAID5, Normal> on twe0
twed0: 343417MB (703318656 sectors)
SMP: AP CPU #2 Launched!
SMP: AP CPU #1 Launched!
SMP: AP CPU #3 Launched!
Mounting root from ufs:/dev/twed0s1a
em0: Link is up 100 Mbps Full Duplex