Random panics in 11.0 and 12.0 on J1900

James Snow snow at teardrop.org
Wed Jul 10 16:26:20 UTC 2019


I have a set of J1900 hosts running 11.0-RELEASE-p1 that experience
seemingly random panics. The panics are all basically the same:

Fatal trap 12: page fault while in kernel mode
fault code = supervisor read data, page not present

Adding workloads to the hosts seems to increase panic frequency, but the
panics have also occurred on completely idle hosts. Similarly, uptime
when panicking has been as low as minutes, and as high as ~620 days.

For various reasons, it has not been possible to extract a coredump from these
hosts, nor practical to run memtest on them or upgrade them to a newer
release. About 1% of our hosts are affected each day, so we've just been
living with the problem.

However, while testing 12.0 on the same hardware, I encountered the same
panic and was able to capture the core dump. (See below.)

All of my Google-fu on this panic has turned up threads suggesting the
problem is hardware, but there are two problems with that idea...

One, memtest has turned up no errors on the 12.0 host on which I
witnessed the panic.

Two, a small number of systems on the same hardware are running
10.3-RELEASE, and have experienced no panics in their history. Panics
have only happened on 11.0, and now 12.0.

kgdb output from the panic follows. (This particular host was in the
middle of rebooting when it panicked.)

Hoping someone here has some insight. My uninformed wild-ass guess is
something relating to spectre/meltdown fixes.

Thanks,


-Snow



root@j1900_12:~ # kgdb /boot/kernel/kernel /var/crash/vmcore.0
GNU gdb (GDB) 8.3 [GDB v8.3 for FreeBSD]
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd12.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:
<118>.
<118>Terminated
<118>Jul 10 07:03:50 j1900_12 syslogd: last message repeated 9 times
<118>Jul 10 07:04:08 j1900_12 syslogd: exiting on signal 15
Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining... 0 0 0 0 done
Waiting (max 60 seconds) for system thread `bufdaemon' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-0' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-1' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-2' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-3' to stop... done
All buffers synced.
Uptime: 23h22m43s
umass0: detached
ukbd0: detached
uhid0: detached
uhub3: detached
uhub2: detached


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x3201c450
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80b7ad1d
stack pointer	        = 0x28:0xfffffe003f231820
frame pointer	        = 0x28:0xfffffe003f231890
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 1 (init)
trap number		= 12
panic: page fault
cpuid = 0
time = 1562742255
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff808a6029 at re_shutdown+0x99
#8 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#9 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#10 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#11 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#12 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#13 0xffffffff80452a8d at acpi_shutdown+0xd
#14 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#15 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#16 0xffffffff80bdbb6e at root_bus_module_handler+0x11e
#17 0xffffffff80b7a86f at module_shutdown+0x6f
Uptime: 23h22m44s
Dumping 494 out of 7976 MB:..4%..13%..23%..33%..43%..52%..62%..72%..81%..91%

__curthread () at ./machine/pcpu.h:230
230	./machine/pcpu.h: No such file or directory.
(kgdb) bt
#0  __curthread () at ./machine/pcpu.h:230
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80b9b14b in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:446
#3  0xffffffff80b9b5c3 in vpanic (fmt=<optimized out>, ap=0xfffffe003f231570) at /usr/src/sys/kern/kern_shutdown.c:872
#4  0xffffffff80b9b3b3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:799
#5  0xffffffff8107496f in trap_fatal (frame=0xfffffe003f231760, eva=838976592) at /usr/src/sys/amd64/amd64/trap.c:929
#6  0xffffffff810749c9 in trap_pfault (frame=0xfffffe003f231760, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:765
#7  0xffffffff81073fee in trap (frame=0xfffffe003f231760) at /usr/src/sys/amd64/amd64/trap.c:441
#8  <signal handler called>
#9  __mtx_lock_sleep (c=0xfffffe00493fa230, v=<optimized out>) at /usr/src/sys/kern/kern_mutex.c:565
#10 0xffffffff808a6029 in re_shutdown (dev=<optimized out>) at /usr/src/sys/dev/re/if_re.c:3772
#11 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#12 device_shutdown (dev=0xfffff800037d9100) at /usr/src/sys/kern/subr_bus.c:3065
#13 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#14 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#15 device_shutdown (dev=0xfffff800037d9200) at /usr/src/sys/kern/subr_bus.c:3065
#16 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#17 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#18 device_shutdown (dev=0xfffff80003626900) at /usr/src/sys/kern/subr_bus.c:3065
#19 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#20 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#21 device_shutdown (dev=0xfffff80003627400) at /usr/src/sys/kern/subr_bus.c:3065
#22 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#23 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#24 device_shutdown (dev=0xfffff8000355f300) at /usr/src/sys/kern/subr_bus.c:3065
#25 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#26 0xffffffff80452a8d in acpi_shutdown (dev=0xfffffe00493fa230) at /usr/src/sys/dev/acpica/acpi.c:758
#27 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#28 device_shutdown (dev=0xfffff80003560400) at /usr/src/sys/kern/subr_bus.c:3065
#29 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#30 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#31 device_shutdown (dev=0xfffff8000334ea00) at /usr/src/sys/kern/subr_bus.c:3065
#32 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#33 0xffffffff80bdbb6e in DEVICE_SHUTDOWN (dev=0xfffff8000337dd00) at ./device_if.h:262
#34 device_shutdown (dev=0xfffff8000337dd00) at /usr/src/sys/kern/subr_bus.c:3065
#35 root_bus_module_handler (mod=<optimized out>, what=<optimized out>, arg=<optimized out>) at /usr/src/sys/kern/subr_bus.c:4951
#36 0xffffffff80b7a86f in module_shutdown (arg1=<optimized out>, arg2=<optimized out>) at /usr/src/sys/kern/kern_module.c:104
#37 0xffffffff80b9b1da in kern_reboot (howto=0) at /usr/src/sys/kern/kern_shutdown.c:449
#38 0xffffffff80b9acb1 in sys_reboot (td=0xfffff80003320580, uap=0xfffff80003320940) at /usr/src/sys/kern/kern_shutdown.c:280
#39 0xffffffff81075449 in syscallenter (td=<optimized out>) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:135
#40 amd64_syscall (td=0xfffff80003320580, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1076
#41 <signal handler called>
#42 0x0000000000244e4a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe6e8
(kgdb) list *0xffffffff80b7ad1d
0xffffffff80b7ad1d is in __mtx_lock_sleep (/usr/src/sys/kern/kern_mutex.c:565).
560			/*
561			 * If the owner is running on another CPU, spin until the
562			 * owner stops running or the state of the lock changes.
563			 */
564			owner = lv_mtx_owner(v);
565			if (TD_IS_RUNNING(owner)) {
566				if (LOCK_LOG_TEST(&m->lock_object, 0))
567					CTR3(KTR_LOCK,
568					    "%s: spinning on %p held by %p",
569					    __func__, m, owner);
(kgdb)
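
A quick cross-check of the numbers in the dump above, as a rough sketch rather than part of the original kgdb session. It only re-reads values already printed in the trap message and backtrace: the fault virtual address, the `eva` argument to `trap_fatal`, and the `time =` field. The helper name `in_amd64_kernel_half` and the constant `KERNEL_HALF_START` are my own, and the boundary assumes the usual amd64 split where kernel addresses sit in the upper canonical half.

```python
from datetime import datetime, timezone

# Values copied from the panic message and backtrace above.
fault_va = 0x3201C450          # "fault virtual address"
eva = 838976592                # trap_fatal(..., eva=838976592)
instruction_ptr = 0xFFFFFFFF80B7AD1D
panic_time = 1562742255        # "time = 1562742255"

# Assumption: kernel virtual addresses on amd64 live in the upper
# canonical half, i.e. at or above 0xffff800000000000.
KERNEL_HALF_START = 0xFFFF800000000000


def in_amd64_kernel_half(addr: int) -> bool:
    """Return True if addr falls in the upper (kernel) canonical half."""
    return addr >= KERNEL_HALF_START


# kgdb prints eva in decimal; it is the same address as the hex fault VA.
print(hex(eva), eva == fault_va)

# The faulting instruction is kernel text, but the address it read is not
# a kernel address, so the pointer being dereferenced looks bogus.
print(in_amd64_kernel_half(instruction_ptr), in_amd64_kernel_half(fault_va))

# The panic timestamp, interpreted as seconds since the epoch in UTC.
print(datetime.fromtimestamp(panic_time, tz=timezone.utc).isoformat())
```

The decimal `eva` and the hex fault address are the same value, and the timestamp falls a few seconds after the last syslogd line in the message buffer, assuming the host logs in UTC. None of this identifies the cause; it only confirms that the trap message and the backtrace describe the same fault.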

