ZFS + nullfs + Linuxulator = panic?

Tue Feb 14 23:49:31 UTC 2012

I have a problem with a RELENG_8 system (FreeBSD/amd64 running a GENERIC kernel, last built 2012-02-08).  It panics during the daily periodic scripts that run at 3am.  Here is the most recent panic message:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xffffffff8069d266
stack pointer           = 0x28:0xffffff8094b90390
frame pointer           = 0x28:0xffffff8094b903a0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 72566 (ps)
trap number             = 9
panic: general protection fault
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff8062cf8e at kdb_backtrace+0x5e
#1 0xffffffff805facd3 at panic+0x183
#2 0xffffffff808e6c20 at trap_fatal+0x290
#3 0xffffffff808e715a at trap+0x10a
#4 0xffffffff808cec64 at calltrap+0x8
#5 0xffffffff805ee034 at fill_kinfo_thread+0x54
#6 0xffffffff805eee76 at fill_kinfo_proc+0x586
#7 0xffffffff805f22b8 at sysctl_out_proc+0x48
#8 0xffffffff805f26c8 at sysctl_kern_proc+0x278
#9 0xffffffff8060473f at sysctl_root+0x14f
#10 0xffffffff80604a2a at userland_sysctl+0x14a
#11 0xffffffff80604f1a at __sysctl+0xaa
#12 0xffffffff808e62d4 at amd64_syscall+0x1f4
#13 0xffffffff808cef5c at Xfast_syscall+0xfc
Uptime: 3d19h6m0s
Dumping 1308 out of 2028 MB:..2%..12%..21%..31%..41%..51%..62%..71%..81%..91%
Dump complete
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...

The reason for the subject line is that I have another RELENG_8 system that uses ZFS + nullfs but doesn't panic, leading me to believe that ZFS + nullfs alone is not the problem.  I am wondering whether it is the combination of all three that is deadly here.

Both RELENG_8 systems are root-on-ZFS installs.  Each night there is a separate backup script that runs and completes before the regular "periodic daily" run.  This script takes a recursive snapshot of the ZFS pool and then mounts these snapshots via mount_nullfs to provide a coherent view of the filesystem under /backup.  The only difference between the two RELENG_8 systems is that one uses rsync to back up /backup to another machine and the other uses the Linux Tivoli TSM client to back up /backup to a TSM server.  After the backup is completed, a script runs that unmounts the nullfs file systems and then destroys the ZFS snapshot.
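
A minimal sketch of that snapshot / nullfs / teardown sequence follows.  It is not the actual backup script: the pool name "zroot", the snapshot name "nightly", and the list of mountpoints are hypothetical, and the code only builds the command lines (nothing is executed), so the ordering can be checked safely.  It assumes each dataset exposes its snapshot under <mountpoint>/.zfs/snapshot/<name>.

```python
import posixpath


def snapshot_dir(mountpoint, snap):
    """Path where ZFS exposes snapshot `snap` of the dataset mounted at `mountpoint`."""
    return posixpath.join(mountpoint, ".zfs", "snapshot", snap)


def backup_target(backup_root, mountpoint):
    """Where the snapshot of `mountpoint` should appear under `backup_root`."""
    return posixpath.normpath(posixpath.join(backup_root, mountpoint.lstrip("/")))


def setup_commands(pool, mountpoints, snap, backup_root="/backup"):
    """Recursive snapshot first, then one nullfs mount per dataset."""
    cmds = [["zfs", "snapshot", "-r", f"{pool}@{snap}"]]
    # Sorted order mounts parents before children ("/" before "/usr").
    for mp in sorted(mountpoints):
        cmds.append(["mount_nullfs", snapshot_dir(mp, snap),
                     backup_target(backup_root, mp)])
    return cmds


def teardown_commands(pool, mountpoints, snap, backup_root="/backup"):
    """Unmount children before parents, then destroy the recursive snapshot."""
    cmds = [["umount", backup_target(backup_root, mp)]
            for mp in sorted(mountpoints, reverse=True)]
    cmds.append(["zfs", "destroy", "-r", f"{pool}@{snap}"])
    return cmds


if __name__ == "__main__":
    # Hypothetical pool layout, for illustration only.
    mounts = ["/", "/usr", "/var"]
    for cmd in setup_commands("zroot", mounts, "nightly"):
        print(" ".join(cmd))
    print("# ... backup of /backup runs here (rsync or dsmc) ...")
    for cmd in teardown_commands("zroot", mounts, "nightly"):
        print(" ".join(cmd))
```

The backup client (rsync on one machine, the Linux dsmc on the other) runs between the two command lists, which is the only step that differs between the two systems.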

The first (rsync backup) RELENG_8 system does not panic.  It has been running the ZFS + nullfs rsync backup job without incident for weeks now.  The second (Tivoli TSM) RELENG_8 system reliably panics when the subsequent "periodic daily" job runs.  (It is using the 32-bit TSM 6.2.4 Linux client running "dsmc schedule" via the linux_base-f10-10_4 package.)  The actual ZFS + nullfs Tivoli TSM backup job appears to run successfully, making me wonder whether the backup run leaks memory or causes some other subtle corruption that sets up the ensuing panic when the "periodic daily" job later gives the system a workout.

If I can provide more information about the panic, please let me know.  Despite the message about dumping in the panic output above, when the system reboots I get a "No core dumps found" message during boot.  (I have dumpdev="AUTO" set in /etc/rc.conf.)  My swap device is on separate partitions but is mirrored using geom_mirror as /dev/mirror/swap.  Do crash dumps to gmirror devices work on RELENG_8?

Does anyone have any idea what is to blame for the panic, or how I can fix or work around it?

Cheers,

Paul.

PS: The uptime of three days in the panic message is because I disabled the Tivoli TSM backup job on Friday so it would not run over the weekend.