ZFS + nullfs + Linuxulator = panic?

Wed Feb 22 13:54:37 UTC 2012

On Feb 17, 2012, at 8:27 PM, Konstantin Belousov wrote:

> On Thu, Feb 16, 2012 at 12:07:46PM -0500, Paul Mather wrote:
>> On Feb 16, 2012, at 10:49 AM, Konstantin Belousov wrote:
>> 
>>> On Thu, Feb 16, 2012 at 10:09:27AM -0500, Paul Mather wrote:
>>>> On Feb 14, 2012, at 7:47 PM, Konstantin Belousov wrote:
>>>> 
>>>>> On Tue, Feb 14, 2012 at 09:38:18AM -0500, Paul Mather wrote:
>>>>>> I have a problem with RELENG_8 (FreeBSD/amd64 running a GENERIC kernel, last built 2012-02-08).  It will panic during the daily periodic scripts that run at 3am.  Here is the most recent panic message:
>>>>>> 
>>>>>> Fatal trap 9: general protection fault while in kernel mode
>>>>>> cpuid = 0; apic id = 00
>>>>>> instruction pointer     = 0x20:0xffffffff8069d266
>>>>>> stack pointer           = 0x28:0xffffff8094b90390
>>>>>> frame pointer           = 0x28:0xffffff8094b903a0
>>>>>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>>>>>                      = DPL 0, pres 1, long 1, def32 0, gran 1
>>>>>> processor eflags        = resume, IOPL = 0
>>>>>> current process         = 72566 (ps)
>>>>>> trap number             = 9
>>>>>> panic: general protection fault
>>>>>> cpuid = 0
>>>>>> KDB: stack backtrace:
>>>>>> #0 0xffffffff8062cf8e at kdb_backtrace+0x5e
>>>>>> #1 0xffffffff805facd3 at panic+0x183
>>>>>> #2 0xffffffff808e6c20 at trap_fatal+0x290
>>>>>> #3 0xffffffff808e715a at trap+0x10a
>>>>>> #4 0xffffffff808cec64 at calltrap+0x8
>>>>>> #5 0xffffffff805ee034 at fill_kinfo_thread+0x54
>>>>>> #6 0xffffffff805eee76 at fill_kinfo_proc+0x586
>>>>>> #7 0xffffffff805f22b8 at sysctl_out_proc+0x48
>>>>>> #8 0xffffffff805f26c8 at sysctl_kern_proc+0x278
>>>>>> #9 0xffffffff8060473f at sysctl_root+0x14f
>>>>>> #10 0xffffffff80604a2a at userland_sysctl+0x14a
>>>>>> #11 0xffffffff80604f1a at __sysctl+0xaa
>>>>>> #12 0xffffffff808e62d4 at amd64_syscall+0x1f4
>>>>>> #13 0xffffffff808cef5c at Xfast_syscall+0xfc
>>>>> 
>>>>> Please look up the line number for the fill_kinfo_thread+0x54.
>>>> 
>>>> 
>>>> Is there a way for me to do this from the above information? As
>>>> I said in the original message, I failed to get a crash dump after
>>>> reboot (because, it turns out, I hadn't set up my gmirror swap device
>>>> properly). Alas, with the latest panic, it appears to have hung[1]
>>>> during the "Dumping" phase, so it looks like I won't get a saved crash
>>>> dump this time, either. :-(
>>> 
>>> Load the kernel.debug into kgdb, and from there do
>>> "list *fill_kinfo_thread+0x54".
>> 
>> 
>> gromit# kgdb /usr/obj/usr/src/sys/GENERIC/kernel.debug
>> GNU gdb 6.1.1 [FreeBSD]
>> Copyright 2004 Free Software Foundation, Inc.
>> GDB is free software, covered by the GNU General Public License, and you are
>> welcome to change it and/or distribute copies of it under certain conditions.
>> Type "show copying" to see the conditions.
>> There is absolutely no warranty for GDB.  Type "show warranty" for details.
>> This GDB was configured as "amd64-marcel-freebsd"...
>> (kgdb) list *fill_kinfo_thread+0x54
>> 0xffffffff805ee034 is in fill_kinfo_thread (/usr/src/sys/kern/kern_proc.c:854).
>> 849             thread_lock(td);
>> 850             if (td->td_wmesg != NULL)
>> 851                     strlcpy(kp->ki_wmesg, td->td_wmesg, sizeof(kp->ki_wmesg));
>> 852             else
>> 853                     bzero(kp->ki_wmesg, sizeof(kp->ki_wmesg));
>> 854             strlcpy(kp->ki_ocomm, td->td_name, sizeof(kp->ki_ocomm));
>> 855             if (TD_ON_LOCK(td)) {
>> 856                     kp->ki_kiflag |= KI_LOCKBLOCK;
>> 857                     strlcpy(kp->ki_lockname, td->td_lockname,
>> 858                         sizeof(kp->ki_lockname));
>> (kgdb) 
> 
> This is indeed strange. It can only occur if td pointer is damaged.
> 
> Please, try to get a core and at least print the content of *td in this case.

Another panic last night, after reverting "dsmc schedule" scripts to use "/bin/sh" (actually /compat/linux/bin/sh):

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x308
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff806026ef
stack pointer           = 0x28:0xffffff8094af02d0
frame pointer           = 0x28:0xffffff8094af0350
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 90872 (df)
trap number             = 12
panic: page fault
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff8062cf8e at kdb_backtrace+0x5e
#1 0xffffffff805facd3 at panic+0x183
#2 0xffffffff808e6c20 at trap_fatal+0x290
#3 0xffffffff808e6f71 at trap_pfault+0x201
#4 0xffffffff808e742f at trap+0x3df
#5 0xffffffff808cec64 at calltrap+0x8
#6 0xffffffff80602e1e at _sx_xlock+0x4e
#7 0xffffffff80f9ca35 at rrw_enter+0xa5
#8 0xffffffff80f9ce86 at zfs_statfs+0x46
#9 0xffffffff80681258 at __vfs_statfs+0x28
#10 0xffffffff81476521 at nullfs_statfs+0x51
#11 0xffffffff80681258 at __vfs_statfs+0x28
#12 0xffffffff80690b22 at kern_statfs+0x1b2
#13 0xffffffff80690c77 at statfs+0x37
#14 0xffffffff808e62d4 at amd64_syscall+0x1f4
#15 0xffffffff808cef5c at Xfast_syscall+0xfc

Alas, the system became "hung" here: there is no further output indicating memory being dumped to the dumpdev and no core dump was found during subsequent (forced) reboot. :-(

Note that this panic is different to the previous one.  Also, the presence of nullfs_statfs in the backtrace above is very curious.  According to my logs, the daily backup had already finished successfully and thus the nullfs-mounted file systems would have been unmounted before the system panicked:

02/22/12   02:08:11 --- SCHEDULEREC STATUS END
02/22/12   02:08:11 --- SCHEDULEREC OBJECT END DESKTOP_DAILY_BACKUP 02/22/12   02:00:00
02/22/12   02:08:11 
Executing Operating System command or script:
   /usr/local/bin/remove_zfs_backup_snapshot
02/22/12   02:08:12 Finished command.  Return code is: 0
02/22/12   02:08:12 Scheduled event 'DESKTOP_DAILY_BACKUP' completed successfully.
02/22/12   02:08:12 Sending results for scheduled event 'DESKTOP_DAILY_BACKUP'.
02/22/12   02:08:12 Results sent to server for scheduled event 'DESKTOP_DAILY_BACKUP'.

02/22/12   02:08:12 ANS1483I Schedule log pruning started.
02/22/12   02:08:15 ANS1484I Schedule log pruning finished successfully.

Other logs indicate that the system was up until 3 am, whereupon, presumably, "periodic daily" precipitated a panic somewhere during its execution.

Cheers,

Paul.