Reproducible ZFS panic, w/ script (Was: "New" ZFS crash on FS (pool?) unmount/export)
Thomas Backman
serenity at exscape.org
Fri Jul 10 19:01:49 UTC 2009
On Jun 20, 2009, at 09:11, Thomas Backman wrote:
> I just ran into this tonight. Not sure exactly what triggered it -
> the box stopped responding to pings at 02:07AM and it has a cron
> backup job using zfs send/recv at 02:00, so I'm guessing it's
> related, even though the backup probably should have finished before
> then... Hmm. Anyway.
>
> r194478.
>
> kernel trap 12 with interrupts disabled
>
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address = 0x288
> fault code = supervisor read data, page not present
> instruction pointer = 0x20:0xffffffff805a4989
> stack pointer = 0x28:0xffffff803e8b57e0
> frame pointer = 0x28:0xffffff803e8b5840
> code segment = base 0x0, limit 0xfffff, type 0x1b
> = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags = resume, IOPL = 0
> current process = 57514 (zpool)
> panic: from debugger
> cpuid = 0
> Uptime: 10h22m13s
> Physical memory: 2027 MB
>
> (kgdb) bt
> ...
> #9  0xffffffff80887fb2 in trap (frame=0xffffff803e8b5730) at /usr/src/sys/amd64/amd64/trap.c:345
> #10 0xffffffff8086e007 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:223
> #11 0xffffffff805a4989 in _sx_xlock_hard (sx=0xffffff0043557d50, tid=18446742975830720512, opts=Variable "opts" is not available.)
>     at /usr/src/sys/kern/kern_sx.c:575
> #12 0xffffffff805a52fe in _sx_xlock (sx=Variable "sx" is not available.) at sx.h:155
> #13 0xffffffff80fe2995 in zfs_freebsd_reclaim () from /boot/kernel/zfs.ko
> #14 0xffffffff808cefca in VOP_RECLAIM_APV (vop=0xffffff0043557d38, a=0xffffff0043557d50) at vnode_if.c:1926
> #15 0xffffffff80626f6e in vgonel (vp=0xffffff00437a7938) at vnode_if.h:830
> #16 0xffffffff8062b528 in vflush (mp=0xffffff0060f2a000, rootrefs=0, flags=0, td=0xffffff0061528000)
>     at /usr/src/sys/kern/vfs_subr.c:2450
> #17 0xffffffff80fdd3a8 in zfs_umount () from /boot/kernel/zfs.ko
> #18 0xffffffff8062420a in dounmount (mp=0xffffff0060f2a000, flags=1626513408, td=Variable "td" is not available.)
>     at /usr/src/sys/kern/vfs_mount.c:1287
> #19 0xffffffff80624975 in unmount (td=0xffffff0061528000, uap=0xffffff803e8b5c00)
>     at /usr/src/sys/kern/vfs_mount.c:1172
> #20 0xffffffff8088783f in syscall (frame=0xffffff803e8b5c90) at /usr/src/sys/amd64/amd64/trap.c:984
> #21 0xffffffff8086e290 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:364
> #22 0x000000080104e49c in ?? ()
> Previous frame inner to this frame (corrupt stack?)
>
> BTW, I got a (one) "force unmount is experimental" on the console.
> On regular shutdown I usually get one per filesystem, it seems (at
> least 10) and this pool should contain exactly as many filesystems
> as the root pool since it's a copy of it. On running the backup
> script manually post-crash, though, I didn't get any.
OK, I've finally written a script that reproduces this panic for me
every time (6-7 tries in a row should be good enough, plus one on
another box). It would be great to have a few testers - and if you do
test it, PLEASE report your results here - positive or negative!
The main aim is, of course, to provide ZFS devs with their own core
dumps, DDB consoles and whatnot to possibly resolve this issue.
It requires:
* bash (or something compatible, for the script itself)
* a box you're willing to crash. ;)
* ~200MB free on your /root/ partition (or just edit the paths at the
top of the ugly hack of a script.)
* It uses ggatel with silly high unit numbers (1482 and 1675 - chosen
since they're unlikely to be in use); if they clash on your box,
again, edit at the top.
* The libzfs_sendrecv.c patch - see http://lists.freebsd.org/pipermail/freebsd-current/2009-May/006814.html
or fetch the patch from my server (if it's down, it's for less than
2 minutes - my modem restarts every 181 minutes atm...):
http://exscape.org/temp/libzfs_sendrecv.patch
Here's a oneliner to fetch, patch, compile and install:
cd /usr/src && fetch http://exscape.org/temp/libzfs_sendrecv.patch && \
  patch -p0 < libzfs_sendrecv.patch && \
  cd /usr/src/cddl/lib/libzfs && make && make install
No reboot required (nor do you need a reboot to revert, see the end of
the mail).
PLEASE NOTE that without this patch, you'll just get a segfault and
then an infinite loop of backups (send/recv has been broken in HEAD
since at least February, I'm guessing since ZFS v13). Too bad that the
patch is needed, since I suspect quite a few testers will bail out
there...
If not (great!), here's the script to wreak havoc:
http://exscape.org/temp/zfs_clone_panic.sh
After fetching the above, just run the script like "bash
zfs_clone_panic.sh crash" and you should have a panic in about 20
seconds. The other possible arguments are mostly there because I'm
too lazy to clean it up now that it "works".
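In case you want to see what the script touches before running it, the
setup boils down to something like this (a rough sketch only - the file,
pool and dataset names here are my illustrations, not the script's; see
the script itself for the real ones):

```shell
# Two file-backed ggate devices, using the high unit numbers
# mentioned above (backing files land under /root/):
truncate -s 100m /root/zfs_panic_disk1
truncate -s 100m /root/zfs_panic_disk2
ggatel create -u 1482 /root/zfs_panic_disk1   # shows up as /dev/ggate1482
ggatel create -u 1675 /root/zfs_panic_disk2   # shows up as /dev/ggate1675

# Scratch pools on top of them (names are placeholders):
zpool create crashpool /dev/ggate1482
zpool create crashtarget /dev/ggate1675

# Teardown after a (non-crashing) run:
zpool destroy crashtarget
zpool destroy crashpool
ggatel destroy -u 1675
ggatel destroy -u 1482
rm /root/zfs_panic_disk1 /root/zfs_panic_disk2
```

Nothing outside those backing files and pools should be at risk - apart
from the panic itself, of course.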
Back to the panic:
The problem appears to be related to clones somehow - the first two
times I ran into this panic (in real use) were when messing with clone/
promote... so that's what this script does.
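Boiled down, the kind of sequence it exercises looks roughly like this
(a simplified sketch, not the script verbatim; "tank" and "backup" and
the dataset names are placeholders):

```shell
zfs snapshot tank/fs@snap1                     # snapshot the origin fs
zfs clone tank/fs@snap1 tank/clone             # clone it
zfs promote tank/clone                         # swap origin and clone
zfs send -R tank/fs@snap1 | zfs recv -d backup # replicate (needs the patch)
zpool export backup                            # unmount/export -> boom
```

The panic fires on the unmount/export at the end, in zfs_umount() ->
vflush() -> zfs_freebsd_reclaim(), per the backtrace below.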
Here's the backtrace it produces, and some info (show lockedvnods, if
that helps at all):
kernel trap 12 with interrupts disabled
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x288
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8036df39
stack pointer = 0x28:0xffffff803ea6d7e0
frame pointer = 0x28:0xffffff803ea6d840
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 16367 (zpool)
0xffffff000510fce8: tag zfs, type VDIR
    usecount 1, writecount 0, refcount 1 mountedhere 0xffffff0002cd8bc0
    flags ()
    lock type zfs: EXCL by thread 0xffffff0024c5d000 (pid 16367)
0xffffff00050dc3b0: tag zfs, type VDIR
    usecount 0, writecount 0, refcount 1 mountedhere 0
    flags (VI_DOOMED)
    VI_LOCKed lock type zfs: EXCL by thread 0xffffff0024c5d000 (pid 16367)
panic: from debugger
...
... boot, panic, trap etc ...
#10 0xffffffff805d36a7 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:223
#11 0xffffffff8036df39 in _sx_xlock_hard (sx=0xffffff005d11a8e8, tid=18446742974814867456, opts=Variable "opts" is not available.)
    at /usr/src/sys/kern/kern_sx.c:575
#12 0xffffffff8036e8ae in _sx_xlock (sx=Variable "sx" is not available.) at sx.h:155
#13 0xffffffff80b889e5 in zfs_freebsd_reclaim () from /boot/kernel/zfs.ko
#14 0xffffffff8062447a in VOP_RECLAIM_APV (vop=0xffffff005d11a8d0, a=0xffffff005d11a8e8) at vnode_if.c:1926
#15 0xffffffff803f3b0e in vgonel (vp=0xffffff00050dc3b0) at vnode_if.h:830
#16 0xffffffff803f80c8 in vflush (mp=0xffffff0002cd8bc0, rootrefs=0, flags=0, td=0xffffff0024c5d000)
    at /usr/src/sys/kern/vfs_subr.c:2449
#17 0xffffffff80b833d8 in zfs_umount () from /boot/kernel/zfs.ko
#18 0xffffffff803f0d3a in dounmount (mp=0xffffff0002cd8bc0, flags=47025088, td=Variable "td" is not available.)
    at /usr/src/sys/kern/vfs_mount.c:1289
#19 0xffffffff803f1568 in unmount (td=0xffffff0024c5d000, uap=0xffffff803ea6dc00)
    at /usr/src/sys/kern/vfs_mount.c:1174
#20 0xffffffff805ed4cf in syscall (frame=0xffffff803ea6dc90) at /usr/src/sys/amd64/amd64/trap.c:984
#21 0xffffffff805d3930 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:364
#22 0x000000080104e9ac in ?? ()
Previous frame inner to this frame (corrupt stack?)
Regards,
Thomas
PS.
A oneliner to get back to the non-patched state, if you're using SVN
(if not, sorry):
cd /usr/src && \
  svn revert cddl/contrib/opensolaris/lib/libzfs/common/libzfs_sendrecv.c && \
  cd /usr/src/cddl/lib/libzfs && make && make install
DS.
Hope this helps track this down, as I spent quite a while on finding
the root cause (the clones, in one way or another), writing the
script, this mail, etc.