kern/125644: zfs unfixable fs errors caused panic when trying to
destroy filesystem
Mike Andrews
mandrews at fark.com
Tue Jul 15 17:00:11 UTC 2008
>Number: 125644
>Category: kern
>Synopsis: zfs unfixable fs errors caused panic when trying to destroy filesystem
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Tue Jul 15 17:00:10 UTC 2008
>Closed-Date:
>Last-Modified:
>Originator: Mike Andrews
>Release: FreeBSD 7.0-STABLE amd64
>Organization:
Fark, Inc
>Environment:
System: FreeBSD whiskey.fark.com 7.0-STABLE FreeBSD 7.0-STABLE #21: Thu Jul 3 16:13:09 EDT 2008 mandrews at vodka.int.fark.com:/usr/obj/usr/src/sys/FARK64 amd64
Supermicro PDSMi+, Core 2 Quad Q6600, 6 GB memory
Two ST3250820AS/3.AAE connected to onboard ICH7 in AHCI mode
>Description:
The root filesystem is a 4 GB gmirror of ad4s1a+ad6s1a. Each drive also has a
4 GB swap partition, and the remaining space, ad4s1d+ad6s1d, is a mirrored zpool.
Last Friday, these messages appeared:
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: zpool I/O failure, zpool=whiskey error=86
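For anyone reading along: the sixteen mismatch lines above involve only two
offsets (4194304 and 85723949056), each reported four times on each device.
A rough Python sketch of how such a tally could be made is below. It is not
part of the original report; the regular expression and the tally() helper
are illustrative assumptions, and only a few of the log lines are embedded
as sample input.

```python
import re
from collections import Counter

# A few lines copied from the log above (not the full set).
LOG = """\
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad4s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=85723949056 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: checksum mismatch, zpool=whiskey path=/dev/ad6s1d offset=4194304 size=50688
Jul 11 10:49:15 <user.warn> whiskey root: ZFS: zpool I/O failure, zpool=whiskey error=86
"""

# Hypothetical pattern matching the message layout seen in the log above.
PATTERN = re.compile(
    r"checksum mismatch, zpool=(\S+) path=(\S+) offset=(\d+) size=(\d+)"
)

def tally(text):
    """Count checksum-mismatch lines per (device path, offset)."""
    counts = Counter()
    for line in text.splitlines():
        m = PATTERN.search(line)
        if m:
            counts[(m.group(2), int(m.group(3)))] += 1
    return counts

counts = tally(LOG)
for (path, offset), n in sorted(counts.items()):
    print(path, offset, n)
```

The final "zpool I/O failure" line does not match the pattern and is skipped
on purpose, so only the per-device mismatches are counted.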
I did a scrub on the zpool to see if ZFS could correct the errors, and it
said it could not. However, only one file was damaged, and it was in an
old snapshot I didn't care about:
whiskey# zpool status -v
pool: whiskey
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: scrub completed with 1 errors on Fri Jul 11 10:56:41 2008
config:
        NAME        STATE     READ WRITE CKSUM
        whiskey     ONLINE       0     0     4
          mirror    ONLINE       0     0     4
            ad4s1d  ONLINE       0     0     8
            ad6s1d  ONLINE       0     0     8
errors: Permanent errors have been detected in the following files:
whiskey/home@monthly.1:<filename snipped; it was a small unneeded jpeg>
Here's the issue: attempting to destroy that snapshot resulted in a panic:
whiskey# zfs destroy whiskey/home@monthly.1
panic: solaris assert: end <= sm->sm_start + sm->sm_size (0x14454c7000 <= 0x1400000000), file: /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, line: 93
kernel trap 12 with interrupts disabled
cpuid = 1
KDB: enter: panic
[thread pid 208 tid 100152 ]
Stopped at kdb_enter_why+0x3d: movq $0,0x40ba01(%rip)
db> bt
Tracing pid 208 tid 100152 td 0xffffff00025ae9f0
kdb_enter_why() at kdb_enter_why+0x3d
panic() at panic+0x16c
space_map_add() at space_map_add+0x227
metaslab_free_dva() at metaslab_free_dva+0xfe
metaslab_free() at metaslab_free+0x6e
zio_dva_free() at zio_dva_free+0x20
arc_free() at arc_free+0x10a
dsl_dataset_destroy_sync() at dsl_dataset_destroy_sync+0x2df
dsl_sync_task_group_sync() at dsl_sync_task_group_sync+0x13e
dsl_pool_sync() at dsl_pool_sync+0xc3
spa_sync() at spa_sync+0x38a
txg_sync_thread() at txg_sync_thread+0x129
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffffffc77c9d30, rbp = 0 ---
db>
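The two hex values in the assertion can be checked by hand. The Python
sketch below is not part of the original report; the variable names are
just labels for the numbers printed in the panic message, and reading
"end" as the end of the extent being freed is an assumption based on the
backtrace (metaslab_free_dva calling space_map_add).

```python
# Values copied from the panic message:
#   end <= sm->sm_start + sm->sm_size (0x14454c7000 <= 0x1400000000)
end = 0x14454C7000       # left-hand side of the failed assertion
sm_limit = 0x1400000000  # right-hand side: sm->sm_start + sm->sm_size

# How far past the upper bound of the space map the value lies.
overshoot = end - sm_limit

print(hex(overshoot))                # 0x454c7000
print(round(overshoot / 2**30, 2))   # 1.08 (GiB)
print(sm_limit // 2**30)             # 80 (GiB)
```

In other words, the asserted bound sits at exactly 80 GiB, and the value
that tripped the assertion is a little over 1 GiB beyond it.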
Unfortunately, I had to get the machine back up and running quickly, so I
did not dd the corrupted disk to an image file for further analysis... I
just backed up all the files, wiped the zpool and recreated it, and restored,
and have been running fine since. So I'm not able to do any further
troubleshooting than this. I'm mostly filing this as an FYI/heads-up to
what may (not?) have been a one-off quirk.
I guess for me the curiosities are: how did the corruption happen (errno.h
says error 86 is "illegal byte sequence"...) in a way that affected both disks,
and why did zfs panic over it instead of allowing the bad data to be deleted?
This system has hw.ata.wc = 1, which is known to be dangerous in a UFS2 situation
(this is safe for ZFS, though, right? Uh... right?) :) However, I'm pretty
sure the machine has not lost power abruptly in a very long time, so I don't
think that was an issue.
>How-To-Repeat:
No idea, unfortunately, unless scribbling a small chunk of /dev/random onto
the middle of a zpool would do it :)
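A minimal sketch of that "scribble" idea follows, in Python. It is untested
as a reproduction of this panic and is not from the original report. The
scribble() helper is hypothetical, and it is demonstrated here only on a
throwaway scratch file. Because it sizes the target with os.path.getsize(),
it assumes a regular file (for example a file-backed test vdev), not a raw
disk device. It overwrites data, so it should never be pointed at anything
that matters.

```python
import os
import tempfile

def scribble(path, length=4096):
    """Overwrite `length` bytes in the middle of `path` with random data.

    Returns the (offset, length) actually written.
    """
    size = os.path.getsize(path)          # regular files only
    offset = max(0, size // 2 - length // 2)
    length = min(length, size - offset)   # never grow the file
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(os.urandom(length))
    return offset, length

# Demonstration on a 1 MiB scratch file of zeros.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(1 << 20))
    scratch = tmp.name

offset, length = scribble(scratch)
size_after = os.path.getsize(scratch)
print(offset, length, size_after)
os.remove(scratch)
```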
>Fix:
Backup, destroy, recreate, restore the zpool.
>Release-Note:
>Audit-Trail:
>Unformatted: