Cyclic permutations of "zpool replace" on raidz devices lead to corrupt data?

Nathaniel W Filardo nwf at cs.jhu.edu
Wed Mar 6 17:17:11 UTC 2013


Greetings freebsd-fs,

I had a zpool that looked like this:

        NAME             STATE     READ WRITE CKSUM
        tank0            ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            ada5         ONLINE       0     0     0
            ada0         ONLINE       0     0     0
            ada3         ONLINE       0     0     0
            ada1         ONLINE       0     0     0
        logs
          mirror-1       ONLINE       0     0     0
            ada2a        ONLINE       0     0     0
            ada4a        ONLINE       0     0     0
        cache
          ada2d          ONLINE       0     0     0
          ada4d          ONLINE       0     0     0

and, in a fit of OCD, I decided to attach a spare disk as ada6 and use it to
reorder the disks so that they were ada{0,1,3,5}.  I had thought it would be
painless to run the following, waiting for each resilver to complete:

zpool replace tank0 ada5 ada6
zpool replace tank0 ada1 ada5
zpool replace tank0 ada0 ada1
zpool replace tank0 ada6 ada0

Nothing funny, just a cyclic permutation.  I realize now that I should have
run a "zpool scrub" between each pass, but I didn't, so, oops.  (The last of
these commands has run to completion, but never removed the replacing-0 node
in the vdev tree; the pool is currently resilvering itself again after the
panic reported later in this mail.)  In any case, while I do not have exact
numbers to report, the following symptoms occurred during this chain of
events.
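
For what it's worth, here is a minimal sketch in Python that only models the
raidz member list (it never touches a pool) to check that the four replaces
above really do end at ada{0,1,3,5}, and to print the sequence I should have
run, with a scrub after each pass.  The replace() and plan() helpers are
hypothetical names of mine, and the model assumes the new device simply takes
the old device's slot; each command would still have to be run by hand, only
after the previous resilver or scrub has finished.

```python
# Model only: a raidz vdev is treated as an ordered list of member names.
# replace() and plan() are hypothetical helpers, not part of any ZFS tool.

def replace(members, old, new):
    """Model 'zpool replace <pool> old new': new takes old's slot."""
    if old not in members:
        raise ValueError(f"{old} is not a member of the vdev")
    if new in members:
        raise ValueError(f"{new} is already a member of the vdev")
    return [new if m == old else m for m in members]


def plan(pool, members, steps):
    """Return (commands, final_members), adding a scrub after each replace."""
    commands = []
    for old, new in steps:
        members = replace(members, old, new)
        commands.append(f"zpool replace {pool} {old} {new}")
        # Assumption: the operator waits for the resilver, then scrubs,
        # then waits for the scrub before issuing the next replace.
        commands.append(f"zpool scrub {pool}")
    return commands, members


# The four replaces from this mail, in the order they were issued.
STEPS = [("ada5", "ada6"), ("ada1", "ada5"), ("ada0", "ada1"), ("ada6", "ada0")]

commands, final = plan("tank0", ["ada5", "ada0", "ada3", "ada1"], STEPS)
print(final)
print("\n".join(commands))
```

At every step the incoming device is one that is not currently a member of
the raidz, so on paper no step should ever have two slots pointing at the same
disk.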

"zpool replace tank0 ada5 ada6" seemed to run without problem.
"zpool replace tank0 ada1 ada5" discovered 170-something checksum errors on ada6.
"zpool replace tank0 ada0 ada1" discovered 35-ish checksum errors on ada5.
"zpool replace tank0 ada6 ada0" discovered 9 checksum errors on ada1 and
  reported 8 checksum errors for the raidz1 vdev, including the corruption
  of a file in my FreeBSD svn mirror.

I then removed the svn mirror, which seemed to go off without a hitch, and
started to rebuild it.  Much later, having decided to wait on rebuilding the
mirror, when shuffling files off of its host filesystem to another (from
tank0/mirrors/freebsd to tank0/mirrors/misc, in preparation for deleting
the former, though this has not been done), I was met with

panic: trap: fast data access mmu miss (kernel)
cpuid = 0
KDB: stack backtrace:
panic() at panic+0x290
trap() at trap+0x554
-- fast data access mmu miss tar=0 %o7=0xc09b8df4 --
userland() at ddt_phys_decref
user trace: trap %o7=0xc09b8df4
pc 0xc0948e00, sp 0xf3a38b21
done
Uptime: 76d13h22m31s
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...

As a wild guess, this seems likely to be
http://mail.opensolaris.org/pipermail/zfs-discuss/2012-February/050972.html
in which a corrupt DDT yields a NULL pointer dereference when a DDT entry is
not found.

My suspicion (and it is just a guess at this point) is that somebody
somewhere in the stack is holding on to the "old" zpool configuration across
replace operations and issuing writes to the incorrect device(s).

A bit about the machine, in case it matters:
  It's a Sun V240 running 9-CURRENT (git rev id 1b82c3b) with 16GB of RAM.
  All the devices in this pool are connected by mvs0, a "Marvell 88SX6081
    SATA controller".
  There has been no prior indication of checksum errors on any of the
    devices, despite routine scrubbing every two weeks for as long as I can
    remember.
  The disks themselves are all WDC WD7500AADS-00L5B1, except for the log and
    cache devices: ada2 is an OCZ-VERTEX2 and ada4 is an OCZ-SOLID3.
  At no point during this (including across the panic reboot) did the disks
    ever lose power.

A friend is helping me to test my hypothesis, but on Illumos (we do not have
easy access to another FBSD machine with sufficient spare disks).  We shall
report our findings.

Thoughts?
Thanks in advance.
--nwf;