raidz vdev marked faulted with only one faulted disk

Andrew Hill lists at thefrog.net
Sun Jun 15 17:23:42 UTC 2008


I'm a bit lost as to exactly what's gone wrong here - it seems like it may
be a bug in ZFS, but it's also entirely possible I'm assuming something I
shouldn't, or just not using the ZFS tools properly (I rather hope it's the
latter...)

background:
I had a system running with 4 zpools. The two relevant to this issue are the
raidz volumes:
- 1x zpool (tank), consisting of a raidz vdev of 7x 250 GB slices (each slice
on a separate disk)
- 1x zpool (tank2), consisting of a raidz vdev of 3x 70 GB slices (again on
separate disks from each other, but these are slices on the same disks as the
other raidz vdev)
(This is a cheap home system built out of parts lying around, basically
intended to get a lot of storage space out of a bunch of disks with little
concern for performance, so no need to point out those problems.) A rough
sketch of how the pools were created follows.
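
For reference, the pools were created along these lines. The tank2 slice
names match the 'zpool import' output further down; for tank, only ad12s1d is
known for certain, so the other six slice names here are placeholders rather
than the exact devices in my system:

  # tank2: raidz of three 70 GB slices
  zpool create tank2 raidz ad8s1e ad10s1e ad12s1e
  # tank: raidz of seven 250 GB slices (only ad12s1d is the real name,
  # the rest are placeholders)
  zpool create tank raidz ad4s1d ad6s1d ad8s1d ad10s1d ad12s1d ad14s1d ad16s1d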
The system was originally installed on a UFS partition and then migrated onto
a raidz zpool, so I was still using the kernel and /boot from the UFS drive,
but the system root was on raidz. Apart from the well-known deadlocks and
panics here and there, it generally worked well enough (uptimes of a week or
so if I wasn't actively trying to trigger a deadlock/panic).
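
For context, booting the UFS kernel with root on ZFS was set up in the usual
7.x way, i.e. roughly the following in /boot/loader.conf on the UFS drive
(the dataset name here just stands in for whichever filesystem actually held
root; my exact settings may have differed slightly):

  # /boot/loader.conf on the UFS boot drive
  zfs_load="YES"                    # load zfs.ko at boot
  vfs.root.mountfrom="zfs:tank"     # mount the root filesystem from the pool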

problem:
A couple of weeks ago the system completely stopped being able to mount root
from ZFS, so I booted back into the old UFS partition (which still had
whatever world was originally installed from my 7.0-RELEASE amd64 CD, but
with an up-to-date -STABLE kernel) and discovered that one of my disks (ad12)
was now FAULTED. This is one of the disks that affects both raidz vdevs
mentioned above (i.e. it has a 250 GB slice in tank and a 70 GB slice in
tank2), so both raidz vdevs were effectively missing one device, but both
should be able to handle this type of failure... right?

I've not yet looked too far into the cause of the failure, though my guess is
that it relates to the Silicon Image SiI3114 controller the disk was attached
to (mainly due to the reputation those controllers have). For now, though,
I'm trying to figure out the other major issue...

From 'zpool import' I can see that this disk (ad12) is marked "FAULTED
corrupted data" in the list of 7 drives in tank (i.e. ad12s1d) and in the
list of 3 drives in tank2 (ad12s1e). In both zpools, the raidz vdev and the
whole zpool are then also marked "FAULTED corrupted data", despite only one
disk in the raidz being FAULTED. My understanding is that the pool should be
DEGRADED in that case... right?

Example output showing tank2:
gutter# zpool import
  pool: tank2
    id: 8036862119610852708
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
    The pool may be active on another system, but can be imported using
    the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

    tank2        FAULTED   corrupted data
      raidz1     FAULTED   corrupted data
        ad8s1e   ONLINE
        ad10s1e  ONLINE
        ad12s1e  FAULTED   corrupted data

tank2 is/was a zpool created as a single raidz vdev of 3 slices, as shown
above, so the failure/loss of one disk shouldn't be causing it to be marked
FAULTED. tank is the same but with 7 drives (6 ONLINE, 1 FAULTED)... Since
the zpools were never exported (I'm unable to mount the root filesystem on
the zpool in order to export either of them), they naturally show the status
above about being last accessed by another system, so attempting to override
that (zpool import -f tank or tank2) gives the following messages on the
console:

ZFS: vdev failure, zpool=tank2 type=vdev.no_replicas
ZFS: failed to load zpool tank2
cannot import 'tank2': permission denied

When I first booted into the old UFS drive the zpools were created from, they
showed up in 'zpool list' on that system (since they'd never been exported
after I set it up to use one as the root fs), and 'zpool status' told me to
see http://www.sun.com/msg/ZFS-8000-5E, which is about pools that have a
faulted device and no redundancy - *very* odd to see for a raidz vdev.

I've also tried completely removing the faulted disk, with no better result.
Removing two drives causes the pool to show up as UNAVAIL (as expected), or
triggers a "panic: dangling dbufs" when I try 'zpool import', though I
suspect the panic might be memory related (I've also been trying all of this
on a second motherboard, which I can only supply with 512 MB of RAM).

I've tried various combinations of:
- hardware - two different motherboards (with different CPU and RAM; the only
thing common to all setups is a new SATA controller, a Promise PDC20376,
which replaces the Silicon Image controller so that I can fit all 7 drives
into the system)
- software - a fresh FreeBSD install on a new hard drive (from a 7.0 i386 CD
I downloaded about 3 months ago, and then again after updating to the latest
-STABLE source), as well as the system I mentioned earlier on my boot/kernel
drive, which had the latest amd64 kernel built on my ZFS system but hadn't
had the userland updated since the 7.0 amd64 install

When I first set up the system I tested the behaviour of removing a drive
from a raidz vdev, and definitely saw it enter the DEGRADED state. I did not
try exporting and re-importing in that state, but according to the Sun ZFS
documentation that should be possible (I realise that doesn't mean it's in
the BSD port, but I've not found anything to confirm whether this
specifically is or isn't possible).
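
The kind of test I mean can also be reproduced with the ZFS tools rather than
by physically pulling a drive; something along these lines (the device name
is purely illustrative):

  # take one member of the raidz offline, then check the pool state
  zpool offline tank2 ad12s1e
  zpool status tank2
  # the raidz1 vdev and the pool show DEGRADED here, not FAULTED
  zpool online tank2 ad12s1e
  # the device resilvers and the pool returns to ONLINE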

So my question is: is this a bug in ZFS that causes the raidz to be faulted
when one device is faulted/corrupted (it would have to be under specific
conditions, since raidz vdevs can definitely move between DEGRADED and ONLINE
states just fine in general), or am I misusing the ZFS utilities or making
invalid assumptions? E.g. is there some other method of importing, or perhaps
of scrubbing/resilvering prior to importing, that I'm missing?
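
For reference, the sorts of variations I have in mind are below. I haven't
confirmed that any of them behave sensibly against a pool in this state, so
treat them as guesses rather than things I know should work (ad14s1e is just
a made-up spare slice name):

  # point the import at an explicit device directory, in case device
  # discovery is part of the problem
  zpool import -d /dev -f tank2
  # import under an alternate root so nothing tries to mount over /
  zpool import -f -R /mnt/tank2 tank2
  # if an import ever succeeds, replacing/scrubbing can only happen afterwards
  zpool replace tank2 ad12s1e ad14s1e
  zpool scrub tank2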

Andrew

