Raidz2 pool with single disk failure is faulted

Javier Martín Rueda jmrueda at diatel.upm.es
Mon Feb 2 17:33:11 PST 2009


On a FreeBSD 7.1-PRERELEASE amd64 system I had a raidz2 pool made up of
8 disks. Due to some things I tried in the past, the pool looked like
this:

        z1              ONLINE
          raidz2        ONLINE
            mirror/gm0  ONLINE
            mirror/gm1  ONLINE
            da2         ONLINE
            da3         ONLINE
            da4         ONLINE
            da5         ONLINE
            da6         ONLINE
            da7         ONLINE

da2 to da7 were originally mirror/gm2 to mirror/gm7, but I replaced
them little by little, eliminating the corresponding gmirrors at the
same time. I don't think this is relevant to what I'm going to explain,
but I mention it just in case...
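
Each of those replacements was done roughly like this (reconstructed
from memory, so treat the exact device pair as illustrative):

# zpool replace z1 mirror/gm2 da2

Once the resilver finished, I tore down the corresponding gmirror and
moved on to the next disk.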

One day, after a system reboot, one of the disks (da4) was dead and
FreeBSD renamed all of the disks that used to come after it (da5 became
da4, da6 became da5, and da7 became da6). The pool was unavailable (da4
to da6 marked as corrupt and da7 as unavailable), I suppose because ZFS
couldn't match the contents of the last 3 disks to their new names. I
was able to fix this by inserting a blank new disk and rebooting: the
disk names were correct again, and the pool showed up as degraded (da4
unavailable) but usable. I resilvered the pool and everything was back
to normal.
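
If I remember correctly, the resilver onto the blank disk was just a
matter of (device name assumed, since the new disk took the slot that
had become da4):

# zpool replace z1 da4

and the pool went back to ONLINE once zpool status reported the
resilver complete.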

Yesterday, another disk died after a system reboot and the pool was
unavailable again because of the automatic renaming of the SCSI disks.
However, this time I didn't replace it with a blank disk, but with an
identical disk that I had previously used in a different ZFS pool on a
different computer; that pool had the same name (z1) and the same
characteristics (raidz2, 8 disks). The disk hadn't been erased and its
old pool hadn't been destroyed, so it still held whatever ZFS had
stored on it.
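
In hindsight, I should probably have checked the disk for leftover
labels before plugging it in, with something like (device name
assumed):

# zdb -l /dev/da4

which I expect would have shown the old pool's name and guid still
present in the labels.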

After rebooting, it seems ZFS got confused when it found two different
active pools with the same name, and it faulted the pool. I stopped
ZFS and wiped the beginning and end of the disk with zeroes, but the
problem persisted. Finally, I tried to export and import the pool, as I
had read somewhere that might help, but zpool import complains about an
I/O error (which I imagine is fictitious, because all of the disks are
fine; I can read from them with dd without any problem).
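
For reference, the wipe and the export/import attempt were
approximately this (a reconstruction using /bin/sh syntax; the block
counts and device name are from memory):

# dd if=/dev/zero of=/dev/da4 bs=1m count=1
# dd if=/dev/zero of=/dev/da4 bs=1m \
      seek=$(( $(diskinfo da4 | awk '{print int($3/1048576)}') - 1 ))
# zpool export z1
# zpool import z1

The idea of the two dd runs was to destroy the two ZFS labels at the
start of the disk and the two at the end.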

The current situation is this:

# zpool import
  pool: z1
    id: 8828203687312199578
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        z1              FAULTED   corrupted data
          raidz2        ONLINE
            mirror/gm0  ONLINE
            mirror/gm1  ONLINE
            da2         ONLINE
            da3         ONLINE
            da4         UNAVAIL   corrupted data
            da5         ONLINE
            da6         ONLINE
            da7         ONLINE
# zpool import -f z1
cannot import 'z1': I/O error

By the way, before exporting the pool, the CKSUM column in "zpool 
status" showed 6 errors. However, zpool status -v didn't give any 
additional information.

How come the pool is faulted if it is raidz2 and 7 out of 8 disks are
reported as fine? Any idea how to recover the pool? The data has to be
in there, as I haven't done any other destructive operation as far as I
can tell, and I imagine it comes down to some stupid little detail.

I have dumped all of the labels on the 8 disks with zdb -l, and I
don't see anything peculiar. The labels look fine on the 7 online
disks, and there are none at all on da4.
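
In case it helps, this is the kind of check I ran on each device
(/dev/mirror/gm0, /dev/mirror/gm1, /dev/da2 ... /dev/da7); the grep
just pulls out the fields I compared:

# zdb -l /dev/da2 | grep -E 'name|guid|txg|state'

The name, pool_guid and txg values agree across the 7 online disks.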

Is there some kind of diagnostic tool similar to dumpfs, but for ZFS?

I can provide additional information if needed.


