raidz2 loses a single disk and becomes difficult to recover
    Alex Trull 
    alextzfs at googlemail.com
       
    Sun Oct 11 15:02:14 UTC 2009
    
    
  
Hi All,
My raidz2 pool broke this morning on RELENG_7 with ZFS v13: the system
failed and came back up without the pool, having lost a disk.
This is how I found the system:
  pool: fatman
 state: FAULTED
status: One or more devices could not be used because the label is missing
    or invalid.  There are insufficient replicas for the pool to continue
    functioning.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scrub: none requested
config:
    NAME        STATE     READ WRITE CKSUM
    fatman      FAULTED      0     0     1  corrupted data
      raidz2    DEGRADED     0     0     6
        da2     FAULTED      0     0     0  corrupted data
        ad4     ONLINE       0     0     0
        ad6     ONLINE       0     0     0
        ad20    ONLINE       0     0     0
        ad22    ONLINE       0     0     0
        ad17    ONLINE       0     0     0
        da2     ONLINE       0     0     0
        ad10    ONLINE       0     0     0
        ad16    ONLINE       0     0     0
Initially it complained that da3 had moved to da2 (da2 had failed and was no
longer seen).
I replaced the original da2 and bumped what was originally da3 back up to da3
using the controller's ordering.
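(As a sanity check after reshuffling, one can confirm which label each device
actually carries by reading the vdev labels directly; this is the generic zdb
invocation, not output from my system:

    zdb -l /dev/da3 | grep -E 'guid|path|txg'
    zdb -l /dev/da2 | grep -E 'guid|path|txg'

A healthy device should print four copies of its label with matching guids.)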
[root at potjie /dev]# zpool status
  pool: fatman
 state: FAULTED
status: One or more devices could not be used because the label is missing
    or invalid.  There are insufficient replicas for the pool to continue
    functioning.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scrub: none requested
config:
    NAME        STATE     READ WRITE CKSUM
    fatman      FAULTED      0     0     1  corrupted data
      raidz2    ONLINE       0     0     6
        da2     UNAVAIL      0     0     0  corrupted data
        ad4     ONLINE       0     0     0
        ad6     ONLINE       0     0     0
        ad20    ONLINE       0     0     0
        ad22    ONLINE       0     0     0
        ad17    ONLINE       0     0     0
        da3     ONLINE       0     0     0
        ad10    ONLINE       0     0     0
        ad16    ONLINE       0     0     0
The issue looks very similar to this one of JMR's:
http://freebsd.monkey.org/freebsd-fs/200902/msg00017.html
I've tried the methods there without much result.
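(Concretely, the attempts were along the lines of:

    zpool import
    zpool import -f fatman

neither of which would bring the pool back past the invalid-label state.)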
Using JMR's patches and debug output to see what is going on, this is what I got:
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46475459 ub_timestamp=1255231780
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46475458 ub_timestamp=1255231750
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46481473 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46481472 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
But JMR's patch still doesn't let me import, even with a decremented txg.
I then had a look around the drives using zdb and some dirty script:
[root at potjie /dev]# ls /dev/ad* /dev/da2 /dev/da3 | awk '{print "echo "$1"; zdb -l "$1" | grep txg"}' | sh
/dev/ad10
    txg=46488654
    txg=46488654
    txg=46488654
    txg=46488654
/dev/ad16
    txg=46408223 <- old txg?
    txg=46408223
    txg=46408223
    txg=46408223
/dev/ad17
    txg=46408223 <- old txg?
    txg=46408223
    txg=46408223
    txg=46408223
/dev/ad18 (ssd)
/dev/ad19 (spare drive, removed from pool some time ago)
    txg=0
    create_txg=0
    txg=0
    create_txg=0
    txg=0
    create_txg=0
    txg=0
    create_txg=0
/dev/ad20
    txg=46488654
    txg=46488654
    txg=46488654
    txg=46488654
/dev/ad22
    txg=46488654
    txg=46488654
    txg=46488654
    txg=46488654
/dev/ad4
    txg=46488654
    txg=46488654
    txg=46488654
    txg=46488654
/dev/ad6
    txg=46488654
    txg=46488654
    txg=46488654
    txg=46488654
/dev/da2 <- new drive that replaced the broken da2 (no labels printed)
/dev/da3
    txg=46488654
    txg=46488654
    txg=46488654
    txg=46488654
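(Incidentally, the dirty one-liner above is just this loop, which may be
easier to rerun per device:

    for d in /dev/ad* /dev/da2 /dev/da3; do
        echo "$d"
        zdb -l "$d" | grep txg
    done
)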
I did not see any checksum errors or other issues on ad16 and ad17
previously, and I do check regularly.
Any thoughts on what to try next?
Regards,
Alex
    
    