raidz2 loses a single disk and becomes difficult to recover
Alex Trull
alextzfs at googlemail.com
Sun Oct 11 16:27:22 UTC 2009
Well, after trying a lot of things (zpool import with and without the cache file in
place, etc.), it somehow managed to mount the pool, at least partially, with errors -
zfs list and dmesg output below.
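The import attempts were roughly along these lines (from memory - the exact
invocations varied; /boot/zfs/zpool.cache is the stock FreeBSD location):

# see what pools the devices advertise, then force the import
zpool import
zpool import -f fatman

# and again with the cached pool configuration moved out of the way
mv /boot/zfs/zpool.cache /boot/zfs/zpool.cache.bak
zpool import -f fatman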
zfs list output:
cannot iterate filesystems: I/O error
NAME                      USED  AVAIL  REFER  MOUNTPOINT
fatman                   1.40T  1.70T  51.2K  /fatman
fatman/backup             100G  99.5G  95.5G  /fatman/backup
fatman/jail               422G  1.70T  60.5K  /fatman/jail
fatman/jail/havnor        198G  51.7G   112G  /fatman/jail/havnor
fatman/jail/mail         19.4G  30.6G  13.0G  /fatman/jail/mail
fatman/jail/syndicate    16.6G   103G  10.5G  /fatman/jail/syndicate
fatman/jail/thirdforces   159G  41.4G  78.1G  /fatman/jail/thirdforces
fatman/jail/web          24.8G  25.2G  22.3G  /fatman/jail/web
fatman/stash              913G  1.70T   913G  /fatman/stash
(end of the dmesg output)
JMR: vdev_uberblock_load_done ubbest ub_txg=46475461 ub_timestamp=1255231841
JMR: vdev_uberblock_load_done ub_txg=46481476 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46481476 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46475459 ub_timestamp=1255231780
JMR: vdev_uberblock_load_done ubbest ub_txg=46475458 ub_timestamp=1255231750
JMR: vdev_uberblock_load_done ub_txg=46481473 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46481473 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46481472 ub_timestamp=1255234263
Solaris: WARNING: can't open objset for fatman/jail/margaret
Solaris: WARNING: can't open objset for fatman/jail/margaret
Solaris: WARNING: ZFS replay transaction error 86, dataset
fatman/jail/havnor, seq 0x25442, txtype 9
Solaris: WARNING: ZFS replay transaction error 86, dataset fatman/jail/mail,
seq 0x1e200, txtype 9
Solaris: WARNING: ZFS replay transaction error 86, dataset
fatman/jail/thirdforces, seq 0x732e3, txtype 9
[root@potjie /fatman/jail/mail]# zpool status -v
pool: fatman
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: resilver in progress for 0h4m, 0.83% done, 8h21m to go
config:
        NAME           STATE     READ WRITE CKSUM
        fatman         DEGRADED     0     0    34
          raidz2       DEGRADED     0     0   384
            replacing  DEGRADED     0     0     0
              da2/old  REMOVED      0    24     0
              da2      ONLINE       0     0     0  1.71G resilvered
            ad4        ONLINE       0     0     0  21.3M resilvered
            ad6        ONLINE       0     0     0  21.4M resilvered
            ad20       ONLINE       0     0     0  21.3M resilvered
            ad22       ONLINE       0     0     0  21.3M resilvered
            ad17       ONLINE       0     0     0  21.3M resilvered
            da3        ONLINE       0     0     0  21.3M resilvered
            ad10       ONLINE       0     0     1  21.4M resilvered
            ad16       ONLINE       0     0     0  21.2M resilvered
        cache
          ad18         ONLINE       0     0     0
errors: Permanent errors have been detected in the following files:
fatman/jail/margaret:<0x0>
fatman/jail/syndicate:<0x0>
fatman/jail/mail:<0x0>
/fatman/jail/mail/tmp
fatman/jail/havnor:<0x0>
fatman/jail/thirdforces:<0x0>
fatman/backup:<0x0>
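(The resilver above is the new da2 being written back into the raidz2 - the
replacement itself was just the usual sequence, roughly along these lines:)

zpool replace fatman da2   # rebuild onto the new disk sitting at the old da2 device node
zpool status -v fatman     # keep an eye on resilver progress and the error list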
jail/margaret & backup aren't showing up in zfs list.
jail/syndicate is showing up but isn't viewable.
The latest content on the better-looking zfs filesystems seems to be quite
recent.
Any thoughts about what is going on?
I have snapshots for africa on these zfs filesystems - any suggestions on
trying to get them back?
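For the record, these are the commands I'd expect to reach for once (if) those
datasets come back readable - nothing exotic, just the standard snapshot tooling;
the snapshot and target pool names below are only placeholders:

zfs list -t snapshot -r fatman                    # enumerate every snapshot in the pool
zfs rollback fatman/jail/mail@somesnap            # roll a dataset back to a snapshot
zfs send fatman/jail/mail@somesnap | zfs receive backuppool/mail   # or copy it off to another pool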
--
Alex
2009/10/11 Alex Trull <alextzfs at googlemail.com>
> Hi All,
>
> My raidz2 has broken this morning on RELENG_7 (ZFS v13).
>
> The system failed this morning and came back without the pool, having lost a disk.
>
> This is how I found the system:
>
> pool: fatman
> state: FAULTED
> status: One or more devices could not be used because the label is missing
> or invalid. There are insufficient replicas for the pool to continue
> functioning.
> action: Destroy and re-create the pool from a backup source.
> see: http://www.sun.com/msg/ZFS-8000-5E
> scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         fatman      FAULTED      0     0     1  corrupted data
>           raidz2    DEGRADED     0     0     6
>             da2     FAULTED      0     0     0  corrupted data
>             ad4     ONLINE       0     0     0
>             ad6     ONLINE       0     0     0
>             ad20    ONLINE       0     0     0
>             ad22    ONLINE       0     0     0
>             ad17    ONLINE       0     0     0
>             da2     ONLINE       0     0     0
>             ad10    ONLINE       0     0     0
>             ad16    ONLINE       0     0     0
>
> Initially it complained that da3 had moved to da2 (da2 had failed and was no
> longer seen).
>
> I replaced the original da2 and bumped what was originally da3 back up to
> da3 using the controller's ordering.
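>
> (To double-check which pool member each device node actually was after the
> renumbering, the vdev label can be read directly, e.g.:)
>
> zdb -l /dev/da3 | grep -E 'guid|path'   # the label carries the vdev guid and its last-known path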
>
> [root@potjie /dev]# zpool status
> pool: fatman
> state: FAULTED
> status: One or more devices could not be used because the label is missing
> or invalid. There are insufficient replicas for the pool to continue
> functioning.
> action: Destroy and re-create the pool from a backup source.
> see: http://www.sun.com/msg/ZFS-8000-5E
> scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         fatman      FAULTED      0     0     1  corrupted data
>           raidz2    ONLINE       0     0     6
>             da2     UNAVAIL      0     0     0  corrupted data
>             ad4     ONLINE       0     0     0
>             ad6     ONLINE       0     0     0
>             ad20    ONLINE       0     0     0
>             ad22    ONLINE       0     0     0
>             ad17    ONLINE       0     0     0
>             da3     ONLINE       0     0     0
>             ad10    ONLINE       0     0     0
>             ad16    ONLINE       0     0     0
>
> The issue looks very similar to this one (JMR's issue):
> http://freebsd.monkey.org/freebsd-fs/200902/msg00017.html
>
> I've tried the methods there without much result.
>
> Using JMR's patches/debugs to see what is going on, this is what I got:
>
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653
> ub_timestamp=1255246834
> JMR: vdev_uberblock_load_done ub_txg=46475459 ub_timestamp=1255231780
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653
> ub_timestamp=1255246834
> JMR: vdev_uberblock_load_done ub_txg=46475458 ub_timestamp=1255231750
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653
> ub_timestamp=1255246834
> JMR: vdev_uberblock_load_done ub_txg=46481473 ub_timestamp=1255234263
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653
> ub_timestamp=1255246834
> JMR: vdev_uberblock_load_done ub_txg=46481472 ub_timestamp=1255234263
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653
> ub_timestamp=1255246834
>
> But JMR's patch still doesn't let me import, even with a decremented txg.
>
> I then had a look around the drives using zdb and a quick-and-dirty script:
>
> [root@potjie /dev]# ls /dev/ad* /dev/da2 /dev/da3 | awk '{print "echo "$1";zdb -l "$1" |grep txg"}' | sh
> /dev/ad10
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/ad16
> txg=46408223 <- old txg?
> txg=46408223
> txg=46408223
> txg=46408223
> /dev/ad17
> txg=46408223 <- old txg?
> txg=46408223
> txg=46408223
> txg=46408223
> /dev/ad18 (ssd)
> /dev/ad19 (spare drive, removed from pool some time ago)
> txg=0
> create_txg=0
> txg=0
> create_txg=0
> txg=0
> create_txg=0
> txg=0
> create_txg=0
> /dev/ad20
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/ad22
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/ad4
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/ad6
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/da2 <- new drive that replaced the broken da2 (no labels yet, so no txg output)
> /dev/da3
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
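>
> (The same check written out as a plain loop, for readability:)
>
> for d in /dev/ad* /dev/da2 /dev/da3; do
>     echo "$d"
>     zdb -l "$d" | grep txg   # pull the txg fields out of each device's four vdev labels
> done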
>
> I did not see any checksum errors or other issues on ad16 and ad17 previously,
> and I do check regularly.
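>
> (The regular checking is nothing exotic - roughly a periodic scrub plus a status
> check along these lines; a sketch of the idea, not my exact setup:)
>
> zpool scrub fatman    # kicked off periodically, e.g. from cron
> zpool status -x       # reports "all pools are healthy" unless something needs attention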
>
> Any thoughts on what to try next?
>
> Regards,
>
> Alex
>
>