Raidz2 pool with single disk failure is faulted
Javier Martín Rueda
jmrueda at diatel.upm.es
Tue Feb 3 05:41:28 PST 2009
Wesley Morgan wrote:
> On Tue, 3 Feb 2009, Javier Martín Rueda wrote:
>
>> I solved the problem. This is how I did it, in case one day it
>> prevents somebody from jumping in front of a train :-)
>>
>> First of all, I got some insight from various sites, mailing list
>> archives, documents, etc. Among them, maybe these two were more helpful:
>>
>> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051643.html
>>
>> http://opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf
>>
>> I suspected that maybe my uberblock was somehow corrupted, and
>> thought it would be worthwhile to roll back to an earlier uberblock.
>> However, my pool was raidz2 and the examples I had seen of how to
>> do this were for simple pools, so I tried a different approach,
>> which in the end proved very successful:
>>
>> First, I added a couple of printf() calls to vdev_uberblock_load_done(),
>> which is in
>> /sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c:
>>
>> --- vdev_label.c.orig	2009-02-03 13:14:35.000000000 +0100
>> +++ vdev_label.c	2009-02-03 13:14:52.000000000 +0100
>> @@ -659,10 +659,12 @@
>>
>>  	if (zio->io_error == 0 && uberblock_verify(ub) == 0) {
>>  		mutex_enter(&spa->spa_uberblock_lock);
>> +		printf("JMR: vdev_uberblock_load_done ub_txg=%qd ub_timestamp=%qd\n", ub->ub_txg, ub->ub_timestamp);
>>  		if (vdev_uberblock_compare(ub, ubbest) > 0)
>>  			*ubbest = *ub;
>>  		mutex_exit(&spa->spa_uberblock_lock);
>>  	}
>> +	printf("JMR: vdev_uberblock_load_done ubbest ub_txg=%qd ub_timestamp=%qd\n", ubbest->ub_txg, ubbest->ub_timestamp);
>>
>>  	zio_buf_free(zio->io_data, zio->io_size);
>>  }
>>
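>> For reference, the two fields being printed live in the on-disk
>> uberblock structure (see the on-disk format document linked above);
>> roughly:
>>
>> typedef struct uberblock {
>> 	uint64_t	ub_magic;	/* UBERBLOCK_MAGIC (0x00bab10c) */
>> 	uint64_t	ub_version;	/* on-disk/pool version */
>> 	uint64_t	ub_txg;		/* txg of last sync */
>> 	uint64_t	ub_guid_sum;	/* sum of all leaf vdev guids */
>> 	uint64_t	ub_timestamp;	/* time of last sync, seconds since the epoch */
>> 	blkptr_t	ub_rootbp;	/* block pointer to the MOS */
>> } uberblock_t;
>>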
>> After compiling and loading the zfs.ko module, I executed "zpool
>> import" and these messages came up:
>>
>> ...
>> JMR: vdev_uberblock_load_done ub_txg=4254783 ub_timestamp=1233545538
>> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
>> JMR: vdev_uberblock_load_done ub_txg=4254782 ub_timestamp=1233545533
>> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
>> JMR: vdev_uberblock_load_done ub_txg=4254781 ub_timestamp=1233545528
>> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
>> JMR: vdev_uberblock_load_done ub_txg=4254780 ub_timestamp=1233545523
>> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
>> JMR: vdev_uberblock_load_done ub_txg=4254779 ub_timestamp=1233545518
>> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
>> JMR: vdev_uberblock_load_done ub_txg=4254778 ub_timestamp=1233545513
>> ...
>> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
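>>
>> Since ub_timestamp is just seconds since the Unix epoch, it is easy to
>> turn those values into wall-clock time to cross-check against the
>> system logs. A trivial standalone helper (not part of the patch) could
>> look like this:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <time.h>
>>
>> int
>> main(int argc, char **argv)
>> {
>> 	time_t t;
>>
>> 	if (argc != 2)
>> 		return (1);
>> 	t = (time_t)strtoull(argv[1], NULL, 10);
>> 	printf("%s", ctime(&t));	/* e.g. "./ub2date 1233545538" */
>> 	return (0);
>> }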
>>
>> So, the uberblock with transaction group 4254783 was the most recent.
>> I convinced ZFS to use an earlier one with this patch (note the
>> second expression I added to the if statement):
>>
>> --- vdev_label.c.orig	2009-02-03 13:14:35.000000000 +0100
>> +++ vdev_label.c	2009-02-03 13:25:43.000000000 +0100
>> @@ -659,10 +659,12 @@
>>
>>  	if (zio->io_error == 0 && uberblock_verify(ub) == 0) {
>>  		mutex_enter(&spa->spa_uberblock_lock);
>> -		if (vdev_uberblock_compare(ub, ubbest) > 0)
>> +		printf("JMR: vdev_uberblock_load_done ub_txg=%qd ub_timestamp=%qd\n", ub->ub_txg, ub->ub_timestamp);
>> +		if (vdev_uberblock_compare(ub, ubbest) > 0 && ub->ub_txg < 4254783)
>>  			*ubbest = *ub;
>>  		mutex_exit(&spa->spa_uberblock_lock);
>>  	}
>> +	printf("JMR: vdev_uberblock_load_done ubbest ub_txg=%qd ub_timestamp=%qd\n", ubbest->ub_txg, ubbest->ub_timestamp);
>>
>>  	zio_buf_free(zio->io_data, zio->io_size);
>>  }
>>
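>> The reason the extra clause forces a rollback: vdev_uberblock_compare()
>> essentially orders uberblocks by ub_txg first and ub_timestamp second,
>> and ubbest keeps the "largest" one seen so far, so capping ub_txg simply
>> makes an older uberblock win the comparison. Roughly:
>>
>> int
>> vdev_uberblock_compare(uberblock_t *ub1, uberblock_t *ub2)
>> {
>> 	if (ub1->ub_txg < ub2->ub_txg)
>> 		return (-1);
>> 	if (ub1->ub_txg > ub2->ub_txg)
>> 		return (1);
>>
>> 	if (ub1->ub_timestamp < ub2->ub_timestamp)
>> 		return (-1);
>> 	if (ub1->ub_timestamp > ub2->ub_timestamp)
>> 		return (1);
>>
>> 	return (0);
>> }
>>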
>> After compiling and loading the zfs.ko module, I executed "zpool
>> import" and the pool was still faulted. So, I lowered the txg limit
>> to "< 4254782", and this time the pool came up as ONLINE. After
>> crossing my fingers I executed "zpool import z1", and it worked fine.
>> No data loss, everything back to normal. The only curious thing I've
>> noticed is this:
>>
>> # zpool status
>>   pool: z1
>>  state: ONLINE
>> status: One or more devices could not be used because the label is
>>         missing or invalid.  Sufficient replicas exist for the pool to
>>         continue functioning in a degraded state.
>> action: Replace the device using 'zpool replace'.
>>    see: http://www.sun.com/msg/ZFS-8000-4J
>>  scrub: resilver completed with 0 errors on Tue Feb 3 09:26:40 2009
>> config:
>>
>>         NAME                     STATE     READ WRITE CKSUM
>>         z1                       ONLINE       0     0     0
>>           raidz2                 ONLINE       0     0     0
>>             mirror/gm0           ONLINE       0     0     0
>>             mirror/gm1           ONLINE       0     0     0
>>             da2                  ONLINE       0     0     0
>>             da3                  ONLINE       0     0     0
>>             8076139616933977534  UNAVAIL      0     0     0  was /dev/da4
>>             da5                  ONLINE       0     0     0
>>             da6                  ONLINE       0     0     0
>>             da7                  ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>> As you can see, the raidz2 vdev is marked as ONLINE, when I think it
>> should be DEGRADED. Nevertheless, the pool is readable and writable,
>> and so far I haven't detected any problem. To be safe, I am pulling
>> all the data off and will recreate the pool from scratch, just in
>> case.
>>
>>
>> Pending questions:
>>
>> 1) Why did the "supposed corruption" happen in the first place? I
>> advise people not to mix disks from different zpools that share the
>> same name in the same computer. That's what I did, and maybe that is
>> what caused my problems.
>>
>> 2) Rolling back to an earlier uberblock seems to solve some faulted
>> zpool problems. I think it would be interesting to have a program
>> that lets you do it in a user-friendly way (after warning you about
>> the dangers, etc.).
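>>
>> A very rough userland sketch of what such a tool could start from (it
>> just reads the uberblock array out of vdev label 0 and lists the
>> candidate txgs; the offsets and field layout are taken from the
>> on-disk format document above, and it assumes 512-byte sectors and
>> native endianness; the names are made up):
>>
>> #include <fcntl.h>
>> #include <stdio.h>
>> #include <stdint.h>
>> #include <unistd.h>
>>
>> #define UB_MAGIC       0x00bab10cULL    /* "oo-ba-block" */
>> #define UB_ARRAY_OFF   (128 * 1024)     /* uberblock array inside label 0 */
>> #define UB_ARRAY_SIZE  (128 * 1024)
>> #define UB_SLOT_SIZE   1024             /* one slot per 1 KB with ashift=9 */
>>
>> int
>> main(int argc, char **argv)
>> {
>> 	static char buf[UB_ARRAY_SIZE];
>> 	int fd, i;
>>
>> 	if (argc != 2) {
>> 		fprintf(stderr, "usage: ubscan /dev/daX\n");
>> 		return (1);
>> 	}
>> 	if ((fd = open(argv[1], O_RDONLY)) < 0) {
>> 		perror(argv[1]);
>> 		return (1);
>> 	}
>> 	if (pread(fd, buf, sizeof (buf), UB_ARRAY_OFF) != sizeof (buf)) {
>> 		perror("pread");
>> 		return (1);
>> 	}
>> 	for (i = 0; i < UB_ARRAY_SIZE / UB_SLOT_SIZE; i++) {
>> 		uint64_t *ub = (uint64_t *)(buf + i * UB_SLOT_SIZE);
>>
>> 		/* ub[0]=ub_magic, ub[2]=ub_txg, ub[4]=ub_timestamp */
>> 		if (ub[0] == UB_MAGIC)
>> 			printf("slot %3d: txg=%llu timestamp=%llu\n", i,
>> 			    (unsigned long long)ub[2],
>> 			    (unsigned long long)ub[4]);
>> 	}
>> 	close(fd);
>> 	return (0);
>> }
>>
>> The real thing would of course also look at the other three label
>> copies (two at the start and two at the end of the device), validate
>> checksums, and only then let you pick a txg to roll back to.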
>>
>
>
> It would be interesting to see if the txid from all of your labels was
> the same. I would highly advise scrubbing your array.
I ran zdb -l on all the healthy disks, and all the labels (4 copies x
7 devices) were identical, except for the "guid" field near the beginning.
That is the vdev's own guid, so I think it's normal for it to differ from
disk to disk. The txg field was identical in all of them.
>
> I believe the reason that your "da4" is showing up with only a uuid is
> because zfs is now recognizing that the da4 it sees is not the correct
> one. Still very curious how you ended up in that situation. I wonder
> if you had corruption that was unknown before you removed da4.
The current da4 definitely has nothing to do with the zpool: it first
belonged to a different zpool, and later I zeroed out its beginning and
end. The GUID listed in "zpool status" is the same one that appears in
the zpool labels for the old da4.
I don't recall seeing any corruption before, and I scrubbed the pool
from time to time. By the way, thinking about this again, the timestamp
on the most recent uberblock was 6:32 CET, which coincides with the time
the server froze, while the disk swap took place about 2-3 hours later.
So maybe the disk swap had nothing to do with all this after all. The
disks are connected to a RAID controller, although they are exported in
pass-through mode.