ZFS panic under extreme circumstances (2/3 disks corrupted)
serenity at exscape.org
Mon May 25 16:12:59 UTC 2009
On May 25, 2009, at 05:39 PM, Freddie Cash wrote:
> On Mon, May 25, 2009 at 2:13 AM, Thomas Backman
> <serenity at exscape.org> wrote:
>> On May 24, 2009, at 09:02 PM, Thomas Backman wrote:
>>> So, I was playing around with RAID-Z and self-healing...
>> Yet another follow-up to this.
>> It appears that all traces of errors vanish after a reboot. So, say
>> you have
>> a dying disk; ZFS repairs the data for you, and you don't notice
>> (unless you
>> check zpool status). Then you reboot, and as far as I can tell
>> there's NO (easy?) way to find out that something is wrong with
>> your hardware!
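One way to at least notice this before a reboot erases the evidence: on FreeBSD, the stock periodic(8) scripts can report pool health in the daily status mail, so a degraded pool shows up even if you never run zpool status by hand. A minimal sketch, assuming the 404.status-zfs periodic script is present on your version:

```sh
# /etc/periodic.conf -- have the daily periodic run include
# "zpool status -x" output, which is non-empty when any pool
# has errors or a faulted device.
daily_status_zfs_enable="YES"
```

This only helps while the error counters are still set, of course; it does nothing about the counters being lost across a reboot.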
> On our storage server that was initially configured using 1 large
> 24-drive raidz2 vdev (don't do that, by the way), we had 1 drive go
> south. "zpool status" was full of errors. And the error counts
> survived reboots. Either that, or the drive was so bad that the error
> counts started increasing right away after a boot. After a week of
> fighting with it to get the new drive to resilver and get added to the
> vdev, we nuked it and re-created it using 3 raidz2 vdevs each
> comprised of 8 drives.
> (Un)fortunately, that was the only failure we've had so far, so we
> can't really confirm or deny the "error counts reset after reboot"
> behavior.
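As an aside on the layout change above: splitting 24 drives into three 8-drive raidz2 vdevs also costs usable capacity, since each raidz2 vdev gives up two drives to parity. A quick back-of-the-envelope check (the 24/8 figures are from the setup described above, the arithmetic is just the standard raidz2 rule):

```shell
#!/bin/sh
# Usable data disks in a raidz2 layout: (disks per vdev - 2) * vdevs.
single=$(( (24 - 2) * 1 ))   # one 24-disk raidz2 vdev
multi=$(( (8 - 2) * 3 ))     # three 8-disk raidz2 vdevs
echo "$single $multi"        # prints "22 18"
```

So the safer layout trades four drives' worth of space for shorter resilvers and the ability to survive two failures per vdev.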
Was this on FreeBSD?
I have another unfortunate thing to note regarding this: after a
reboot, it's impossible even to tell *which disk* went bad, even if
the pool is "uncleared" but otherwise "healed". It simply says that
a device has failed, with no clue as to which one, since they're all