ZFS pool faulted (corrupt metadata) but the disk data appears ok...

Ben RUBSON ben.rubson at gmail.com
Fri Feb 2 16:12:06 UTC 2018


On 02 Feb 2018 11:51, Michelle Sullivan wrote:

> Ben RUBSON wrote:
>> On 02 Feb 2018 11:26, Michelle Sullivan wrote:
>>
>> Hi Michelle,
>>
>>> Michelle Sullivan wrote:
>>>> Michelle Sullivan wrote:
>>>>> So far (a few hours in) zpool import -fFX has not faulted with this
>>>>> image... it's currently using about 16G of the 32G of memory; the
>>>>> 9.2-p15 kernel died within minutes, out of memory (all 32G and swap),
>>>>> so I'm more optimistic at the moment...  Fingers crossed.
>>>> And the answer:
>>>>
>>>> 11-STABLE on a USB stick.
>>>>
>>>> Remove the drive that was replacing the hot spare (i.e. the
>>>> replacement drive for the one that initially died)
>>>> zpool import -fFX storage
>>>> zpool export storage
>>>>
>>>> reboot back to 9.x
>>>> zpool import storage
>>>> re-insert the replacement drive.
>>>> reboot
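
Putting those steps together, a minimal sketch of the recovery sequence
(assuming the pool is named "storage", as in the report above):

  # From the 11-STABLE rescue environment, with the in-progress
  # replacement drive physically removed:
  zpool import -fFX storage   # -f force, -F rewind to an earlier txg,
                              # -X extreme rewind (last-resort option)
  zpool export storage
  # Reboot into the original 9.x system, then:
  zpool import storage
  # Re-insert the replacement drive and reboot.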
>>> Got to thank people for this again; it saved me again, this time on a
>>> non-FreeBSD system (with a lot of use of a modified recoverdisk for
>>> OS X - thanks phk@)...  Lost 3 disks out of a raidz2, and 2 more had
>>> read errors on some sectors...  I don't know how much (if any) data
>>> I've lost, but at least it's not a rebuild from backup of all 48TB...
>>
>> What about the root cause?
>
> 3 disks died whilst the server was in transit from Malta to Australia
> (and I'm surprised that was all, considering the state of some of the
> stuff that came out of the container - there's a 3kVA UPS that is
> completely destroyed despite good packing.)
>> Sounds like you had 5 disks dying at the same time?
>
> Turns out that one of the 3 that had 'red lights on' had bad sectors; the
> other 2 were just excluded by the BIOS...  I did a byte copy onto new
> drives, found no read errors, so I put them back in and forced them
> online.  The other one had 78k of unreadable bytes, so a new disk went in
> and I convinced the controller that it was the same disk as the one it
> replaced.  The export/import then turned up unrecoverable read errors on
> 2 more disks that nothing had flagged previously, so I byte-copied them
> onto new drives, and the import -fFX is currently running (5 hours so
> far)...
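
A sketch of the byte-copy approach described above, using FreeBSD's
recoverdisk(1); the device names and worklist path here are hypothetical:

  # Copy a failing disk onto a fresh one; recoverdisk retries bad
  # regions with progressively smaller block sizes, and -w saves the
  # list of remaining blocks so the copy can be resumed later:
  recoverdisk -w /root/da3.worklist /dev/da3 /dev/da9
  # After the pool is imported, bring a previously-excluded device
  # back into service:
  zpool online storage da3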
>
>> Do you periodically run long SMART tests?
>
> Yup (fully automated.)
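
For reference, one way to automate such long tests is smartd(8); a
minimal sketch of a smartd.conf entry, with hypothetical device names:

  # Monitor all attributes (-a), run a short self-test daily at 02:00
  # and a long self-test every Saturday at 03:00:
  /dev/da0 -a -s (S/../.././02|L/../../6/03)
  /dev/da1 -a -s (S/../.././02|L/../../6/03)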
>
>> Zpool scrubs?
>
> Both servers took a zpool scrub before they were packed into the
> containers... the second one came out unscathed... but then most stuff in
> the second container came out unscathed, unlike the first...
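
A one-off scrub like the pre-shipping one above can be run and checked by
hand, and FreeBSD's periodic(8) can handle recurring ones; a minimal
sketch, again assuming the pool name "storage":

  zpool scrub storage
  zpool status -v storage   # watch progress and any repaired errors

  # /etc/periodic.conf
  daily_scrub_zfs_enable="YES"
  daily_scrub_zfs_default_threshold="35"   # days between scrubs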

What a story! Thanks for the details.

So the disks died because of the carrier, I assume, as the second server
came out unscathed...
The heads must have scratched the platters, but they should have been
parked, so... really strange.

Hope you'll recover your whole pool.

Ben


