ZFS pool faulted (corrupt metadata) but the disk data appears ok...
Michelle Sullivan
michelle at sorbs.net
Fri Feb 2 20:49:02 UTC 2018
Ben RUBSON wrote:
> On 02 Feb 2018 11:51, Michelle Sullivan wrote:
>
>> Ben RUBSON wrote:
>>> On 02 Feb 2018 11:26, Michelle Sullivan wrote:
>>>
>>> Hi Michelle,
>>>
>>>> Michelle Sullivan wrote:
>>>>> Michelle Sullivan wrote:
>>>>>> So far (few hours in) zpool import -fFX has not faulted with this
>>>>>> image... it's eating memory - currently about 16G of 32G - however the
>>>>>> 9.2-P15 kernel died within minutes... out of memory (all 32G and swap),
>>>>>> so I'm more optimistic at the moment... Fingers crossed.
>>>>> And the answer:
>>>>>
>>>>> 11-STABLE on a USB stick.
>>>>>
>>>>> Remove the drive that was replacing the hotspare (ie the replacement
>>>>> drive for the one that initially died)
>>>>> zpool import -fFX storage
>>>>> zpool export storage
>>>>>
>>>>> reboot back to 9.x
>>>>> zpool import storage
>>>>> re-insert the replacement drive.
>>>>> reboot
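(Side note, since people keep landing on this thread: my understanding of
those flags - double-check against zpool(8) - is that -f forces the import,
-F rewinds the pool to the last importable transaction group, and -X lets
the rewind search much further back, at the cost of discarding recent
writes. A dry run and a read-only import are worth trying before the full
-fFX, e.g.:

  zpool import -f -F -n storage               # dry run: show what -F would discard
  zpool import -f -F -o readonly=on storage   # read-only, to copy data off first
  zpool import -fFX storage                   # last resort: extreme rewind
)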
>>>> Gotta thank people for this again - it saved me this time on a
>>>> non-FreeBSD system (with a lot of help from a modified recoverdisk
>>>> for OSX - thanks PSK@)... Lost 3 disks out of a raidz2 and 2 more
>>>> had read errors on some sectors.. don't know how much (if any) data
>>>> I've lost, but at least it's not a rebuild from backup of all 48TB..
>>>
>>> What about the root-cause ?
>>
>> 3 disks died whilst the server was in transit from Malta to Australia
>> (and I'm surprised that was all, considering the state of some of the
>> stuff that came out of the container - I have a 3kVA UPS that was
>> completely destroyed despite good packing.)
>>> Sounds like you had 5 disks dying at the same time ?
>>
>> Turns out that one of the 3 that had 'red lights on' had bad sectors;
>> the other 2 were just excluded by the BIOS... I did a byte copy onto
>> new drives, found no read errors, so put them back in and forced them
>> online. The other one had 78k of unreadable bytes, so a new disk went
>> in and I convinced the controller that it was the same disk as the one
>> it replaced. The export/import then turned up unrecoverable read errors
>> on 2 more disks that nothing had flagged previously, so I byte-copied
>> them onto new drives too, and the import -fFX is currently working
>> (5 hours so far)...
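(Rough sketch of that byte-copy / force-online dance on FreeBSD, with
made-up device names - the OSX run used a modified recoverdisk, but the
base-system recoverdisk(1) does the same job of retrying and skipping
unreadable blocks:

  recoverdisk /dev/da3 /dev/da9    # clone the failing member onto a fresh disk
  zpool online storage da9         # bring the cloned member back into the pool
  zpool status -v storage          # watch the resilver and error counters
)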
>>
>>> Do you periodically run long smart tests ?
>>
>> Yup (fully automated.)
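(For anyone wondering what "fully automated" can look like: one way is a
smartd.conf entry like the below - illustrative device name and schedule,
not necessarily what this box runs:

  /dev/da0 -a -s L/../../7/03    # monitor everything, long self-test Sundays at 03:00
)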
>>
>>> Zpool scrubs ?
>>
>> Both servers took a zpool scrub before they were packed into the
>> containers... the second one came out unscathed... but then most
>> stuff in the second container came out unscathed unlike the first....
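(On FreeBSD the scrubs can also be driven by periodic(8) - an example
/etc/periodic.conf snippet, assuming the stock 800.scrub-zfs script:

  daily_scrub_zfs_enable="YES"
  daily_scrub_zfs_default_threshold="35"    # scrub each pool roughly every 35 days

or just kick one off by hand with: zpool scrub storage
)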
>
> What a story ! Thanks for the details.
>
> So disks died because of the carrier, as I assume the second unscathed
> server was OK...
Pretty much.
> Heads must have scratched the platters, but they should have been
> parked, so... Really strange.
>
You'd have thought... though 2 of the drives look like wear and tear
issues (the 2 not showing red lights), just not picked up on the
periodic scrub.... Could be that the recovery showed that up... you
know - how you can have an array working fine, but one disk dies and
then others fail during the rebuild because of the extra workload.
> Hope you'll recover your whole pool.
So do I - it was my build server; everything important is backed up with
multiple redundancies except for the build VMs.. it'd take me about 4
weeks to rebuild it if I have to put it all back from backups and
rebuild the build VMs.. but hey, at least I can rebuild it, unlike many
with big servers. :P
That said, the import -fFX is still running (and it is actually running),
so it's still scanning/rebuilding the metadata.
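(Once an import like this finally completes, the usual follow-up - from
memory, so verify against the man pages - is something like:

  zpool status -v storage    # look for permanent errors and affected files
  zpool scrub storage        # full scrub to shake out anything the rewind missed
  zpool clear storage        # reset the error counters once you're happy
)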
Michelle
--
Michelle Sullivan
http://www.mhix.org/