ZFS...

Tue Apr 30 09:51:00 UTC 2019

On Tue, Apr 30, 2019 at 5:08 PM Michelle Sullivan <michelle at sorbs.net>
wrote:

> but in my recent experience 2 issues colliding at the same time results in
> disaster
>

Do we know exactly what kind of corruption happened to your pool?  If you
saw it twice in a row, that might suggest a software bug that should be
investigated.

Note that ZFS stores multiple copies of its essential metadata.  In my
experience with my old, crappy consumer-grade hardware (non-ECC RAM and
several faulty single-drive pools, bad enough to crash almost monthly and
damage my data from time to time), I have never seen corruption this bad,
and I was always able to recover the pool.  At my previous employer, the
only case where a pool was corrupted badly enough that mounting was not
allowed happened because two host nodes imported the pool at the same
time, a situation that can be avoided with SCSI reservations; their
hardware was of much better quality, though.

Speaking of a tool like 'fsck': I think I'm mostly convinced that it's
*not* necessary, because by the time ZFS says the metadata is corrupted,
that metadata really is corrupted beyond repair (all replicas were
corrupted; otherwise ZFS would have recovered by finding a copy with a
good checksum and rewriting the bad ones).
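
To make that recovery logic concrete, here is a minimal sketch of a
"self-healing" read in Python.  This is only an illustration of the idea,
not ZFS's actual code: the names checksum() and self_healing_read() are
made up, the replicas are plain byte strings in a list, and SHA-256 is
used purely for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Illustrative checksum; stands in for whatever the pool is configured to use.
    return hashlib.sha256(data).hexdigest()


def self_healing_read(copies: list, expected: str) -> bytes:
    """Return a replica matching `expected`, repairing bad replicas in place.

    `copies` is a hypothetical list of replicas of one block; `expected` is
    the checksum recorded by the parent that points to the block.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        # Every replica fails verification: nothing trustworthy is left
        # to repair from, which is the "beyond repair" case.
        raise IOError("all replicas fail checksum verification")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # rewrite the bad replica from the good one
    return good
```

As long as one replica verifies, the read succeeds and the others are
repaired; the error is raised only when no copy matches, which is the
point at which a repair tool would have nothing reliable to work from.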

An interactive tool may be useful (e.g. "I see versions 1, 2 and 3 of this
data structure, all with bad checksums; choose which one you want to
try"), but I don't think it would be very practical for large data
pools -- unlike traditional filesystems, ZFS uses copy-on-write and depends
heavily on the metadata to find where the data is, so a regular "scan" is
not really useful.

I'd agree that you need a full backup anyway, regardless of what storage
system is used, though.