ZFS...

Wed May 8 03:10:06 UTC 2019

>
>
> Every time I have seen this issue (and it's been more than once - though
> until now recoverable - even if extremely painful) - it's always been
> during a resilver of a failed drive and something happening... panic,
> another drive failure, power etc.. any other time it's rock solid...
> which is the yes and no... under normal circumstances zfs is very very
> good and seems as safe as or safer than UFS... but my experience is ZFS
> has one really bad flaw.. if there is a corruption in the metadata -
> even if the stored data is 100% correct - it will fault the pool and
> that's it, it's gone barring some luck and painful recovery (backups
> aside) ... other file systems also suffer from this but there are tools
> that *majority of the time* will get you out of the s**t with little pain.
> Barring this windows based tool I haven't been able to run yet, zfs
> appears to have nothing.
>
>
This is the difference I see here. You keep saying that all of the data on
the drive is 100% correct, and that only the metadata on the drive is
incorrect/corrupted. How do you know this? In particular, how did you know
it before you recovered the data from the drive? ZFS metadata is stored
redundantly on the drive and never in an inconsistent form (that is what
fsck does: it fixes the inconsistent data that most other filesystems leave
behind when they crash or have disk issues). If the metadata is corrupted,
how would ZFS know what else is correct (computers don't understand things,
they just follow the numbers)? If the redundant copies of the metadata are
corrupt, what are the odds that the file data is also corrupt? In my
experience, getting the metadata trashed and none of the file data trashed
is a rare event on a system with multi-drive redundancy.
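
To make the "just follow the numbers" point concrete, here is a rough sketch
of the idea, not ZFS's actual code. It assumes a block is kept as several
redundant copies and that the expected checksum is stored somewhere else
(in ZFS, in the parent block pointer). The names read_block and checksum
are made up for illustration, and SHA-256 merely stands in for whatever
checksum the pool is configured to use.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Stand-in checksum for this sketch; an assumption, not ZFS's default.
    return hashlib.sha256(data).hexdigest()


def read_block(copies: list, expected: str) -> bytes:
    """Return the first copy that matches the expected checksum.

    Any copy that does not match is overwritten with the good one
    ("self-healing"). If no copy matches, there is nothing left to
    compare against, so the read fails.
    """
    for candidate in copies:
        if checksum(candidate) == expected:
            for i in range(len(copies)):
                if checksum(copies[i]) != expected:
                    copies[i] = candidate  # repair the bad copy
            return candidate
    raise IOError("all copies fail checksum; cannot tell what is correct")


good = b"metadata block"
expected = checksum(good)

# One copy has silently rotted; the other is intact.
copies = [b"metadata blocc", good]
result = read_block(copies, expected)
```

With one good copy left, the read succeeds and the bad copy is repaired.
With every copy failing its checksum, the only honest answer is an error,
because nothing on the disk can say which bytes are the right ones.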

I have a friend/business partner that doesn't want to move to ZFS because
his recovery method is to wait for a single drive (no redundancy, sometimes no
backup) to fail and then use ddrescue to image the broken drive to a new
drive (ignoring any file corruption because you can't really tell without
ZFS). He's been using disk rescue programs for so long that he will not
move to ZFS, because it doesn't have a disk rescue program. He has systems
on Linux with ext3 and no mirroring or backups. I've asked about moving
them to a mirrored ZFS system and he has told me that the customer doesn't
want to pay for a second drive (but will pay for hours of his time to fix
the problem when it happens). You kind of sound like him: "ZFS is risky
because there isn't a good drive rescue program." Sun's design was that the
system should be redundant by default and checksum everything. If the
drives fail, replace them. If they fail too much or too fast, restore from
backup. Once the system has too much corruption, you can't recover/check
for all the damage without a second, off-disk copy. If you have that
off-disk copy, then you have a backup. They didn't build for the standard
use case found in PCs because disk recovery programs rarely get everything
back; therefore they can't be relied on to get your data back when your
data is important. Many PC owners have brought PC-mindset ideas to the
"UNIX" world. Sun's history predates Windows and Mac and comes from a
Mini/Mainframe mindset (where people tried not to guess about data
integrity).

Would a disk rescue program for ZFS be a good idea? Sure. Should the lack
of a disk recovery program stop you from using ZFS? No. If you think so, I
suggest that you have your data integrity priorities in the wrong order
(focusing on small, rare events rather than the common base case).

Walter

-- 
The greatest dangers to liberty lurk in insidious encroachment by men of
zeal, well-meaning but without understanding.   -- Justice Louis D. Brandeis