ZFS...

Michelle Sullivan michelle at sorbs.net
Wed May 8 01:01:24 UTC 2019


Karl Denninger wrote:
> On 5/7/2019 00:02, Michelle Sullivan wrote:
>> The problem I see with that statement is that the zfs dev mailing lists constantly and consistently follow the line of "the data is always right, there is no need for a fsck" (which I actually get), but it's used to shut down every thread... the irony is I'm now installing Windows 7 and SP1 on a USB stick (well, it's actually installed, but SP1 isn't finished yet) so I can install a zfs data recovery tool which purports to be able to "walk the data" to retrieve all the files... the irony, eh... install Windows 7 on a USB stick to recover a FreeBSD-installed zfs filesystem... will let you know if the tool works, but as it was recommended by a dev I'm hopeful... have another array (with zfs, I might add) loaded and ready to go... if the data recovery is successful I'll blow away the original machine and work out what OS and drive setup will be safe for the data in the future.  I might even put FreeBSD and zfs back on it, but if I do it won't be in the current raidz2 config.
> Meh.
>
> Hardware failure is, well, hardware failure.  Yes, power-related
> failures are hardware failures.
>
> Never mind the potential for /software/ failures.  Bugs are, well,
> bugs.  And they're a real thing.  Never had the shortcomings of UFS bite
> you on an "unexpected" power loss?  Well, I have.  Is ZFS absolutely
> safe against any such event?  No, but it's safe*r*.

Yes and no ... I'll explain...

>
> I've yet to have ZFS lose an entire pool due to something bad happening,
> but the same basic risk (entire filesystem being gone)

Every time I have seen this issue (and it's been more than once -
though until now recoverable - even if extremely painful) it's always
been during a resilver of a failed drive and something happening...
panic, another drive failure, power, etc.  Any other time it's rock
solid... which is the yes and no... under normal circumstances zfs is
very, very good and seems as safe as or safer than UFS... but my
experience is ZFS has one really bad flaw: if there is a corruption in
the metadata - even if the stored data is 100% correct - it will fault
the pool and that's it, it's gone, barring some luck and painful
recovery (backups aside)... other filesystems also suffer from this,
but there are tools that *the majority of the time* will get you out
of the s**t with little pain.  Barring this Windows-based tool I
haven't been able to run yet, zfs appears to have nothing.
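
(For reference, the only in-tree recovery knobs I'm aware of are the
import rewind options, roughly along these lines - "tank" and the
device path are just placeholders, and -F can discard recent
transactions, so the dry run and a read-only import are the safe
order:

    # see what the on-disk labels claim before touching anything
    zdb -l /dev/da0p3

    # dry-run a rewind import to see if zfs thinks it can roll back
    zpool import -F -n tank

    # if that looks sane, try it read-only so nothing gets written
    zpool import -F -o readonly=on tank

...and when those fail, you're back to "hope the Windows tool works".)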

> has occurred more
> than once in my IT career with other filesystems -- including UFS, lowly
> MSDOS and NTFS, never mind their predecessors all the way back to floppy
> disks and the first 5Mb Winchesters.

Absolutely, been there, done that... and btrfs... *ouch*, still as
bad... however with the only btrfs install I had (I didn't know it was
btrfs underneath, but a Netgear NAS...) I was still able to recover
the data, even though it had screwed the filesystem so badly I vowed
never to consider or use it again on anything, ever...

>
> I learned a long time ago that two is one and one is none when it comes
> to data, and WHEN two becomes one you SWEAT, because that second failure
> CAN happen at the worst possible time.

and it does...

>
> As for RaidZ2 .vs. mirrored it's not as simple as you might think.
> Mirrored vdevs can only lose one member per mirror set, unless you use
> three-member mirrors.  That sounds insane but actually it isn't in
> certain circumstances, such as very-read-heavy and high-performance-read
> environments.

I know - this is why I don't use mirrors - because wear patterns will
ensure both sides of the mirror are closely matched, and so likely to
fail close together.

>
> The short answer is that a 2-way mirrored set is materially faster on
> reads but has no acceleration on writes, and can lose one member per
> mirror.  If the SECOND one fails before you can resilver, and that
> resilver takes quite a long while if the disks are large, you're dead.
> However, if you do six drives as a 3x2 mirror (that is, 3 vdevs each
> of a 2-way mirror) you now have three parallel data paths going at once
> and potentially six for reads -- and performance is MUCH better.  A
> 3-way mirror can lose two members (and could be organized as 2x3) but
> obviously requires lots of drive slots, 3x as much *power* per gigabyte
> stored (and you pay for power twice; once to buy it and again to get the
> heat out of the room where the machine is.)

My problem (as always) is slots, not so much the power.
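
(To make the layouts above concrete, the two six-disk arrangements
Karl describes would be built roughly like this - pool and device
names are just placeholders:

    # three 2-way mirror vdevs (striped mirrors): fast reads,
    # but only one disk may fail per vdev
    zpool create fastpool mirror da0 da1 mirror da2 da3 mirror da4 da5

    # one six-disk raidz2 vdev: more usable space, any two disks
    # may fail
    zpool create bigpool raidz2 da0 da1 da2 da3 da4 da5

Same number of slots either way, which is exactly my constraint.)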

>
> Raidz2 can also lose 2 drives without being dead.  However, it doesn't
> get any of the read performance improvement *and* takes a write
> performance penalty; Z2 has more write penalty than Z1 since it has to
> compute and write two parity entries instead of one, although in theory
> at least it can parallel those parity writes -- albeit at the cost of
> drive bandwidth congestion (e.g. interfering with other accesses to the
> same disk at the same time.)  In short RaidZx performs about as "well"
> as the *slowest* disk in the set.
Which is why I built mine with identical drives (though from different
production batches :) )... the majority of the data in my storage
array is write once (or twice), read many.

>    So why use it (particularly Z2) at
> all?  Because for "N" drives you get the protection of a 3-way mirror
> and *much* more storage.  A six-member RaidZ2 setup returns ~4TB of
> usable space, where with a 2-way mirror it returns 3TB and a 3-way
> mirror (which provides the same protection against drive failure as Z2)
> you have only *half* the storage.  IMHO ordinary Raidz isn't worth the
> trade-offs, but Z2 frequently is.
>
> In addition more spindles means more failures, all other things being
> equal, so if you need "X" TB of storage and organize it as 3-way mirrors
> you now have twice as many physical spindles which means on average
> you'll take twice as many faults.  If performance is more important then
> the choice is obvious.  If density is more important (that is, a lot or
> even most of the data is rarely accessed at all) then the choice is
> fairly simple too.  In many workloads you have some of both, and thus
> the correct choice is a hybrid arrangement; that's what I do here,
> because I have a lot of data that is rarely-to-never accessed and
> read-only but also have some data that is frequently accessed and
> frequently written.  One size does not fit all in such a workload.
This is where I came to two systems (with different data)... one was
for density, the other for performance.  Storage vs. working set, etc.
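
(Spelling Karl's numbers out, assuming six 1TB disks: raidz2 keeps
6 - 2 = 4TB usable, 2-way mirrors keep 6/2 = 3TB, and 3-way mirrors
keep 6/3 = 2TB - i.e. half the raidz2 figure for the same protection
against two failures.)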

> MOST systems, by the way, have this sort of paradigm (a huge percentage
> of the data is rarely read and never written) but it doesn't become
> economic or sane to try to separate them until you get well into the
> terabytes of storage range and a half-dozen or so physical volumes.
> There's a very clean argument that prior to that point, but with
> greater than one drive, mirrored is always the better choice.
>
> Note that if you have an *adapter* go insane (and as I've noted here
> I've had it happen TWICE in my IT career!) then *all* of the data on the
> disks served by that adapter is screwed.

100% with you - been there, done that... and it doesn't matter what OS
or filesystem, a hardware failure where silent data corruption happens
because of an adapter will always take you out (and ZFS will not save
you in many cases of that either.)
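
(That said, a periodic scrub is still the best early warning ZFS
gives you for that kind of silent damage, even when it can't repair
it; "tank" is a placeholder again:

    # kick off a scrub, then watch the CKSUM column and error list
    zpool scrub tank
    zpool status -v tank

...at least you find out the adapter is lying to you sooner rather
than later.)
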
>
> It doesn't make a bit of difference what filesystem you're using in that
> scenario and thus you had better have a backup scheme and make sure it
> works as well, never mind software bugs or administrator stupidity ("dd"
> as root to the wrong target, for example, will reliably screw you every
> single time!)
>
> For a single-disk machine ZFS is no *less* safe than UFS and provides a
> number of advantages, with arguably the most-important being easily-used
> snapshots.

Depends... in normal operation I agree, but when it comes to all or
nothing, that is a matter of perspective.  Personally I prefer to have
in-place recovery options and/or multiple *possible* recovery options
rather than... "destroy the pool and recreate it from scratch, hope
you have backups"...

>    Not only does this simplify backups, since coherency during
> the backup is never at issue and incremental backups become fast and
> easily done; in addition, boot environments make roll-forward and even
> *roll-back* reasonable to implement for software updates -- a critical
> capability if you ever run an OS version update and something goes
> seriously wrong with it.  If you've never had that happen then consider
> yourself blessed;

I have been there (especially with the early (pre-0.83 kernel)
versions of Linux :) )
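
(For what it's worth, the snapshot/boot-environment workflow Karl
describes looks roughly like this on a recent FreeBSD - dataset and
snapshot names are placeholders, bectl is the 12.x tool and older
systems have beadm from ports:

    # coherent point-in-time copy, then an incremental send to a
    # backup pool that already holds the earlier snapshot
    zfs snapshot tank/data@before-upgrade
    zfs send -i tank/data@last-backup tank/data@before-upgrade | \
        zfs receive backup/data

    # snapshot the running system as a fallback boot environment
    bectl create pre-upgrade
    # ...do the OS update; if it goes wrong, roll back and reboot:
    bectl activate pre-upgrade

None of which helps, of course, once the pool itself won't import.)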

>   it's NOT fun to manage in a UFS environment and often
> winds up leading to a "restore from backup" scenario.  (To be fair it
> can be with ZFS too if you're foolish enough to upgrade the pool before
> being sure you're happy with the new OS rev.)
>
Actually I have a simple way with UFS (and ext2/3/4, etc.)... split
the boot disk almost down the center: create 3 partitions - root,
swap, altroot.  Root and altroot are almost identical; one is always
active, the new OS goes on the other, and you switch to make the other
active (primary) once you've tested it.  It only gives one level of
roll forward/roll back, but it works for me and has never failed (boot
disk/OS wise) since I implemented it... but then I don't let anyone
else in the company have root access, so they cannot dd or "rm -r . /"
or "rm -r .*" (both of which are the only ways I have done that before
- back in 1994 - and never done it since; it's something you learn or
get out of IT :P ... and for those who didn't get the latter, it
should have been 'rm -r .??*' - and why are you on '-stable'...? :P )
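
(For the curious, one way to wire that up on a GPT boot disk - device
name, sizes and indexes are placeholders, and this assumes the GPT
scheme plus a freebsd-boot partition already exist at index 1, since
gptboot's "bootme" attribute is what flips between the two roots:

    # two nearly identical roots with swap between them
    gpart add -t freebsd-ufs  -s 100G -l root0 ada0
    gpart add -t freebsd-swap -s 16G  -l swap0 ada0
    gpart add -t freebsd-ufs  -s 100G -l root1 ada0

    # new OS goes onto the inactive root; then point the boot code
    # at it (each root's /etc/fstab mounts its own partition as /)
    gpart set   -a bootme -i 4 ada0
    gpart unset -a bootme -i 2 ada0

...rolling back is just swapping the bootme flags again.)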

Regards,

-- 
Michelle Sullivan
http://www.mhix.org/



