ZFS...

Karl Denninger karl at denninger.net
Tue May 7 13:47:05 UTC 2019


On 5/7/2019 00:02, Michelle Sullivan wrote:
> The problem I see with that statement is that the zfs dev mailing
> lists constantly and consistently following the line of, the data is
> always right there is no need for a “fsck” (which I actually get) but
> it’s used to shut down every thread... the irony is I’m now installing
> windows 7 and SP1 on a usb stick (well it’s actually installed, but
> sp1 isn’t finished yet) so I can install a zfs data recovery tool
> which reports to be able to “walk the data” to retrieve all the
> files...  the irony eh... install windows7 on a usb stick to recover a
> FreeBSD installed zfs filesystem...  will let you know if the tool
> works, but as it was recommended by a dev I’m hopeful... have another
> array (with zfs I might add) loaded and ready to go... if the data
> recovery is successful I’ll blow away the original machine and work
> out what OS and drive setup will be safe for the data in the future.
> I might even put FreeBSD and zfs back on it, but if I do it won’t be
> in the current Zraid2 config.

Meh.

Hardware failure is, well, hardware failure.  Yes, power-related
failures are hardware failures.

Never mind the potential for /software/ failures.  Bugs are, well,
bugs.  And they're a real thing.  Never had the shortcomings of UFS bite
you on an "unexpected" power loss?  Well, I have.  Is ZFS absolutely
safe against any such event?  No, but it's safe*r*.

I've yet to have ZFS lose an entire pool due to something bad happening,
but the same basic risk (entire filesystem being gone) has occurred more
than once in my IT career with other filesystems -- including UFS, lowly
MSDOS and NTFS, never mind their predecessors all the way back to floppy
disks and the first 5MB Winchesters.

I learned a long time ago that two is one and one is none when it comes
to data, and WHEN two becomes one you SWEAT, because that second failure
CAN happen at the worst possible time.

As for RaidZ2 vs. mirrored, it's not as simple as you might think.
Mirrored vdevs can only lose one member per mirror set, unless you use
three-member mirrors.  That sounds insane but actually it isn't in
certain circumstances, such as very-read-heavy and high-performance-read
environments.

The short answer is that a 2-way mirrored set is materially faster on
reads but gets no acceleration on writes, and can lose one member per
mirror.  If the SECOND member of a mirror fails before the resilver
completes -- and a resilver can take quite a long while with large
disks -- you're dead.  However, if you do six drives as three vdevs,
each a 2-way mirror, you now have three parallel data paths going at
once and potentially six for reads -- and performance is MUCH better.
A 3-way mirror can lose two members per vdev (with six drives that's
two vdevs of 3-way mirrors) but obviously requires lots of drive slots
and 3x as much *power* per gigabyte stored as a bare drive (and you pay
for power twice: once to buy it and again to get the heat out of the
room where the machine is.)
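
For illustration, those two six-drive mirror layouts look roughly like
this at the command line.  This is only a sketch: the pool name "tank"
and the da0-da5 device names are placeholders, not anything from this
thread.

    # three vdevs, each a 2-way mirror: ~3 drives of usable space,
    # can lose one drive per vdev
    zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5

    # two vdevs, each a 3-way mirror: ~2 drives of usable space,
    # can lose two drives per vdev
    zpool create tank mirror da0 da1 da2 mirror da3 da4 da5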

RaidZ2 can also lose 2 drives without being dead.  However, it doesn't
get any of the read performance improvement *and* takes a write
performance penalty; Z2 has a larger write penalty than Z1 since it has
to compute and write two parity entries instead of one, although in
theory at least it can parallel those parity writes -- albeit at the
cost of drive bandwidth congestion (e.g. interfering with other
accesses to the same disk at the same time.)  In short RaidZx performs
about as "well" as the *slowest* disk in the set.  So why use it
(particularly Z2) at all?  Because for "N" drives you get the
protection of a 3-way mirror and *much* more storage.  Take six 1TB
drives: a RaidZ2 setup returns ~4TB of usable space, 2-way mirrors
return 3TB, and 3-way mirrors (which provide the same protection
against drive failure as Z2) return only 2TB -- *half* the storage.
IMHO ordinary RaidZ isn't worth the trade-offs, but Z2 frequently is.
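
And the RaidZ2 equivalent, again only a sketch with the same
placeholder pool and device names: one six-disk raidz2 vdev, with
usable space of roughly N-2 drives.

    # one raidz2 vdev of six drives: ~4 drives of usable space,
    # any two drives can fail
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5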

In addition more spindles means more failures, all other things being
equal, so if you need "X" TB of storage and organize it as 3-way
mirrors you now have twice as many physical spindles as the equivalent
RaidZ2, which means on average you'll take twice as many faults.  If
performance is more important then the choice is obvious.  If density
is more important (that is, a lot or even most of the data is rarely
accessed at all) then the choice is fairly simple too.  In many
workloads you have some of both, and thus the correct choice is a
hybrid arrangement; that's what I do here, because I have a lot of data
that is rarely-to-never accessed and read-only, but also some data that
is frequently accessed and frequently written.  One size does not fit
all in such a workload.
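
As a sketch of what such a hybrid arrangement can look like (the pool
names, device names and datasets below are invented for the example,
not a description of my actual layout):

    # "fast" pool: striped 2-way mirrors for the hot, frequently
    # written data
    zpool create fast mirror da0 da1 mirror da2 da3
    zfs create fast/db

    # "bulk" pool: raidz2 for the rarely-read, mostly read-only archive
    zpool create bulk raidz2 da4 da5 da6 da7 da8 da9
    zfs create bulk/archive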

MOST systems, by the way, have this sort of paradigm (a huge percentage
of the data is rarely read and never written) but it doesn't become
economic or sane to try to separate the two until you get well into the
terabytes of storage range and a half-dozen or so physical volumes.
There's a very clean argument that below that point, with anything more
than one drive, mirroring is always the better choice.

Note that if you have an *adapter* go insane (and as I've noted here
I've had it happen TWICE in my IT career!) then *all* of the data on the
disks served by that adapter is screwed.

It doesn't make a bit of difference what filesystem you're using in that
scenario and thus you had better have a backup scheme and make sure it
works as well, never mind software bugs or administrator stupidity ("dd"
as root to the wrong target, for example, will reliably screw you every
single time!)

For a single-disk machine ZFS is no *less* safe than UFS and provides a
number of advantages, with arguably the most important being
easily-used snapshots.  Snapshots simplify backups, since coherency
during the backup is never at issue and incremental backups become fast
and easy.  In addition, boot environments make roll-forward and even
*roll-back* reasonable to implement for software updates -- a critical
capability if you ever run an OS version update and something goes
seriously wrong with it.  If you've never had that happen then consider
yourself blessed; it's NOT fun to manage in a UFS environment and often
winds up leading to a "restore from backup" scenario.  (To be fair it
can be with ZFS too if you're foolish enough to upgrade the pool before
being sure you're happy with the new OS rev.)
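
A rough sketch of both ideas follows; the dataset, snapshot and host
names are placeholders, and bectl(8) is the boot-environment manager on
FreeBSD 12 and later.

    # coherent, incremental backups via snapshots (assumes @monday was
    # already sent to backuphost)
    zfs snapshot tank/home@monday
    zfs snapshot tank/home@tuesday
    zfs send -i tank/home@monday tank/home@tuesday | \
        ssh backuphost zfs receive backup/home

    # take a boot environment before an OS update so you can roll back
    bectl create pre-upgrade
    # ...run the update; if it goes badly, reactivate the old BE:
    bectl activate pre-upgrade && shutdown -r now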

-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/