Karl Denninger karl at denninger.net
Tue Apr 30 13:33:50 UTC 2019

On 4/30/2019 03:09, Michelle Sullivan wrote:
> Consider..
> If one triggers such a fault on a production server, how can one
> justify transferring multiple terabytes (or even petabytes now) of data
> from backup to repair an unmountable/faulted array?  All backup
> solutions I know of would currently take days, if not weeks, to restore
> the sort of store ZFS is touted as supporting.

Had it happen on a production server a few years back with ZFS.  The
*hardware* went insane (disk adapter) and scribbled on *all* of the vdevs.

The machine crashed and would not come back up -- at all.  I insist on
having emergency boot media (a USB key) physically in the box in any
production machine, and I had it here; it was quite quickly obvious that
all of the vdevs were corrupted beyond repair.  There was no rational
option other than to restore.

It was definitely not a pleasant experience, but this is why, once you
get into systems and data store sizes where a restore is a five-alarm
pain in the neck, you must figure out a strategy that covers you 99% of
the time without a large amount of downtime, and accept said downtime in
the 1% case.  In this particular circumstance the customer originally
didn't want to spend on a doubled, transaction-level-protected on-site
(same DC) redundancy setup.  That ruled out failing over to (promoting)
a replica and then restoring and building a new "redundant" box where
the old "primary" resided, so a straight restore was the most viable
option.  Time to recover essential functions was ~8 hours, and it took
over 24 hours for everything to be restored.

Incidentally that's not the first time I've had a disk adapter failure
on a production machine in my career as a systems dude; it was, in fact,
the *third* such failure.  Then again I've been doing this stuff since
the 1980s and learned long ago that if it can break it eventually will,
and that Murphy is a real b******.

The answer to your question, Michelle, is that when restore times get
into "seriously disruptive" territory (hours, days or worse, depending
on the application involved and how critical it is) you spend the time
and money to have redundancy in multiple places, via paths that do not
destroy the redundant copies when things go wrong.  You also spend the
engineering time to figure out what the potential faults are and how to
design things so that a fault which can destroy the data set does not
propagate to the redundant copies before it is detected.
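
To put rough numbers on when restore time becomes "seriously
disruptive", here is a back-of-the-envelope sketch.  It is only an
illustration: the function name, the data sizes and the 500 MB/s
sustained restore throughput are all made-up assumptions, not figures
from the incident described above, and a real restore also pays for
verification, metadata and operator time.

```python
def restore_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to stream data_tb terabytes at throughput_mb_s MB/s.

    Uses decimal units (1 TB = 1,000,000 MB) and assumes the throughput
    is sustained end to end, which is optimistic.
    """
    total_mb = data_tb * 1_000_000
    seconds = total_mb / throughput_mb_s
    return seconds / 3600


# Hypothetical store sizes and throughput, for illustration only.
for size_tb in (10, 100, 1000):
    hours = restore_hours(size_tb, throughput_mb_s=500)
    print(f"{size_tb:>5} TB at 500 MB/s: {hours:8.1f} h ({hours / 24:.1f} days)")
```

Under those assumptions 100 TB already takes roughly 56 hours to pull
back, which is about where a second copy that can simply be promoted
starts to look cheap compared with the outage.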

Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/