Musings on ZFS Backup strategies

Sat Mar 2 22:57:14 UTC 2013

On 3/2/2013 4:14 PM, Peter Jeremy wrote:
> On 2013-Mar-01 08:24:53 -0600, Karl Denninger <karl at denninger.net> wrote:
>> If I then restore the base and snapshot, I get back to where I was when
>> the latest snapshot was taken.  I don't need to keep the incremental
>> snapshot for longer than it takes to zfs send it, so I can do:
>>
>> zfs snapshot pool/some-filesystem@unique-label
>> zfs send -i pool/some-filesystem@base pool/some-filesystem@unique-label
>> zfs destroy pool/some-filesystem@unique-label
>>
>> and that seems to work (and restore) just fine.
> This gives you an incremental since the base snapshot - which will
> probably grow in size over time.  If you are storing the ZFS send
> streams on (eg) tape, rather than receiving them, you probably still
> want the "Towers of Hanoi" style backup hierarchy to control your
> backup volume.  It's also worth noting that whilst the stream will
> contain the compression attributes of the filesystem(s) in it, the
> actual data in the stream is uncompressed.
I noted that.  The script I wrote to do this looks at the compression
status in the filesystem and, if enabled, pipes the data stream through
pbzip2 on the way to storage.  The only problem with this presumption is
that for database "data" filesystems the "best practices" say that you
should set the recordsize to that of the underlying page size of the
dbms (e.g. 8k for Postgresql) for best performance and NOT enable
compression.

Reality however is that the on-disk format of most database files is
EXTREMELY compressible (often WELL better than 2:1), so I sacrifice
there.  I think the better option is to stuff a user parameter into the
filesystem attribute table (which apparently I can do without boundary)
telling the script whether or not to compress on output so it's not tied
to the filesystem's compression setting.
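
A minimal sketch of that idea follows.  It assumes a user property
named backup:compress, a name made up here for illustration (user
property names need a colon in them); the function names are likewise
hypothetical.  An unset or unrecognized value falls back to no
compression.

```sh
#!/bin/sh
# Hypothetical user property; not a standard ZFS property name.
PROP="backup:compress"

# Map a property value to the filter the stream is piped through.
# "zfs get" prints "-" for a user property that has not been set, so
# anything other than an explicit "on"/"yes"/"true" means "no compression".
choose_filter() {
    case "$1" in
        on|yes|true) echo "pbzip2 -c" ;;
        *)           echo "cat" ;;
    esac
}

# Look up the property on a dataset and print the filter to use.
# If the lookup itself fails, fall back to the uncompressed path.
filter_for_dataset() {
    value=$(zfs get -H -o value "$PROP" "$1" 2>/dev/null) || value="-"
    choose_filter "$value"
}
```

The property would be set once per filesystem with

    zfs set backup:compress=on pool/some-filesystem

and the script would then pipe the send stream through whatever
filter_for_dataset prints for that filesystem, independent of the
filesystem's own compression setting.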

I'm quite curious, in fact, as to whether the "best practices" really
are best in today's world.  Specifically, for a CPU-laden machine with lots
of compute power I wonder if enabling compression on the database
filesystems and leaving the recordsize alone would be a net performance
win due to the reduction in actual I/O volume.  This assumes you have
the CPU available, of course, but that has gotten cheaper much faster
than I/O bandwidth has.

>> This in turn means that keeping more than two incremental dumps offline
>> has little or no value; the second merely being taken to ensure that
>> there is always at least one that has been written to completion without
>> error to apply on top of the base.
> This is quite a critical point with this style of backup: The ZFS send
> stream is not intended as an archive format.  It includes error
> detection but no error correction and any error in a stream renders
> the whole stream unusable (you can't retrieve only part of a stream).
> If you go this way, you probably want to wrap the stream in a FEC
> container (eg based on ports/comms/libfec) and/or keep multiple copies.
That's no more of a problem than it is for a dump file saved on a disk
though, is it?  While restore can (putatively) read past errors on a
tape, in reality if the storage is a disk and part of the file is
unreadable the REST of that particular archive is unreadable.  Skipping
unreadable records does "sorta work" for tapes, but it rarely if ever
does for storage onto a spinning device within the boundary of the
impacted file.

In practice I attempt to cover this by (1) saving the stream to local
disk and then (2) rsync'ing the first disk to a second in the same
cabinet.  If the file I just wrote is unreadable I should discover it at
(2), which hopefully is well before I actually need it in anger.  Disk
#2 then gets rotated out to an offsite vault on a regular schedule in
case the building catches fire or similar.  My exposure here is to
time-related bitrot which is a non-zero risk but I can't scrub a disk
that's sitting in a vault, so I don't know that there's a realistic
means around this risk other than a full online "hotsite" that I can
ship the snapshots to (which I don't have the necessary bandwidth or
storage to cover.)
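
Steps (1) and (2) can be sketched roughly as below.  This is not the
rsync-based script described above; it swaps in plain cp followed by
cmp so that the verification step is explicit, and the function name
copy_and_verify is made up for illustration.

```sh
#!/bin/sh
# Copy a saved stream file to a second disk, then compare the two
# byte-for-byte.  cmp has to read both files in full, so an unreadable
# source or a mismatched copy is reported here rather than at restore time.
copy_and_verify() {
    src=$1
    dst=$2
    cp "$src" "$dst" || { echo "copy failed: $src" >&2; return 1; }
    cmp -s "$src" "$dst" || { echo "verify failed: $dst" >&2; return 1; }
    echo "verified: $dst"
}
```

One caveat: immediately after a copy the comparison may be served from
the buffer cache rather than from the disk itself, so this catches an
unreadable source more reliably than a bad write on the second disk.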

If I change the backup media (currently UFS formatted) to ZFS formatted
and dump directly there via a zfs send/receive I could run both drives
as a mirror instead of rsync'ing from one to the other after the first
copy is done, then detach the mirror to rotate the drive out and attach
the other one, causing a resilver.  That's fine EXCEPT if I have a
controller go insane I now probably lose everything other than the
offsite copy since everything is up for write during the snapshot
operation.  That ain't so good and that's a risk I've had turn into
reality twice in 20 years.  On the upside if the primary has an error on
it I catch it when I try to resilver as that operation will fail since
the entire data structure that's on-disk and written has to be traversed
and the checksums should catch any silent corruption. If that happens I
know I'm naked (other than the vault copy which I hope is good!) until I
replace the backup drive with the error and re-copy everything.

What I have trouble quantifying is which is the LARGER risk; I've yet to
have a backup drive that is unreadable when I needed it, and I do test
my restore capability pretty regularly, but twice in 20 years I've had
active disk adapters in running machines destroy every write-mounted
drive that was attached to them without warning.  Both times the pucker
factor went off the charts as soon as I realized what had happened as
from an operational perspective it was pretty-much identical to a
tornado or fire destroying the machine.

> The "recommended" approach is to do zfs send | zfs recv and store a
> replica of your pool (with whatever level of RAID that meets your
> needs).  This way, you immediately detect an error in the send stream
> and can repeat the send.  You then use scrub to verify (and recover)
> the replica.
I'm contemplating how to set that up in a way that works and has a
reasonable associated operational profile for putting it into practice. 
What I do now leaves the backup volumes unmounted except when actually
being written to, which decreases (but does not completely eliminate)
the risk of an insane controller scribbling on the backup volumes.  
Setting read-only on the volumes doesn't help me at a filesystem level
as the risk here is that of insane software and the days of a nice
physical WRITE PROTECT switch on the front of a drive carrier are long
in the past.
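
One way to combine the send | recv approach with keeping the backup
pool offline except while it is being written is sketched below.  The
pool and dataset names, the replicate function and the DRY_RUN switch
are all assumptions for illustration.  It also assumes the base
snapshot already exists on the receiving side, which an incremental
receive requires.

```sh
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# replicate SRC_DATASET BASE_SNAP NEW_SNAP BACKUP_POOL DST_DATASET
# Import the backup pool, receive an incremental stream into it, and
# export it again so it is only writable for the duration of the transfer.
replicate() {
    src=$1 base=$2 label=$3 pool=$4 dst=$5
    run zpool import "$pool" || return 1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "zfs send -i $src@$base $src@$label | zfs receive $dst"
    else
        zfs send -i "$src@$base" "$src@$label" | zfs receive "$dst"
    fi
    status=$?
    # Export even if the receive failed, then report the receive's status.
    run zpool export "$pool"
    return $status
}
```

Because the pipeline's exit status is that of zfs receive, a failed
receive shows up immediately as a non-zero return.  A scrub of the
backup pool would have to run while the pool is still imported, so it
is left out of this sketch.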

I am also concerned about what happens as volume space grows beyond what
can be saved on "X" devices and the problems associated with that.  I've
long since moved to using disk drives as a catalog for data streams
rather than actual sequential media (e.g. tapes) due to the ridiculous
imbalance in cost between high-capacity DLT-style drives and disks of
equivalent storage, never mind transfer rates.

One of the challenges that I see with ZFS is that it appears that a
bogus block somewhere on a non-redundant medium may block future access
to the entire pool.  I'm not sure if that's actually the case or if you
can read around the error, but if that IS the case it's a serious problem.
UFS doesn't suffer from that; it will return errors on the file(s)
impacted but if you avoid touching those you can read the rest of the
pack and the data on it, assuming the failure is not total.

ZFS doesn't really invalidate the entire pool on one unrecoverable
error, does it?  (The documentation is not at all clear if this is the
case or not.)
>> (Yes, I know, I've been a ZFS resister.... ;-))
> "Resistance is futile." 

You know what happened to the Borg in the end, right? ;-)

-- 
-- Karl Denninger
/The Market Ticker ®/ <http://market-ticker.org>
Cuda Systems LLC