Musings on ZFS Backup strategies

Sun Mar 3 04:57:47 UTC 2013

On 3/2/2013 10:23 PM, Ben Morrow wrote:
> Quoth Karl Denninger <karl at denninger.net>:
>> Quoth Ben Morrow:
>>> I don't know what medium you're backing up to (does anyone use tape any
>>> more?) but when backing up to disk I much prefer to keep the backup in
>>> the form of a filesystem rather than as 'zfs send' streams. One reason
>>> for this is that I believe that new versions of the ZFS code are more
>>> likely to be able to correctly read old versions of the filesystem than
>>> old versions of the stream format; this may not be correct any more,
>>> though.
>>>
>>> Another reason is that it means I can do 'rolling snapshot' backups. I
>>> do an initial dump like this
>>>
>>>     # zpool is my working pool
>>>     # bakpool is a second pool I am backing up to
>>>
>>>     zfs snapshot -r zpool/fs@dump
>>>     zfs send -R zpool/fs@dump | zfs recv -vFd bakpool
>>>
>>> That pipe can obviously go through ssh or whatever to put the backup on
>>> a different machine. Then to make an increment I roll forward the
>>> snapshot like this
>>>
>>>     zfs rename -r zpool/fs@dump dump-old
>>>     zfs snapshot -r zpool/fs@dump
>>>     zfs send -R -I @dump-old zpool/fs@dump | zfs recv -vFd bakpool
>>>     zfs destroy -r zpool/fs@dump-old
>>>     zfs destroy -r bakpool/fs@dump-old
>>>
>>> (Notice that the increment starts at a snapshot called @dump-old on the
>>> send side but at a snapshot called @dump on the recv side. ZFS can
>>> handle this perfectly well, since it identifies snapshots by GUID, and
>>> will rename the bakpool snapshot as part of the recv.)
>>>
>>> This brings the filesystem on bakpool up to date with the filesystem on
>>> zpool, including all snapshots, but never creates an increment with more
>>> than one backup interval's worth of data in. If you want to keep more
>>> history on the backup pool than the source pool, you can hold off on
>>> destroying the old snapshots, and instead rename them to something
>>> unique. (Of course, you could always give them unique names to start
>>> with, but I find it more convenient not to.)
>> Uh, I see a potential problem here.
>>
>> What if the zfs send | zfs recv command fails for some reason before
>> completion?  I have noted that zfs recv is atomic -- if it fails for any
>> reason the entire receive is rolled back like it never happened.
>>
>> But you then destroy the old snapshot, and the next time this runs the
>> new gets rolled down.  It would appear that there's an increment
>> missing, never to be seen again.
> No, if the recv fails my backup script aborts and doesn't delete the old
> snapshot. Cleanup then means removing the new snapshot and renaming the
> old back on the source zpool; in my case I do this by hand, but it could
> be automated given enough thought. (The names of the snapshots on the
> backup pool don't matter; they will be cleaned up by the next successful
> recv.)
I was concerned that if the one you rolled to "old" gets killed without
the backup being successful then you're screwed, as you've lost the
context.  I presume that zfs recv will properly set a non-zero exit
code if something's wrong (I would hope so!).
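
A minimal sketch of one rolling run that acts on that exit code is
below.  It is an illustration of the abort-and-clean-up behaviour
described above, not Ben's actual script: the SRC and DST variables
and the function names are assumptions made up for the example, and
the cleanup he does by hand is automated here in the simplest way.

```shell
#!/bin/sh
# Sketch only: SRC, DST, roll_back and rolling_backup are hypothetical names.
SRC=${SRC:-zpool/fs}    # dataset being backed up
DST=${DST:-bakpool}     # pool receiving the backup

# Undo the roll on the source so the next run starts again from the
# last snapshot the backup pool is known to have.
roll_back() {
    zfs destroy -r "$SRC@dump"
    zfs rename -r "$SRC@dump-old" dump
}

rolling_backup() {
    zfs rename -r "$SRC@dump" dump-old || return 1
    if ! zfs snapshot -r "$SRC@dump"; then
        zfs rename -r "$SRC@dump-old" dump
        return 1
    fi

    # Plain sh reports only the status of the last command in a pipeline,
    # i.e. zfs recv. The assumption here is that a send which dies
    # mid-stream leaves recv with an incomplete stream that it rejects.
    if zfs send -R -I @dump-old "$SRC@dump" | zfs recv -vFd "$DST"; then
        # Only a successful recv allows the old snapshots to go away.
        zfs destroy -r "$SRC@dump-old"
        zfs destroy -r "$DST/${SRC#*/}@dump-old"
    else
        echo "recv failed; keeping $SRC@dump-old" >&2
        roll_back
        return 1
    fi
}
```

Calling rolling_backup from cron and mailing on a non-zero return would
make a failed run visible rather than silent.
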
>> What gets lost in that circumstance?  Anything changed between the two
>> times -- and silently at that? (yikes!)
> It's impossible to recv an incremental stream on top of the wrong
> snapshot (identified by GUID, not by its current name), so nothing can
> get silently lost. A 'zfs recv -F' will find the correct starting
> snapshot on the destination filesystem (assuming it's there) regardless
> of its name, and roll forward to the state as of the end snapshot. If a
> recv succeeds you can be sure nothing up to that point has been missed.
Ah, ok.  THAT I did not understand.  So the zfs recv process checks what
it's about to apply the delta against, and if it can't find a consistent
place to start it barfs rather than screw you.  That's good.  As long as
it gets caught I can live with it.  Recovery isn't a terrible pain in
the butt so long as it CAN be recovered.  It's the potential for silent
failures that scares the bejeezus out of me for all the obvious reasons.
> The worst that can happen is if you mistakenly delete the snapshot on
> the source pool that marks the end of the last successful recv on the
> backup pool; in that case you have to take an increment from further
> back (which will therefore be a larger incremental stream than it needed
> to be). The very worst case is if you end up without any snapshots in
> common between the source and backup pools, and you have to start again
> with a full dump.
>
> Ben
Got it.

That's not great in that it could force a new "full copy", but it's also
not the end of the world.  In my case I am already automatically taking
daily and 4-hour snaps, keeping a week's worth around, which is more
than enough time to be able to obtain a consistent place to go from. 
That should be ok then.
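
For picking that "consistent place to go from", a rough sketch follows.
It assumes the snapshot lists from both sides have been saved as files
of name<TAB>guid lines, oldest first; the function name and the file
names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: latest_common is a hypothetical helper name.
# Each input file holds "name<TAB>guid" lines, oldest first, e.g. from
#   zfs list -H -d 1 -t snapshot -o name,guid -s creation zpool/fs
# Usage: latest_common BACKUP_LIST SOURCE_LIST
# Prints the newest source snapshot whose GUID also exists on the backup,
# or returns non-zero if the two sides share nothing.
latest_common() {
    awk -F '\t' '
        NR == FNR  { seen[$2] = 1; next }   # first file: backup GUIDs
        $2 in seen { name = $1 }            # second file: keep the last match
        END        { if (name == "") exit 1; print name }
    ' "$1" "$2"
}
```

The name it prints could then be used as the starting point for
zfs send -I, and a non-zero return is the "start again with a full
dump" case.
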

I think I'm going to play with this and see what I think of it.  One
thing that is very attractive about this design is that the receiving
side can be a mirror.  To rotate the vault copy, run a scrub (to ensure
that both members are consistent at a checksum level), break the mirror
and put one member in the vault, replace it with the drive coming FROM
the vault, then do a zpool replace and allow it to resilver onto that
drive.  You now have two copies in a consistent state locally in case
the pool pukes, and one in the vault in the event of a fire or other
"entire facility is toast" event.

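As a sketch, the rotation could look something like the following.
The pool and device names are placeholders, the helper names are made
up, and using zpool offline to "break" the mirror is an assumption on
my part (detach or split may suit better); it defaults to a dry run
that only prints the commands.

```shell
#!/bin/sh
# Sketch only: run and rotate_vault_disk are hypothetical helper names.
# DRY_RUN defaults to 1, so nothing is executed unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Usage: rotate_vault_disk POOL OUTGOING_DEV INCOMING_DEV
rotate_vault_disk() {
    pool=$1 outgoing=$2 incoming=$3

    # 1. Scrub so both mirror members are verified against their checksums.
    run zpool scrub "$pool"
    if [ "${DRY_RUN:-1}" != 1 ]; then
        while zpool status "$pool" | grep -q 'scrub in progress'; do
            sleep 60
        done
    fi

    # 2. Take the vault-bound member out of service.
    run zpool offline "$pool" "$outgoing"

    # 3. Physically swap the drives before going any further.
    echo "Swap $outgoing for $incoming now."
    if [ "${DRY_RUN:-1}" != 1 ]; then
        printf 'Press Enter when the drive from the vault is installed: '
        read -r _
    fi

    # 4. Resilver onto the drive that came back from the vault.
    run zpool replace "$pool" "$outgoing" "$incoming"
}
```

Something like "rotate_vault_disk bakpool da1 da2" prints the plan, and
the zpool status output should be checked by eye before the outgoing
drive actually leaves the building.
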
The only risk that makes me uncomfortable doing this is that the pool is
always active when the system is running.  With UFS backup disks it's
not -- except when being actually written to they're unmounted, and this
materially decreases the risk of an insane adapter scribbling the
drives, since there is no I/O at all going to them unless mounted. 
While the backup pool would be nominally idle, it is probably more
exposed to a potential scribble than the unmounted UFS packs would be.

Both times in my career that I've gotten hosed by this, my operative
theory was that something went wrong in the adapter code and it decided
that cache RAM pages "belonged" to a different disk than they really
belonged to.  That's the only explanation I can come up with that makes
sense; in both cases it resulted in effectively complete destruction of
the data on all mounted drives in the array.

-- 
-- Karl Denninger
/The Market Ticker ®/ <http://market-ticker.org>
Cuda Systems LLC