EBS snapshot backups from a FreeBSD zfs file system: zpool freeze?

Jeremy Chadwick jdc at koitsu.org
Tue Jul 9 00:05:26 UTC 2013


On Mon, Jul 08, 2013 at 03:37:46PM -0700, Freddie Cash wrote:
> On Mon, Jul 8, 2013 at 3:31 PM, Berend de Boer <berend at pobox.com> wrote:
> 
> > >>>>> "Freddie" == Freddie Cash <fjwcash at gmail.com> writes:
> >
> >     Freddie> At which point, it would make more sense taking the
> >     Freddie> discussion upstream to Illumos to find a way to quiesce a
> >     Freddie> ZFS pool in such a way that EBS backups would work.  Once
> >     Freddie> that is done, then it can filter downstream to FreeBSD,
> >     Freddie> Linux, and others.
> >
> > Great tip. Didn't know exactly if the ZFS implementation in FreeBSD
> > was forked or not. I see on their home page about submitting patches
> > :-)
> >
> 
> The FreeBSD implementation of ZFS isn't 100% identical to the Illumos (aka
> "reference") implementation, mainly due to GEOM; however, the FreeBSD ZFS
> maintainers try to keep it at feature parity with Illumos (and even push
> patches upstream that get added to Illumos).
> 
> Same with the Linux implementation of ZFS, although there are more changes
> made to that one to shoehorn it into that wonderful mess they call "a
> storage stack".  :)  There are a handful of features available in the
> ZFS-on-Linux implementation that aren't anywhere else (like "-o ashift="
> for zpool create/add).
> 
> All in all, the ZFS-using OS projects try to stay as close to the Illumos
> version as is reasonable for the OS.
> 
> It certainly would be interesting to have a "zfs freeze" and/or a "zpool
> freeze" (depending on where you want to quiesce things), but it may not
> play into how ZFS works (wanting to have complete control over the block
> devices, meaning no special magic underneath like block-level snapshots).
> :)  Or, it may be the "next great feature" of ZFS.  :)

Well, back to his original statement, quoting:

> On Linux' file systems I can freeze a file system, start the backup of
> all disks, and unfreeze. This freeze usually only takes 100ms or so.

I interpret this statement to mean, on Linux (see the sketch after this
list):

1. Some command is issued at the filesystem level that causes all I/O
operations (read and write) directed to/from that filesystem to block
(wait) indefinitely, and that flushes all pending queued writes to the
disk (on FreeBSD we would call this BIO_FLUSH),

2. Some other command is issued (at the Amazon EBS level, whether it be
done via a web page or via CLI commands on the same Linux box -- though
I don't know how that would work unless the CLI tools are on a
completely separate filesystem), where an EBS snapshot is taken (similar
to a filesystem snapshot, but at the actual storage level).  Possibly,
if this is a Linux command, there's an actual device driver that sits
between the storage layer and EBS which can effectively "halt" or
"control" things in some manner (would not be surprised!  VMs often
offer this) -- I'll call this a "shim",

3. Some command is issued at the filesystem level that releases that
block/wait, and all future I/O requests go through.
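
For concreteness, here's roughly what that sequence looks like on Linux
using fsfreeze(8) from util-linux (which wraps the FIFREEZE/FITHAW
ioctls); the mountpoint and volume ID below are made up, and I'm
assuming the EC2 API tools live somewhere that won't deadlock on the
frozen filesystem:

    # 1. Block all I/O to the filesystem and flush pending writes
    fsfreeze -f /data

    # 2. Take the EBS snapshot while the filesystem is quiescent
    #    (vol-abcd1234 is a placeholder volume ID)
    ec2-create-snapshot vol-abcd1234 -d "backup of /data"

    # 3. Release the block; queued I/O proceeds
    fsfreeze -u /data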

What this means is that "block-level snapshots" are what would be
necessary -- the key here is that pending writes (scheduled to be
written to the disk) need to be flushed, and that all other I/O blocks.
I do not think something like CACHE FLUSH EXT (i.e. the ATA command used
to actually flush disk-level cache to the platters) matters -- whether
the data is "in its cache" or not has no bearing on EBS; it should know
what to do in either case.  All this would be because of what EBS would
require/mandate.

On FreeBSD we don't have the Linux equivalent of #1/#3 -- the layer
where this would ideally be done is at the GEOM level (ex. a "gfreeze"
command that would block all I/O and also issue BIO_FLUSH to ensure
things had been written).

Due to the split between GEOM and filesystems (unrelated things per se),
one would have to issue "gfreeze" on the disks that make up the
filesystem, followed by doing the EBS backup/snapshot, followed by
"gfreeze -u" on all the disks.  Wishful thinking, and very idealistic,
but that's my take on it.
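
Sketched out, the (entirely hypothetical -- no such gfreeze exists
today) sequence for a pool built on da0 through da4 would be:

    # hypothetical: block I/O and issue BIO_FLUSH on each provider
    gfreeze da0 da1 da2 da3 da4

    # snapshot each of the 5 EBS volumes backing those disks
    # (placeholder volume ID; repeat for all 5)
    ec2-create-snapshot vol-00000001

    # hypothetical: release the I/O block
    gfreeze -u da0 da1 da2 da3 da4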

I have no idea how you'd issue this command to select disks without
there being some risk; e.g. with a 5-disk raidz1, you'd issue that
command 5 times (even if via 1 single command, the kernel still has to
iterate over 5 items linearly), which means there's a window where the
filesystem could have successfully written parts of something to only
some of those 5 disks, thus upon an EBS snapshot restore the filesystem
is actually inconsistent (ZFS reporting checksum failures, for example).

I have no idea how such a thing could be accomplished at the filesystem
level (ex. zfs, not zpool), because again BIO_FLUSH is what's needed,
and that would be at the "provider" level (GEOM term) -- I think (kernel
folks please correct me).

I also have no idea how other layers (ex. CAM) would react to such a
"freeze".  Likewise, I worry about userland applications; 100ms is a
nice and convenient number... ;-)

On FreeBSD I think what most folks do is avoid all of the above and use
filesystem snapshots exclusively, either ZFS or UFS, although UFS
snapshots... well... don't get me started.
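
E.g. with ZFS (pool/dataset names made up):

    # point-in-time, atomic snapshot of one dataset
    zfs snapshot tank/data@2013-07-09

    # optionally ship it off-box; zfs send emits a stream that
    # zfs receive can reconstruct elsewhere
    zfs send tank/data@2013-07-09 | ssh backuphost zfs receive backup/data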

Filesystem snapshots are "supposed" to be fast, but how fast depends
greatly on how they're implemented.  Honestly, though, they're what most
people turn to, rather than doing backups at the "block level" (e.g.
EBS).  I've never encountered anything like a "block-level" freeze or
snapshot on bare metal (it would have to be done somehow at the
controller level; SANs have this, I believe, but not the simple HBAs
I've worked with).

One can't even do something like extend sync(8) to somehow issue
BIO_FLUSH, because that guarantees nothing about what happens between
the BIO_FLUSH and the time the snapshot is taken -- more writes could
enter the queue, or maybe enough that the queue fills up and gets
processed right then and there, leading to the same situation.
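
In other words, the naive approach is inherently racy:

    # racy: nothing stops new writes from landing in the window
    # between the flush and the snapshot (volume ID is a placeholder)
    sync
    ec2-create-snapshot vol-abcd1234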

This whole thing is a mess due to the layers of disconnect between all
the pieces (including on Linux -- it just so happens they have some
interesting way, with **very specific filesystems**, to accomplish this
task), and, if you ask me, a complete disconnect from reality between
the "cloud providers" (Amazon, etc.) and how actual storage and
filesystems *work*.  Very naughty assumptions are being made on their
part, unless, of course, there is that "shim" I spoke about.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
