HAST + ZFS + NFS + CARP

Jordan Hubbard jkh at ixsystems.com
Fri Jul 1 17:54:30 UTC 2016


> On Jun 30, 2016, at 11:57 AM, Julien Cigar <julien at perdition.city> wrote:
> 
> It would be more than welcome indeed..! I have the feeling that HAST
> isn't that much used (but maybe I am wrong) and it's difficult to find
> information on its reliability and concrete long-term use cases...

This has been a long discussion, so I’m not even sure where the right place to jump in is, but speaking as a storage vendor (FreeNAS) I’ll say that we’ve considered HAST many times and rejected it every time, for multiple reasons:

1. Blocks which ZFS finds to be corrupt (i.e. which fail checksum) get replicated by HAST anyway, since HAST has no idea about checksums: it sits below that layer (see the layering sketch after this list).  This means that both good data and corrupt data are replicated to the other pool.  That isn’t a fatal flaw, but it’s a lot nicer to be replicating only *good* data at a higher layer.

2. When HAST systems go split-brain, it’s apparently hilarious.  I don’t have any production experience with that, so I can’t speak authoritatively about it, but the split-brain scenario has been mentioned by some of the folks working on clustered filesystems (GlusterFS, Ceph, etc), and I can easily imagine how it might cause hilarity: ZFS has no idea its underlying block store is being replicated, and it likes to commit changes in terms of transactions (TXGs), not just individual block writes, so writing a partial TXG (or potentially multiple outstanding TXGs with varying degrees of completion) would Be Bad.

3. HAST only works on a pair of machines in a MASTER/SLAVE relationship, which is pretty ghetto by today’s standards.  HDFS (Hadoop’s filesystem) can do block replication across multiple nodes, as can DRBD (Distributed Replicated Block Device), so chasing HAST seems pretty retro and will immediately set you up for embarrassment when the inevitable “OK, that pair of nodes is fine, but I’d like them both to be active, and I’d also like to add a 3rd node in this one scenario where I want even more fault tolerance - other folks can do that, how about you?” question comes up.
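
To make the layering in point 1 concrete, a minimal HAST-under-ZFS setup looks roughly like this (hostnames, addresses and the disk/resource names are made up for illustration; see hast.conf(5) and the Handbook for the real details):

    # /etc/hast.conf, identical on both nodes
    resource disk0 {
            on nodeA {
                    local /dev/ada1
                    remote 10.0.0.2
            }
            on nodeB {
                    local /dev/ada1
                    remote 10.0.0.1
            }
    }

    # on both nodes: initialize metadata and start hastd
    hastctl create disk0
    service hastd onestart

    # on the node that is to be MASTER
    hastctl role primary disk0
    zpool create tank /dev/hast/disk0   # ZFS only ever sees /dev/hast/disk0

    # on the other node
    hastctl role secondary disk0

ZFS’s checksums live entirely above that /dev/hast/disk0 device, so hastd replicates (and resyncs) whatever blocks it is handed, with no way to tell good data from bad.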

In short, the whole thing sounds kind of MEH, and that’s why we’ve avoided putting any real time or energy into HAST.  DRBD sounds much more interesting, though of course it’s Linux-only; that wouldn’t stop someone else from implementing a similar scheme in a clean-room fashion, of course.

And yes, of course one can layer additional things on top of iSCSI LUNs, just as one can punch through LUNs from older SAN fabrics and put ZFS pools on top of them (been there, done both of those things).  The additional indirection has performance and debugging ramifications of its own, though: when a pool goes sideways, you have additional things in the failure chain to debug.  ZFS really likes to “own the disks” in terms of providing block-level fault tolerance and predictable performance characteristics given specific vdev topologies, and once you start abstracting the disks away from it, making statements about predicted IOPS for the pool becomes something of a “???” exercise.
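
For concreteness, the “ZFS on top of an iSCSI LUN” arrangement looks something like the following on FreeBSD, using ctld on the box exporting the disk and the in-kernel initiator on the pool host (target names, addresses and device names are made up for illustration):

    # /etc/ctl.conf on the exporting box
    portal-group pg0 {
            discovery-auth-group no-authentication
            listen 10.0.0.2
    }
    target iqn.2016-07.org.example:disk1 {
            auth-group no-authentication
            portal-group pg0
            lun 0 {
                    path /dev/ada1
            }
    }

    # on the exporting box
    service ctld onestart

    # on the pool host
    iscsictl -A -p 10.0.0.2 -t iqn.2016-07.org.example:disk1
    # the LUN attaches as a new da(4) device, say da5
    zpool create tank mirror ada1 da5

It works, but every I/O to that vdev now has ctld, the initiator and the network sitting in the latency and failure chain, which is exactly the extra indirection I mean.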

- Jordan


