HAST + ZFS self healing? Hot spares?

Per von Zweigbergk pvz at itassistans.se
Wed May 18 06:29:51 UTC 2011


I've been investigating HAST as a possible way of adding synchronous replication and failover to a pair of NFS servers backed by ZFS. The servers themselves contain quite a few disks: 20 of them (7200 RPM SAS disks), to be exact, if I didn't lose count again. Plus two quick but small SSDs for the ZIL and two not-as-quick but larger SSDs for L2ARC.

These machines weren't originally designed with synchronous replication in mind - they were designed to be NFS file servers (used as VMware data stores) backed by ZFS. They contain LSI MegaRAID 9260 controllers (as an aside, these were perhaps not the best choice for ZFS since they lack a true JBOD mode; I have worked around this by making single-disk RAID-0 arrays and then using those single-disk arrays to make up the zpool).
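(For reference, the pool today sits directly on those per-disk RAID-0 volumes, which the mfi(4) driver exposes as mfidN. The exact layout isn't important here, but picture something along the lines of

    zpool create tank \
        mirror mfid0 mfid1 \
        mirror mfid2 mfid3 \
        log mirror mfid20 mfid21 \
        cache mfid22 mfid23

with the remaining data disks following the same pattern - device names and vdev layout are illustrative only.)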

Now, I've been considering making an active/passive (or, possibly, active/passive + passive/active) synchronously replicated pair of servers out of these, and my eyes fall on HAST.

My initial thought is to simply create one HAST resource for each corresponding pair of disks and SSDs in servers A and B, and then use these HAST resources to make up the ZFS pool.
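Concretely, I'm picturing something like this in /etc/hast.conf (host names, resource names and device paths are made up, and there would be one such resource per disk and SSD):

    resource disk0 {
            on nfs-a {
                    local /dev/mfid0
                    remote nfs-b
            }
            on nfs-b {
                    local /dev/mfid0
                    remote nfs-a
            }
    }

and then, on whichever node is currently primary, building the pool out of the HAST providers instead of the raw disks, each one going in as a plain vdev with no ZFS-level redundancy:

    zpool create tank \
        /dev/hast/disk0 /dev/hast/disk1 \
        /dev/hast/disk2 /dev/hast/disk3

(and so on for the remaining disks, plus log and cache vdevs for the SSD resources).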

But this raises two questions:

---

1. Hardware failure management. In case of a disk failure, I'm not exactly sure what will happen, but I suspect the single-disk RAID-0 array containing the failed disk will simply fail: it will presumably still exist as a device, but refuse all reads and writes. My understanding is that HAST handles this by routing all I/O to the secondary node if the disk on the primary side dies, or simply by cutting off replication if the disk on the secondary side fails.

I have not seen any "hot spare" mechanism in HAST, but I would think that I could edit the cluster configuration file to manually configure a hot spare in case I receive an alert. Would I have to restart all of hastd to do this, though? Or is it sufficient to bring the resource into init and back into secondary using hastctl?
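In other words, assuming hastd can pick up a changed hast.conf without a full restart, would a sequence like the following do the trick? (The resource name is made up.)

    hastctl role init disk3
    # edit /etc/hast.conf so that the "local" path of disk3 points at the spare disk
    hastctl create disk3
    hastctl role secondary disk3

with the primary then performing a full resynchronization onto the new provider.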

Of course, it may be infinitely simpler to configure spares at the ZFS level instead: keep entire spare HAST resources around and just do a zpool replace, swapping in a whole resource (a pair of disks, one in each server) whenever one of its member disks fails. Still, it would be nice to know what I can reconfigure on the fly with HAST itself.
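That would boil down to something like this (names made up again):

    zpool add tank spare /dev/hast/spare0
    # later, after a member disk of the disk3 resource has died:
    zpool replace tank /dev/hast/disk3 /dev/hast/spare0

which keeps all of the juggling at the ZFS level.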

---

2. ZFS self-healing. As far as I understand it, ZFS does self-healing in the sense that all data is checksummed, and if one disk in a mirror happens to contain corrupted data, ZFS will re-read the same data from the other disk in the mirror and repair the bad copy. I currently don't see any way this could work in a configuration where ZFS is not doing the mirroring itself but is instead running on top of HAST. Am I wrong about this? Or is there some way to achieve the same self-healing effect with HAST?
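(With plain ZFS mirrors this is easy to watch in action - run a scrub and any repaired checksum errors show up in the CKSUM column and the scrub summary:

    zpool scrub tank
    zpool status tank

It's exactly this repair path that I can't see surviving a move to ZFS-on-HAST.)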

---

So, what is it: do I have to give up ZFS's self-healing (one of the really neat features of ZFS) if I go for HAST? Of course, I could mirror the drives across servers with HAST first, and then mirror pairs of HAST resources with a ZFS mirror on top, but that would be wasteful and a little silly. I might even be able to get away with using "copies=2" instead in this scenario. Or I could use raidz on top of the HAST resources, wasting less disk but taking a performance hit.
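To make that concrete (device names made up), the doubly-mirrored variant would be something like

    zpool create tank mirror /dev/hast/disk0 /dev/hast/disk1

storing every block on four physical disks, while the copies=2 variant would keep the plain HAST vdevs and just add

    zfs set copies=2 tank

which still costs half the pool's capacity but at least gives ZFS a second copy to heal from.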

I mean, ideally, ZFS would have a really neat synchronous replication feature built into it. Or ZFS could be HAST-aware and know how to ask HAST for the copy of a block on the remote block device in a HAST mirror when the checksum of the local copy doesn't match. Or HAST itself could keep some kind of block-level checksums and do the self-healing on its own. (That last one would probably be the easiest to implement. The secondary site could even continually read the same data as the primary site, merely to verify the checksums on disk, not to send anything over the wire - it's not like it's doing anything else useful with that untapped read performance.)

So, what's the current state of solving this problem? Is there any work being done in this area? Have I overlooked some technology I might use to achieve this goal?
