Fail-over SAN setup: ZFS, NFS, and ...?

Freddie Cash fjwcash at gmail.com
Wed Jun 24 23:05:55 UTC 2009


[Not exactly sure which ML this belongs on, as it's related to both
clustering and filesystems.  If there's a better spot, let me know and I'll
update the CC:/reply-to.]

We're in the planning stages of building a multi-site, fail-over SAN that
will be used to provide redundant storage for a virtual machine environment.
The layout will be like so:
   [Server Room 1]      .      [Server Room 2]
  -----------------     .    -------------------
                        .
  [storage server]      .     [storage server]
          |             .             |
          |             .             |
   [storage switch]     .      [storage switch]
                 \----fibre----/      |
                        .             |
                        .             |
                        .   [storage aggregator]
                        .             |
                        .             |
                        .     /---[switch]---\
                        .     |       |      |
                        .     |   [VM box]   |
                        .     |       |      |
                        .  [VM box]   |      |
                        .     |       |  [VM box]
                        .     |       |      |
                        .     [network switch]
                        .             |
                        .             |
                        .         [internet]

Server room 1 and server room 2 are on opposite ends of town (about 3 km
apart) with a dedicated, direct fibre link between them.  There will be a set
of VM boxes at each site that use the shared storage and act as fail-over for
each other.  In theory, only one server room would ever be active at a time,
although we may end up migrating VMs between the two sites for maintenance
purposes.

We've got the storage server side of things figured out (5U rackmounts with
24 drive bays, running FreeBSD 7.x and ZFS).  We've got the storage switches
picked out (HP ProCurve 2800 or 2900, depending on whether we go with 1 GbE
or 10 GbE fibre links between them).  We're stuck on the storage aggregator.
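
For reference, each storage server will look roughly like the following
(pool name, vdev layout, and zvol size are just placeholders; the zvol gets
handed to whichever iSCSI target daemon we end up using):

    # one pool across the 24 bays, built from a few raidz2 vdevs (example layout)
    zpool create store raidz2 da0 da1 da2 da3 da4 da5 da6 da7
    zpool add    store raidz2 da8 da9 da10 da11 da12 da13 da14 da15
    zpool add    store raidz2 da16 da17 da18 da19 da20 da21 da22 da23

    # a single zvol that the iSCSI target exports to the aggregator
    zfs create -V 10T store/export0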

For a single aggregator box setup, we'd use FreeBSD 7.x with ZFS.  The
storage servers would each export a single zvol using iSCSI.  The storage
aggregator would use ZFS to create a pool using a mirrored vdev.  To expand
the pool, we put in two more storage servers, and add another mirrored vdev
to the pool.  No biggie.  The storage aggregator then uses NFS and/or iSCSI
to make storage available to the VM boxes.  This is the easy part.
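
To make that concrete, the aggregator side would be something along these
lines (daX being whatever the iSCSI initiator attaches the remote zvols as;
pool and filesystem names are made up):

    # mirror the zvol from Server Room 1 against the one from Server Room 2
    zpool create tank mirror da1 da2

    # growing the pool later: two more storage servers, one more mirrored vdev
    zpool add tank mirror da3 da4

    # carve out a filesystem for the VM boxes and share it over NFS
    # (with nfs_server_enable="YES" and mountd_enable="YES" in rc.conf)
    zfs create tank/vms
    zfs set sharenfs=on tank/vms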

However, we'd like to remove the single point of failure that the storage
aggregator represents by having a duplicate of it running at Server Room 1 as
a live, fail-over spare.  And this is where we're stuck.  Right now, the best
we could do is a cold spare that rsyncs from the live box every X hours/days.
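
That cold-spare sync would be nothing fancier than a cron job on the spare,
roughly like this (the live-aggr host name and paths are made up):

    # pull the exported filesystems from the live aggregator every night at 02:00
    0 2 * * *  rsync -aH --delete live-aggr:/tank/vms/ /tank/vms/

which only gets us a periodic copy, not a live fail-over.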

What can we use to do this?  CARP?  Heartbeat?  ggate?  Should we look at
Linux with DRBD or linux-ha or cluster-nfs or similar?  Perhaps Red Hat
Cluster Suite?  (We'd prefer not to, as storage management then becomes a
nightmare again, requiring mdadm, lvm, and more.)  Would a cluster
filesystem be needed?  AFS or similar?
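
In case it helps frame the question, my (possibly naive) picture of the CARP
piece is just a shared service IP between the two aggregators, along these
lines in rc.conf on FreeBSD 7.x (vhid, password, and addresses are made-up
examples):

    cloned_interfaces="carp0"
    # master; the standby uses the same line with a higher advskew (e.g. 100)
    ifconfig_carp0="vhid 10 pass s3cret advskew 0 10.0.0.10/24"

But CARP only moves the IP around; it does nothing about keeping the two
aggregators' pools in sync, which is the part we really don't know how to
solve.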

We have next to no experience with high-availability / fail-over clustering.
Any pointers to things to read online, or tips, or even "don't do that,
you're insane" comments are greatly appreciated.  :)

Thanks.
-- 
Freddie Cash
fjwcash at gmail.com

