FreeBSD & no single point of failure file service

Wed Mar 20 00:36:17 UTC 2013

On Sat, Mar 16, 2013 at 8:48 PM, Michael DeMan <freebsd at deman.com> wrote:

> I was thinking to maybe test something out like:
>

> #1.  A couple old Dell 2970s head units with LSI cards.
> #2.  One dual-port SAS chassis.
> #3.  Figure out what needs to happen with devd+carp in order for the head
> end units to RELIABLY know when to export/import ZFS and when to advertise
> NFS/iSCSI, etc.
>

I was trying to figure out if it could be tested with a couple of virtual
machines pointed at the same shared disk image. :)

> A couple catches with this of course is that for #3 there could be some
> kind of unexpected heartbeat failure between the two head end units where
> they both decide the other is gone and both become masters - which would
> probably result in catastrophic corruption on the file system.
>

I think you almost need three (or more) participants, rather than two.
Then the participants elect a master, and if you don't have a majority
(e.g. two out of three votes), you didn't win.  Only two of them need to be
connected to the actual disks.  The additional voter(s) could be one or
more consumers of the filesystem services, which would tend to help the
head that the clients can still reach win the master role in a split-brain
scenario.
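
A rough sketch of that majority rule in Python, just to pin down the
arithmetic.  The function name and the idea of counting votes as plain
integers are made up for illustration; how votes actually get collected
(CARP, a small daemon, whatever) is left open.

```python
def wins_election(votes_for_me: int, total_voters: int) -> bool:
    """Return True only with a strict majority of all configured voters.

    total_voters counts every configured participant, reachable or not,
    so an isolated node can never win on its own.
    """
    if total_voters <= 0:
        raise ValueError("total_voters must be positive")
    return votes_for_me > total_voters // 2


# Three voters: two head units plus one client acting as tie-breaker.
print(wins_election(2, 3))  # True
print(wins_election(1, 3))  # False
# With only two voters, a head that can only hear itself never wins.
print(wins_election(1, 2))  # False
```

The point of the two-voter case is that a split leaves both heads without
a majority, which is exactly why a third participant is needed.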

That probably needs to be complicated slightly, as the export/import
process isn't anything like instant.  So if you get that scenario where the
master loses connectivity to the clients but not the FS and you still need
to promote a new master -- or you want to do manual failover for
maintenance reasons -- you do need to make sure "export" finishes before
"import" starts.

You could wait until you've been master for X seconds before starting your
import (where maybe X ~= 30), and the whole world will wait with you.

Another alternative would be some sort of shared permanent storage, like a
non-ZFS partition or drive upon which the master writes a timestamp, and
the slave reads it.  You don't touch the drives until either the timestamp
says it's all clear or the timestamp is X seconds old.  But then you run
into all those goofy shared disk read caching issues, and I'm not at all
sure you can peek at one partition of a SAS drive while another partition
is mounted on another system.  (The alternative would be to dedicate two
drives to that purpose, and using two whole drives to share one 512-byte
sector sounds terribly wasteful.)
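
For what it's worth, the decision rule for that timestamp scheme is small
enough to sketch in Python.  Everything named here is an assumption for
illustration: the ALL_CLEAR sentinel, the 30-second default, and the idea
that the stamp is a plain Unix time.  It also assumes both heads have
reasonably synchronized clocks, and it says nothing about how the stamp is
actually read from the shared sector.

```python
import time
from typing import Optional

STALE_AFTER = 30.0  # the "X seconds" above; the value is an assumption
ALL_CLEAR = 0.0     # hypothetical sentinel the old master writes after export


def safe_to_import(stamp: float, now: Optional[float] = None,
                   stale_after: float = STALE_AFTER) -> bool:
    """Decide whether the standby may touch the drives.

    stamp is whatever the master last wrote: either ALL_CLEAR or the
    time of its most recent heartbeat write.
    """
    if now is None:
        now = time.time()
    if stamp == ALL_CLEAR:
        return True  # old master exported cleanly and said so
    # Otherwise only proceed once the heartbeat has gone stale.
    return (now - stamp) >= stale_after
```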

The third possibility would be to do it without shared storage: a machine
could just broadcast "I'm touching the drives!" every second and a
newly-elected master would have to wait until those messages stop for X
seconds or until it sees "I'm not touching the drives!" before proceeding.
That would be a little less reliable if the newly-elected master had just
rebooted, unless each machine keeps a persistent copy in local storage.
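
The listening side of that could be sketched roughly as below, in Python.
The class, the message strings, and the 30-second quiet period are all
invented for illustration, and it assumes a single peer.  The network
transport and the on-disk persistence are left out; as a conservative
stand-in for the reboot problem, a node that has heard nothing yet waits a
full quiet period from the moment it started listening.

```python
from dataclasses import dataclass
from typing import Optional

QUIET_PERIOD = 30.0  # the "X seconds" above; the value is an assumption


@dataclass
class ShoutTracker:
    """Tracks what the peer last said about the drives (hypothetical)."""
    last_busy_at: Optional[float] = None  # last "I'm touching the drives!"
    released: bool = False                # heard "I'm not touching the drives!"

    def heard(self, message: str, now: float) -> None:
        if message == "TOUCHING":
            self.last_busy_at = now
            self.released = False
        elif message == "NOT_TOUCHING":
            self.released = True

    def may_proceed(self, now: float, started_listening_at: float) -> bool:
        if self.released:
            return True  # peer explicitly let go
        # No release seen: require a full quiet period since the last
        # shout, or since we began listening if we never heard one.
        reference = (self.last_busy_at if self.last_busy_at is not None
                     else started_listening_at)
        return now - reference >= QUIET_PERIOD
```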

In that scheme, you would just have to make sure you started/stopped things
in the right order.

Start:
1. Start greedy shouter.
2. Import ZFS pool.
3. ifup service interface.  (Arguably doesn't even need CARP at this point.)
4. Start NFS/iSCSI.

Stop:
1. Stop NFS/iSCSI.
2. ifdown service interface.
3. Export ZFS pool.
4. Stop greedy shouter.
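
The two lists above are mirror images, which is easy to enforce if each
start step is stored next to its own undo step.  A minimal Python sketch,
with the steps as plain strings and a hypothetical do() callback standing
in for whatever actually runs them; the unwind-on-failure behaviour is my
own addition, not something the lists require.

```python
from typing import Callable, List, Tuple

# (start step, matching stop step), in start order.
STEPS: List[Tuple[str, str]] = [
    ("start greedy shouter", "stop greedy shouter"),
    ("import ZFS pool", "export ZFS pool"),
    ("ifup service interface", "ifdown service interface"),
    ("start NFS/iSCSI", "stop NFS/iSCSI"),
]


def start_order() -> List[str]:
    return [up for up, _ in STEPS]


def stop_order() -> List[str]:
    # Stopping is always the exact reverse of starting.
    return [down for _, down in reversed(STEPS)]


def bring_up(do: Callable[[str], bool]) -> bool:
    """Run the start steps; if one fails, undo the completed ones in reverse."""
    undo_stack: List[str] = []
    for up, down in STEPS:
        if not do(up):
            for undo in reversed(undo_stack):
                do(undo)
            return False
        undo_stack.append(down)
    return True
```

Keeping the pairs together means a step can't be added to one list and
forgotten in the other.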

CARP loses a lot of value because it's not like TCP sessions for NFS or
iSCSI can live migrate between machines anyway, but might still be useful
to make sure the interface IPs have the same MAC address.  Either way, I
think the interface in question should be explicitly marked up/down rather
than utilizing CARP for automatic interface failover.  I don't think it's a
good idea for a service IP to jump to a machine if it's 100% certain that
that machine won't be ready.  That is particularly true in the case of a
previously-down master returning to service alongside a working new master.

Of course the simplest solution of all is just to not implement automated
failover right away.  If the machines are there and configured and there is
24x7 admin, just make sure they always boot up in standby mode and have to
be manually promoted to master.  The time it would take for an admin to log
in to the standby server and type "the_student_is_now_the_master.sh" is
still probably a huge improvement over whatever the present state of
affairs is. :)

That would allow some time to examine real-world failure cases in a bit
more detail, observe the decisions the admin makes about when to fail over,
and maybe come up with a more resilient design that better models those
decisions.

> SuperMicro does have that one chassis that accepts lots of drives and two
> custom motherboards that are linked internally via 10GB - I think ixsystems
> uses that.  So in theory the edge case of the accidental 'master/master'
> configuration is helped by hardware.  By the same token I am skeptical of
> having both head end units in a single chassis.  Pardon me for being
> paranoid.
>

I tried to convince myself "it's OK as long as the only common part is
sheet metal."  But yes, I've seen that and as cool as it looks, it makes me
nervous too.

> The hard work is always in the details, not the design?
>

Too right.

Of course there's a whole other category of problems, like those where ZFS
can run with a failed cache dev but sometimes won't import without it.
Hopefully those types of problems are mostly behind us.  I know I still
read a lot of stuff on this list about ZFS that makes me even more nervous
than putting all my eggs in one sheet metal basket.

Thanks!