FreeBSD & no single point of failure file service

Michael DeMan freebsd at deman.com
Sun Mar 17 00:57:49 UTC 2013


Hi David,

We are looking at the exact same thing -  let me know what you find out.

I think it is pretty obvious that ixsystems.com has this figured out, along with all the tricky details - but for the particular company I am looking to implement this for, vendors that won't show pricing for their products are vendors we have to stay away from, because not showing pricing usually means it starts at $100K minimum plus giant annual support fees.  In all honesty some kind of 3rd-party-designed solution with only minimal support would be fine for us, but I don't think that is their regular market.

I was thinking of maybe testing out something like:

#1.  A couple of old Dell 2970 head units with LSI cards.
#2.  One dual-port SAS JBOD chassis.
#3.  Figure out what needs to happen with devd+carp in order for the head end units to reliably know when to export/import the ZFS pool and when to advertise NFS/iSCSI, etc. (a rough devd.conf sketch follows below).
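For #3, something like the following is what I had in mind on the devd side.  This is only a rough sketch: the vhid (1), the interface (em0), the file path and the script names are all made up, and it assumes the reworked CARP that reports state changes to devd as "CARP" events - with the older carp(4) pseudo-interface I believe you would match on its link state instead.

  # hypothetical /usr/local/etc/devd/carp-failover.conf
  notify 30 {
          match "system"          "CARP";
          match "subsystem"       "1@em0";        # vhid 1 on em0 (assumed)
          match "type"            "MASTER";
          # script confirms peer state out-of-band, then imports the pool
          # and starts NFS/iSCSI - see the sketch further down
          action "/usr/local/sbin/storage-takeover.sh";
  };

  notify 30 {
          match "system"          "CARP";
          match "subsystem"       "1@em0";
          match "type"            "BACKUP";
          # script stops the services and exports the pool
          action "/usr/local/sbin/storage-release.sh";
  };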

One catch with this, of course, is that for #3 there could be some kind of unexpected heartbeat failure between the two head end units where each decides the other is gone and both become masters - which would probably result in catastrophic corruption of the file system.

SuperMicro does have that one chassis that takes a lot of drives plus two custom motherboards linked internally via 10GbE - I think ixsystems uses that.  So in theory the accidental 'master/master' edge case is mitigated by the hardware.  By the same token I am skeptical of having both head end units in a single chassis.  Pardon me for being paranoid.

So the conclusion I came to for #3 in a home-brew design was that devd+carp is great overall, but there needs to be an additional out-of-band confirmation channel between the two head end units.


The scenario is:

Hardware as in #1 and #2 above.

The head units are wired up so that they both provide storage and run CARP (the FreeBSD analogue of HSRP/VRRP) on the main link over which they vend their storage services to the network.

They are also connected via another channel - this could be a crossover Ethernet link or a serial cable, or in my case simply re-use the dedicated Ethernet port that already provides out-of-band, management-only access to the servers.
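Concretely, the interface layout I am picturing looks something like this in rc.conf - addresses, interface names, vhid and password are all made up, and this is the new-style CARP syntax (the old carp(4) pseudo-interface is configured differently):

  # storage-facing link; vhid 1 carries the shared service address the
  # NFS/iSCSI clients actually point at
  ifconfig_em0="inet 10.0.0.11/24"
  ifconfig_em0_alias0="inet vhid 1 pass s3kr1t alias 10.0.0.10/32"

  # dedicated management port, re-used as the out-of-band confirmation
  # channel between the two heads - no CARP here, just a point-to-point net
  ifconfig_em1="inet 192.168.255.1/30"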

If a network engineer comes along and tweaks the NFS/iSCSI switches or something else, makes a mistake, and the link between the two head end units is broken - won't both machines want to become master and write directly to whatever shared physical storage they have?

This is where the additional link between the head units comes in.  The storage delivery side now has 'split brain': the head end units cannot talk to each other, but may still be able to talk to some (or all) of the clients that use their services.  With the current ZFS v28 design there can be only one head using the physically attached storage at a time - otherwise a small problem that would have been better handled by just taking an outage turns into a potential loss of all the data, everywhere.
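Worth noting: ZFS itself only gives partial protection here.  A plain import on the standby is refused while the pool still looks in use by the other head (the hostid check), but -f overrides that - and -f is exactly the step a takeover script has to gate carefully.  Pool name below is hypothetical:

  # on the standby head while the peer still has the pool imported:
  zpool import tank      # refused while the pool looks in use by the peer;
                         # ZFS tells you to use -f to import anyway
  zpool import -f tank   # forces it; only safe after the out-of-band
                         # confirmation described above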

So basically failover between the head units works as follows (a rough shell sketch follows the list):

A) I am the secondary on the big storage Ethernet link and the primary has timed out on telling me it is still alive.
B) Confirm over the out-of-band link whether the primary is still up, and what it thinks the state of affairs is.  (Optimize by starting this check the first time a primary heartbeat is missed, not only after the full timeout?)
C) If the primary thinks it has lost connectivity to the clients, then confirm it is also no longer acting as primary for the physical storage; only then should I attach the storage and try to become the primary.
D) ??? If the primary thinks it can still reach the clients, then what?
E) Following on from (C) above - let's be sure to avoid a flapping situation.
F) No matter what, if it cannot be determined which head end unit should be the 'master' (vending NFS/iSCSI and also handling the physical storage), then both units should deny services?
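To make A-F a little more concrete, here is a back-of-the-envelope version of the takeover script.  Everything in it - peer address, pool name, service names - is made up, and it assumes key-based ssh between the heads over the management link:

  #!/bin/sh
  # storage-takeover.sh - rough sketch, run by devd when CARP flips to MASTER
  PEER_OOB=192.168.255.2    # peer's address on the out-of-band link
  POOL=tank

  # (B) ask the peer over the out-of-band link whether it still has the pool
  ANSWER=$(ssh -o ConnectTimeout=5 root@${PEER_OOB} \
      "zpool list ${POOL} >/dev/null 2>&1 && echo owned || echo released" \
      2>/dev/null)

  case "${ANSWER}" in
  owned)
      # (D) peer is alive and still using the disks: stand down, stay secondary
      logger "takeover aborted: peer still owns ${POOL}"
      exit 0
      ;;
  released)
      # (C) peer confirms it has let go of the pool: take over and start serving
      zpool import -f ${POOL} || exit 1
      service nfsd onestart        # plus iSCSI target, exports, etc.
      ;;
  *)
      # (F) no answer on either link, so the peer's state is unknown: refuse to
      # force the import and page a human instead of risking a second writer on
      # the shared shelves.  (E) would also want some damping before any retry.
      logger "takeover refused: peer state unknown, manual intervention needed"
      exit 1
      ;;
  esac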


This turned into a longer e-mail than I expected.  Thanks for the post - it made me think about things.  There are probably huge holes in my synopsis above.  The hard work is always in the details, not the design?
- Mike

On Mar 9, 2013, at 3:40 PM, J David <j.david.lists at gmail.com> wrote:

> Hello,
> 
> I would like to build a file server with no single point of failure, and I
> would like to use FreeBSD and ZFS to do it.
> 
> The hardware configuration we're looking at would be two servers with 4x
> SAS connectors and two SAS JBOD shelves.  Both servers would have dual
> connections to each shelf.
> 
> The disks would be configured in mirrored pairs, with one disk from each
> pair in each shelf.  One pair for ZIL, one or two pairs for L2ARC, and the
> rest for ZFS data.
> 
> We would be shooting for an active/standby configuration where the standby
> system is booted up but doesn't touch the bus unless/until it detects CARP
> failover from the master via devd, then it does a zpool import.  (Even so
> all TCP sessions for NFS and iSCSI will get reset, which seems unavoidable
> but recoverable.)
> 
> This will be really expensive to test, so I would be very interested if
> anyone has feedback on how FreeBSD will handle this type of shared-SAS
> hardware configuration.
> 
> Thanks for any advice!
> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"


