NFS server fail-over - how do you do it?

Mon May 31 10:35:46 PDT 2004

We can live with the chance that a file write might fail as long as we can
switch over to another NFS server if the primary fails. So amd will help us
avoid the "client hung" issue? I will have to take a look. That is the worst
thing of all when it comes to a failed NFS server. You can't even remotely
reboot the NFS client! Someone has to power reset the damn thing. That's
bad.

On Sun, May 30, 2004 at 02:43:37AM -0500, adp wrote:
> I am running a FreeBSD 4.9-REL NFS server. Once every several hours our main
> NFS server replicates everything to a backup FreeBSD NFS server. We are okay
> with the gap in time between replication. What we aren't sure about is how
> to automate the fail-over between the primary to the secondary NFS server.
> This is for a web cluster. Each client mounts several directories from the
> NFS server.
>
> Let's say that our primary NFS server dies and just goes away. What then?
> Are you periodically doing a mount or a file look-up of a mounted filesystem
> to check if your NFS server died? If so are you just unmounting and
> remounting everything using the backup NFS server?
>
> Just curious how this problem is being solved.

If you're mounting those NFS partitions read/write, then there really
isn't a good solution for this problem[1] -- you need your NFS server up
and running 24x7.

If you are NFS mounting those partitions read-only, then you can in
principle construct a fail-over system between those servers.  Some
Unix OSes let you specify a list of servers in fstab(5) (eg. Solaris)
and clients will mount from one or other of them.  Unfortunately you
can't do that with standard NFS mounts under FreeBSD.  You could try
using VRRP -- see the net/freevrrpd port for example -- but I'm not
sure how well that would work if the system failed over in the middle
of an IO transaction.
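
As a rough illustration of the "list of servers" idea above, a client-side
script could pick the first server that answers. This is only a sketch under
stated assumptions: the host names are hypothetical, and it assumes the
servers accept NFS over TCP on port 2049, which will not be true of a
UDP-only setup. It is not something the standard FreeBSD NFS mount does for
you.

```python
import socket


def first_reachable(servers, port=2049, timeout=3.0,
                    connect=socket.create_connection):
    """Return the first host that accepts a TCP connection, or None.

    `servers` is an ordered list of host names, most preferred first.
    Port 2049 (NFS over TCP) is an assumption about the server setup.
    `connect` is injectable so the logic can be exercised without a network.
    """
    for host in servers:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Covers refused connections, timeouts and name lookup failures.
            continue
        conn.close()
        return host
    return None


# Example (hypothetical host names):
#   first_reachable(["nfs1.example.com", "nfs2.example.com"])
```

A successful TCP connect only shows that something is listening on the port.
It says nothing about whether the exports are healthy, so treat the result
as a hint rather than a guarantee.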

In any case -- certainly if your NFS partitions are read/write, but
also for read-only, perhaps the best compromise is to use the
automounter amd(8).  This certainly does help with the 'nightmare
filesystem' scenario, where loss of a server prevents the clients
doing anything, even rebooting cleanly.  You can create a limited and
rudimentary form of failover by using role-based hostnames in your
internal DNS -- eg nfsserv.example.com as a CNAME pointing at your
main server, and then modify the DNS when you need the failover to
occur.  It's a bit clunky and needs manual intervention, but it beats
having nothing at all.
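
To make the DNS step above a little more concrete, here is a small sketch of
the two decisions involved: when to declare the primary dead, and what the
replacement CNAME record would look like. The target name nfs2.example.com,
the 60 second TTL and the three-failures threshold are all assumptions for
illustration. Actually loading the record into your DNS server is left out,
since that depends on how your zone is managed.

```python
def should_fail_over(probe_results, threshold=3):
    """Return True when the last `threshold` probes all failed.

    `probe_results` is a list of booleans, oldest first, where True means
    the primary answered. Requiring several consecutive failures avoids
    switching on a single dropped probe.
    """
    recent = probe_results[-threshold:]
    return len(recent) == threshold and not any(recent)


def cname_record(role, target, ttl=60):
    """Render a zone-file CNAME line pointing the role name at `target`.

    Both names are given without the trailing dot; it is added here so the
    record is fully qualified.
    """
    return f"{role}. {ttl} IN CNAME {target}."


history = [True, True, False, False, False]
if should_fail_over(history):
    print(cname_record("nfsserv.example.com", "nfs2.example.com"))
```

A short TTL on the role name matters here, because resolvers will keep
handing out the old answer until it expires.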

 Cheers,

 Matthew

[1] Well, I assume you haven't got the resources to set up a storage
array with multiple servers accessing the same disk sets.

--
Dr Matthew J Seaman MA, D.Phil.                       26 The Paddocks
                                                      Savill Way
PGP: http://www.infracaninophile.co.uk/pgpkey         Marlow
Tel: +44 1628 476614                                  Bucks., SL7 1TH UK