Options for synchronising filesystems

Sat Sep 24 07:06:54 PDT 2005

Hello,

I was wondering if anyone would care to share their experiences in
synchronising filesystems across a number of nodes in a cluster. I can think
of a number of options, but before changing what I'm doing at the moment I'd
like to see if anyone has good experiences with any of the others.

The application: a clustered webserver. The users' CGIs run in a chroot
environment, and this clearly needs to be identical on every box (otherwise
a CGI running on one box would behave differently when running on another).
Ultimately I'd like to synchronise the host OS on each server too.

Note that it is single-master, multiple-slave filesystem synchronisation
that I'm interested in.

1. Keep a master image on an admin box, and rsync it out to the frontends
-------------------------------------------------------------------------

This is what I'm doing at the moment. Install a master image in
/webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and
rsync it. [Actually I'm exporting it using NFS, and the frontends run rsync
locally when required to update their local copies against the NFS master.]

Disadvantages:

- rsyncing a couple of gigs of data is not particularly fast, even when only
a few files have changed

- if a sysadmin (wrongly) changes a file on a front-end instead of on the
master copy in the admin box, then the change will be lost when the next
rsync occurs. They might think they've fixed a problem, and then (say) 24
hours later their change is wiped. However, if this is a config file, the
fact that the old file has been reinstated might not be noticed until the
daemon is restarted or the box rebooted - maybe months later. This I think
is the biggest fundamental problem.

- files can be added locally and they will remain indefinitely (unless we
use rsync --delete which is a bit scary). If files are added this way, then
a new machine brought into the cluster by rsyncing from the master will not
pick them up.
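
One way to make the last two problems visible before they bite is to report
drift on each frontend before the rsync overwrites it. The following is only
a minimal Python sketch, not something I am running: the function names are
made up for illustration, it compares regular files by content only (it
ignores symlinks, ownership and permissions), and hashing a couple of gigs
will not be any faster than rsync itself.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a regular file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def list_files(root):
    """Return the set of regular-file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Symlinks are skipped; this sketch only looks at real files.
            if os.path.isfile(full) and not os.path.islink(full):
                found.add(os.path.relpath(full, root))
    return found


def find_drift(master_root, local_root):
    """Compare a frontend's copy against the master image.

    Returns (modified, extra, missing) as sorted lists of relative paths:
    modified = present in both but with different content,
    extra    = present only on the frontend,
    missing  = present only on the master.
    """
    master = list_files(master_root)
    local = list_files(local_root)
    extra = sorted(local - master)
    missing = sorted(master - local)
    modified = sorted(
        rel for rel in master & local
        if file_digest(os.path.join(master_root, rel))
        != file_digest(os.path.join(local_root, rel))
    )
    return modified, extra, missing
```

On a frontend this might be called as find_drift("/mnt/master/cgi",
"/webroot/cgi"), where /mnt/master/cgi is a hypothetical mount point for the
NFS master. Anything in "modified" or "extra" is a local change that the
next rsync would either wipe or silently leave behind.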

So, here are the alternatives I'm considering, and I'd welcome any
additional suggestions too.

2. Run the images directly off NFS
----------------------------------

I've had this running before, even the entire O/S, and it works just fine.
However the NFS server itself then becomes a critical
single-point-of-failure: if it has to be rebooted and is out of service for
2 minutes, then the whole cluster is out of service for that time.

I think this is only feasible if I can build a highly-available NFS server,
which really means a pair of boxes serving the same data. Since the system
image is read-only from the point of view of the frontends, this should be
easy enough:

      frontends            frontends
        | | |                | | |
         NFS   ----------->   NFS
       server 1    sync     server 2

As far as I know, NFS clients don't support the idea of failing over from
one server to another, so I'd have to make a server pair which transparently
fails over.

I could make one NFS server take over the other server's IP address using
CARP or VRRP. However, I suspect that the clients might notice. I know that
NFS is 'stateless' in the sense that a server can be rebooted, but for a
client to be redirected from one server to the other, I expect that the
filesystems would have to be *identical*, down to the level of the inode
numbers being the same.
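
To illustrate the concern: a file-level copy creates new files on the
target, so each copy gets whatever inode number the target filesystem hands
out. The Python sketch below (helper names are made up for illustration)
lists the paths whose inode numbers differ between two local trees. It says
nothing about how a particular NFS server builds its file handles; it only
shows that two trees with identical contents need not agree on inode
numbers.

```python
import os


def inode_map(root):
    """Map each relative file path under root to its inode number."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # lstat so that a symlink reports its own inode, not its target's
            result[os.path.relpath(full, root)] = os.lstat(full).st_ino
    return result


def inode_mismatches(root_a, root_b):
    """Return sorted paths present in both trees whose inode numbers differ."""
    a = inode_map(root_a)
    b = inode_map(root_b)
    return sorted(rel for rel in a.keys() & b.keys() if a[rel] != b[rel])
```

Run against a tree and a file-by-file copy of it on the same disk, every
path will show up as a mismatch, which is why I suspect only a block-level
mirror would do.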

If that's true, then rsync between the two NFS servers won't cut it. I was
thinking of perhaps using geom_mirror plus ggated/ggatec to make a
block-identical read-only mirror image on NFS server 2 - this also has the
advantage that any updates are close to instantaneous.

What worries me here is how NFS server 2, which has the mirrored filesystem
mounted read-only, will take to having the data changed under its nose. Does
it for example keep caches of inodes in memory, and what would happen if
those inodes on disk were to change? I guess I can always just unmount and
remount the filesystem on NFS server 2 after each change.

My other concern is about susceptibility to DoS-type attacks: if one
frontend were to go haywire and start hammering the NFS servers really hard,
it could impact on all the other machines in the cluster.

However, the problems of data synchronisation are solved: any change made on
the NFS server is visible identically to all front-ends, and sysadmins can't
make changes on the front-ends because the NFS export is read-only.

3. Use a network distributed filesystem - CODA? AFS?
----------------------------------------------------

If each frontend were to access the filesystem as a read-only network mount,
but have a local copy to work with in the case of disconnected operation,
then the SPOF of an NFS server would be eliminated.

However, I have no experience with CODA, and although it's been in the tree
since 2002, the READMEs don't inspire confidence:

   "It is mostly working, but hasn't been run long enough to be sure all the
   bugs are sorted out. ... This code is not SMP ready"

Also, a local cache is no good if the data you want during disconnected
operation is not in the cache at that time, which I think means this idea is
not actually a very good one.

4. Mount filesystems read-only
------------------------------

On each front-end I could store /webroot/cgi on a filesystem mounted
read-only to prevent tampering (as long as the sysadmin doesn't remount it
read-write of course). That would work reasonably well, except that being
mounted read-only I couldn't use rsync to update it!

It might also work with geom_mirror and ggated/ggatec, except for the issue
I raised before about changing blocks on a filesystem under the nose of a
client who is actively reading from it.

5. Using a filesystem which really is read-only
-----------------------------------------------

Better tamper-protection could be had by keeping data in a filesystem
structure which doesn't support any updates at all - such as cd9660 or
geom_uzip.

The issue here is how to roll out a new version of the data. I could push
out a new filesystem image into a second partition, but it would then be
necessary to unmount the old filesystem and mount the new one in the same
place, and you can't really unmount a filesystem which is in use. So this
would require a reboot.

I was thinking that some symlink trickery might help:

    /webroot/cgi -> /webroot/cgi1
    /webroot/cgi1     # filesystem A mounted here
    /webroot/cgi2     # filesystem B mounted here

It should be possible to unmount /webroot/cgi2, dd in a new image, remount
it, and change the symlink to point to /webroot/cgi2. After a little while,
hopefully all the applications will stop using files in /webroot/cgi1, so
this one can be unmounted and a new one put in its place on the next update.
However this is not guaranteed, especially if there are long-lived processes
using binary images in this partition. You'd still have to stop and restart
all those processes.
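
The symlink change itself can be made atomic, so that no CGI ever sees
/webroot/cgi missing part-way through an update. A minimal Python sketch,
assuming a POSIX system; the function name and the ".new" temporary suffix
are made up for illustration:

```python
import os


def switch_symlink(link_path, new_target):
    """Atomically repoint the symlink at link_path to new_target.

    A new symlink is created under a temporary name in the same directory
    and then renamed over the old one, so there is never a moment at which
    link_path does not exist.
    """
    tmp_path = link_path + ".new"  # hypothetical temporary name
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)  # clear a leftover from an interrupted run
    os.symlink(new_target, tmp_path)
    # rename(2) replaces the old symlink itself; it does not follow it
    os.replace(tmp_path, link_path)
```

After remounting the new image this would be called as
switch_symlink("/webroot/cgi", "/webroot/cgi2"). It only makes the switch
clean for new lookups; processes which already hold files open under
/webroot/cgi1 keep using them, so the unmount problem above remains.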

If reboots were acceptable, then the filesystem image could also be stored
in a ramdisk pulled in via pxeboot. This makes sense especially for geom_uzip
where the data is pre-compressed. However I would still prefer to avoid
frequent reboots if at all possible. Also, whilst a ramdisk might be OK for
the root filesystem, a typical CGI environment (with perl, php, ruby,
python, and loads of libraries) would probably be too large anyway.

6. Journaling filesystem replication
------------------------------------

If the data were stored on a journaling filesystem on the master box, and
the journal logs were distributed out to the slaves, then they would all
have identical filesystem copies and only a minimal amount of data would
need to be pushed out to each machine on each change. (This would be rather
like NetApp's SnapMirror.) However, I'm not aware of any journaling
filesystem for FreeBSD, let alone whether it would support filesystem
replication in this way.

Well, that's what I've come up with so far. I'd be very interested to hear
if people have any other strategies or suggestions, particularly with
practical experience in a clustered/ISP environment.

Regards,

Brian Candler.