Options for synchronising filesystems

Mon Sep 26 04:45:34 PDT 2005

Brian Candler wrote:
> Hello,
> 
> I was wondering if anyone would care to share their experiences in
> synchronising filesystems across a number of nodes in a cluster. I can think
> of a number of options, but before changing what I'm doing at the moment I'd
> like to see if anyone has good experiences with any of the others.
> 
> The application: a clustered webserver. The users' CGIs run in a chroot
> environment, and these clearly need to be identical (otherwise a CGI running
> on one box would behave differently when running on a different box).
> Ultimately I'd like to synchronise the host OS on each server too.
> 
> Note that this is a single-master, multiple-slave type of filesystem
> synchronisation I'm interested in.
> 
> 
> 1. Keep a master image on an admin box, and rsync it out to the frontends
> -------------------------------------------------------------------------
> 
> This is what I'm doing at the moment. Install a master image in
> /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and
> rsync it. [Actually I'm exporting it using NFS, and the frontends run rsync
> locally when required to update their local copies against the NFS master]
> 
> Disadvantages:
> 
> - rsyncing a couple of gigs of data is not particularly fast, even when only
> a few files have changed
> 
> - if a sysadmin (wrongly) changes a file on a front-end instead of on the
> master copy in the admin box, then the change will be lost when the next
> rsync occurs. They might think they've fixed a problem, and then (say) 24
> hours later their change is wiped. However if this is a config file, the
> fact that the old file has been reinstated might not be noticed until the
> daemon is restarted or the box rebooted - maybe months later. This I think
> is the biggest fundamental problem.
> 
> - files can be added locally and they will remain indefinitely (unless we
> use rsync --delete which is a bit scary). If this is done then adding a new
> machine into the cluster by rsyncing from the master will not pick up these
> extra files.
> 
> So, here are the alternatives I'm considering, and I'd welcome any
> additional suggestions too.

Here are a few ideas on this: run multiple rsyncs, one for each
top-level directory.  That might speed up your total rsync process.
A similar approach is to use a revision control system.  That is only
suitable in some cases, but something like Subversion might work OK here.
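
To make the first idea concrete, here is a minimal Python sketch,
assuming the per-directory rsyncs are meant to run concurrently.  The
function names and the worker count are made up for illustration.
Files sitting directly in the top level of the master are not covered,
and whether this is actually faster depends on where the bottleneck is.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def rsync_commands(master, dest):
    """Build one rsync command per top-level directory under master."""
    commands = []
    for entry in sorted(Path(master).iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            # Trailing slashes make rsync copy the directory's contents
            # into the same-named directory on the destination.
            commands.append(
                ["rsync", "-a", f"{entry}/", f"{Path(dest) / entry.name}/"]
            )
    return commands


def sync_in_parallel(master, dest, workers=4):
    """Run the per-directory rsyncs concurrently; return their exit codes."""
    commands = rsync_commands(master, dest)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda cmd: subprocess.run(cmd).returncode, commands)
        return list(results)
```

On a frontend this could be called as
sync_in_parallel("/mnt/master/cgi", "/webroot/cgi"), where
/mnt/master/cgi is a hypothetical mount point for the NFS master; any
non-zero exit code in the returned list means that directory needs
another look.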

> 2. Run the images directly off NFS
> ----------------------------------
> 
> I've had this running before, even the entire O/S, and it works just fine.
> However the NFS server itself then becomes a critical
> single-point-of-failure: if it has to be rebooted and is out of service for
> 2 minutes, then the whole cluster is out of service for that time.
> 
> I think this is only feasible if I can build a highly-available NFS server,
> which really means a pair of boxes serving the same data. Since the system
> image is read-only from the point of view of the frontends, this should be
> easy enough:
> 
>       frontends            frontends
>         | | |                | | |
>          NFS   ----------->   NFS
>        server 1    sync     server 2
> 
> As far as I know, NFS clients don't support the idea of failing over from
> one server to another, so I'd have to make a server pair which transparently
> fails over.
> 
> I could make one NFS server take over the other server's IP address using
> carp or vrrp. However, I suspect that the clients might notice. I know that
> NFS is 'stateless' in the sense that a server can be rebooted, but for a
> client to be redirected from one server to the other, I expect that these
> filesystems would have to be *identical*, down to the level of the inode
> numbers being the same.
> 
> If that's true, then rsync between the two NFS servers won't cut it. I was
> thinking of perhaps using geom_mirror plus ggated/ggatec to make a
> block-identical read-only mirror image on NFS server 2 - this also has the
> advantage that any updates are close to instantaneous.
> 
> What worries me here is how NFS server 2, which has the mirrored filesystem
> mounted read-only, will take to having the data changed under its nose. Does
> it for example keep caches of inodes in memory, and what would happen if
> those inodes on disk were to change? I guess I can always just unmount and
> remount the filesystem on NFS server 2 after each change.

I've tried doing something similar.  I used fiber-attached storage and
had multiple hosts mounting the same partition.  When host A mounted
the filesystem read-write and host B then mounted it read-only, changes
made by host A were not seen by B, and even remounting did not always
bring B up to the current state.  I believe this has to do with the
buffer cache: host A wants to keep things (like inode changes, block
maps, etc.) in cache rather than write them to disk.  FreeBSD does not
currently have a multi-system cache coherency protocol to distribute
that information to other hosts.  This is something I think would be
very useful for many people.  I suppose you could just remount the
filesystem on host B when you know a change has happened, but you
still may not see the change.  Maybe mounting the filesystem on host A
with the sync option would help.

> My other concern is about susceptibility to DoS-type attacks: if one
> frontend were to go haywire and start hammering the NFS servers really hard,
> it could impact on all the other machines in the cluster.
> 
> However, the problems of data synchronisation are solved: any change made on
> the NFS server is visible identically to all front-ends, and sysadmins can't
> make changes on the front-ends because the NFS export is read-only.

This was my first thought too, and a highly available NFS server is
something any NFS-heavy installation wants (needs).  There are a few
implementations of clustered filesystems out there, but none for FreeBSD
(yet).  They allow multiple machines to talk to shared storage with
read/write access.  Very handy, but since you only need read-only
access, I think your problem is much simpler, and you can get away with
a lot less.

> 3. Use a network distributed filesystem - CODA? AFS?
> ----------------------------------------------------
> 
> If each frontend were to access the filesystem as a read-only network mount,
> but have a local copy to work with in the case of disconnected operation,
> then the SPOF of an NFS server would be eliminated.
> 
> However, I have no experience with CODA, and although it's been in the tree
> since 2002, the README's don't inspire confidence:
> 
>    "It is mostly working, but hasn't been run long enough to be sure all the
>    bugs are sorted out. ... This code is not SMP ready"
> 
> Also, a local cache is no good if the data you want during disconnected
> operation is not in the cache at that time, which I think means this idea is
> not actually a very good one.

There is also a port for Coda.  I've been reading about it, and it's
an interesting filesystem, but I'm just not sure of its usefulness yet.

> 4. Mount filesystems read-only
> ------------------------------
> 
> On each front-end I could store /webroot/cgi on a filesystem mounted
> read-only to prevent tampering (as long as the sysadmin doesn't remount it
> read-write of course). That would work reasonably well, except that being
> mounted read-only I couldn't use rsync to update it!
> 
> It might also work with geom_mirror and ggated/ggatec, except for the issue
> I raised before about changing blocks on a filesystem under the nose of a
> client who is actively reading from it.

I suppose you could remount read-write only while doing the rsync, then
switch back to read-only once it completes.  You should be able to do
this without any issues and without taking the filesystem offline.
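
A rough Python sketch of that sequence, assuming FreeBSD's mount(8),
where -u updates an already-mounted filesystem and -w / -r select
read-write / read-only.  The function names and the injectable "run"
parameter are illustrative only, and this has not been tried on a live
cluster.  In particular, the switch back to read-only may be refused if
something still has a file open for writing.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def writable(mountpoint, run=subprocess.check_call):
    """Temporarily remount mountpoint read-write, then back to read-only."""
    # mount -u changes the options of a filesystem that is already mounted.
    run(["mount", "-u", "-w", mountpoint])
    try:
        yield
    finally:
        # Runs even if the sync fails, so the read-write window stays short.
        run(["mount", "-u", "-r", mountpoint])


def update_image(master, mountpoint, run=subprocess.check_call):
    """Sync master into mountpoint while it is briefly writable."""
    with writable(mountpoint, run):
        run(["rsync", "-a", f"{master}/", f"{mountpoint}/"])
```

Run as root, update_image("/mnt/master/cgi", "/webroot/cgi") would issue
the three commands in order; /mnt/master/cgi is a hypothetical location
for the master copy.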

> 5. Using a filesystem which really is read-only
> -----------------------------------------------
> 
> Better tamper-protection could be had by keeping data in a filesystem
> structure which doesn't support any updates at all - such as cd9660 or
> geom_uzip.
> 
> The issue here is how to roll out a new version of the data. I could push
> out a new filesystem image into a second partition, but it would then be
> necessary to unmount the old filesystem and remount the new on the same
> place, and you can't really unmount a filesystem which is in use. So this
> would require a reboot.
> 
> I was thinking that some symlink trickery might help:
> 
>     /webroot/cgi -> /webroot/cgi1
>     /webroot/cgi1     # filesystem A mounted here
>     /webroot/cgi2     # filesystem B mounted here
> 
> It should be possible to unmount /webroot/cgi2, dd in a new image, remount
> it, and change the symlink to point to /webroot/cgi2. After a little while,
> hopefully all the applications will stop using files in /webroot/cgi1, so
> this one can be unmounted and a new one put in its place on the next update.
> However this is not guaranteed, especially if there are long-lived processes
> using binary images in this partition. You'd still have to stop and restart
> all those processes.
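
For the symlink switch itself, one way to avoid a moment where
/webroot/cgi does not exist is to create the new link under a temporary
name and rename it over the old one.  A minimal Python sketch; the
function name and the ".new" suffix are arbitrary choices, and it does
nothing about processes that already hold files open on the old
filesystem.

```python
import os


def switch_symlink(link, new_target):
    """Atomically repoint the symlink at link to new_target."""
    tmp = f"{link}.new"
    if os.path.lexists(tmp):
        os.remove(tmp)  # clear a leftover from an interrupted run
    os.symlink(new_target, tmp)
    # rename(2) replaces the old link in a single step, so a lookup of
    # link sees either the old target or the new one, never a gap.
    os.replace(tmp, link)
```

With the layout above, switch_symlink("/webroot/cgi", "/webroot/cgi2")
would move new lookups over to filesystem B.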
> 
> If reboots were acceptable, then the filesystem image could also be stored
> in ramdisk pulled in via pxeboot. This makes sense especially for geom_uzip
> where the data is pre-compressed. However I would still prefer to avoid
> frequent reboots if at all possible. Also, whilst a ramdisk might be OK for
> the root filesystem, a typical CGI environment (with perl, php, ruby,
> python, and loads of libraries) would probably be too large anyway.
> 
> 
> 6. Journaling filesystem replication
> ------------------------------------
> 
> If the data were stored on a journaling filesystem on the master box, and
> the journal logs were distributed out to the slaves, then they would all
> have identical filesystem copies and only a minimal amount of data would
> need to be pushed out to each machine on each change. (This would be rather
> like NetApps and their snap-mirroring system). However I'm not aware of any
> journaling filesystem for FreeBSD, let alone whether it would support
> filesystem replication in this way.

There is a project underway for UFSJ (UFS journaling).  Maybe once it
is complete and the bugs are ironed out, one could implement a journal
distribution piece to send the journal updates to multiple hosts and
achieve what you are thinking of.  However, that would only distribute
the metadata, not the actual data.

Good luck finding your ultimate solution!

Eric

-- 
------------------------------------------------------------------------
Eric Anderson        Sr. Systems Administrator        Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------