Options for synchronising filesystems

Eric Anderson anderson at centtech.com
Mon Sep 26 05:46:16 PDT 2005

filip wuytack wrote:
> Eric Anderson wrote:
>> Brian Candler wrote:
>>> Hello,
>>> I was wondering if anyone would care to share their experiences in
>>> synchronising filesystems across a number of nodes in a cluster. I 
>>> can think
>>> of a number of options, but before changing what I'm doing at the 
>>> moment I'd
>>> like to see if anyone has good experiences with any of the others.
>>> The application: a clustered webserver. The users' CGIs run in a chroot
>>> environment, and these clearly need to be identical (otherwise a CGI 
>>> running
>>> on one box would behave differently when running on a different box).
>>> Ultimately I'd like to synchronise the host OS on each server too.
>>> Note that this is a single-master, multiple-slave type of filesystem
>>> synchronisation I'm interested in.
>>> 1. Keep a master image on an admin box, and rsync it out to the 
>>> frontends
>>> ------------------------------------------------------------------------- 
>>> This is what I'm doing at the moment. Install a master image in
>>> /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and
>>> rsync it. [Actually I'm exporting it using NFS, and the frontends run 
>>> rsync
>>> locally when required to update their local copies against the NFS 
>>> master]
>>> Disadvantages:
>>> - rsyncing a couple of gigs of data is not particularly fast, even 
>>> when only
>>> a few files have changed
>>> - if a sysadmin (wrongly) changes a file on a front-end instead of on 
>>> the
>>> master copy in the admin box, then the change will be lost when the next
>>> rsync occurs. They might think they've fixed a problem, and then 
>>> (say) 24
>>> hours later their change is wiped. However if this is a config file, the
>>> fact that the old file has been reinstated might not be noticed until 
>>> the
>>> daemon is restarted or the box rebooted - maybe months later. This I 
>>> think
>>> is the biggest fundamental problem.
>>> - files can be added locally and they will remain indefinitely 
>>> (unless we
>>> use rsync --delete which is a bit scary). If this is done then adding 
>>> a new
>>> machine into the cluster by rsyncing from the master will not pick up 
>>> these
>>> extra files.
>>> So, here are the alternatives I'm considering, and I'd welcome any
>>> additional suggestions too.
>> Here are a few ideas on this: do multiple rsyncs, one for each 
>> top-level directory.  That might speed up your total rsync process.  
>> Another similar method is using a content revisioning system.  This 
>> is only good for some cases, but something like Subversion might 
>> work OK here.
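For example, something along these lines (an untested sketch; the paths 
and the per-directory split are illustrative, not a tested setup):

```shell
#!/bin/sh
# Sync each top-level directory with its own rsync run, so unchanged
# trees finish quickly and the runs could be parallelized if desired.
# sync_tree SRC DST -- both local paths; SRC could be the NFS-mounted
# master copy.
sync_tree() {
    src=$1; dst=$2
    for d in "$src"/*/; do
        [ -d "$d" ] || continue          # skip if the glob matched nothing
        name=$(basename "$d")
        rsync -a --delete "$src/$name/" "$dst/$name/"
    done
}

# usage: sync_tree /master/cgi /webroot/cgi
```

Note --delete is scoped to one directory at a time here, which limits the 
damage if a sync runs against the wrong tree.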
>>> 2. Run the images directly off NFS
>>> ----------------------------------
>>> I've had this running before, even for the entire O/S, and it works just 
>>> fine.
>>> However the NFS server itself then becomes a critical
>>> single-point-of-failure: if it has to be rebooted and is out of 
>>> service for
>>> 2 minutes, then the whole cluster is out of service for that time.
>>> I think this is only feasible if I can build a highly-available NFS 
>>> server,
>>> which really means a pair of boxes serving the same data. Since the 
>>> system
>>> image is read-only from the point of view of the frontends, this 
>>> should be
>>> easy enough:
>>>       frontends            frontends
>>>         | | |                | | |
>>>          NFS   ----------->   NFS
>>>        server 1    sync     server 2
>>> As far as I know, NFS clients don't support the idea of failing over 
>>> from
>>> one server to another, so I'd have to make a server pair which 
>>> transparently
>>> fails over.
>>> I could make one NFS server take over the other server's IP address 
>>> using
>>> CARP or VRRP. However, I suspect that the clients might notice. I 
>>> know that
>>> NFS is 'stateless' in the sense that a server can be rebooted, but for a
>>> client to be redirected from one server to the other, I expect that 
>>> these
>>> filesystems would have to be *identical*, down to the level of the inode
>>> numbers being the same.
>>> If that's true, then rsync between the two NFS servers won't cut it. 
>>> I was
>>> thinking of perhaps using geom_mirror plus ggated/ggatec to make a
>>> block-identical read-only mirror image on NFS server 2 - this also 
>>> has the
>>> advantage that any updates are close to instantaneous.
>>> What worries me here is how NFS server 2, which has the mirrored 
>>> filesystem
>>> mounted read-only, will take to having the data changed under its 
>>> nose. Does
>>> it for example keep caches of inodes in memory, and what would happen if
>>> those inodes on disk were to change? I guess I can always just 
>>> unmount and
>>> remount the filesystem on NFS server 2 after each change.
>> I've tried doing something similar.  I used fiber attached storage, 
>> and had multiple hosts mounting the same partition.  It seemed as 
>> though when host A mounted the filesystem read-write, and then host B 
>> mounted it read-only, any changes made by host A were not seen by B, 
>> and even remounting did not always bring it up to current state.  I 
>> believe it has to do with the buffer cache and host A's desire to keep 
>> things (like inode changes, block maps, etc) in cache and not write 
>> them to disk. FreeBSD does not currently have a multi-system cache 
>> coherency protocol to distribute that information to other hosts.  
>> This is something I think would be very useful for many people.  I 
>> suppose you could just remount the filesystem when you know a change 
>> has happened, but you still may not see the change.  Maybe mounting 
>> the filesystem on host A with the sync option would help.
>>> My other concern is about susceptibility to DoS-type attacks: if one
>>> frontend were to go haywire and start hammering the NFS servers 
>>> really hard,
>>> it could impact on all the other machines in the cluster.
>>> However, the problems of data synchronisation are solved: any change 
>>> made on
>>> the NFS server is visible identically to all front-ends, and 
>>> sysadmins can't
>>> make changes on the front-ends because the NFS export is read-only.
>> This was my first thought too, and a highly available NFS server is 
>> something any NFS-heavy installation wants (needs).  There are a few 
>> implementations of clustered filesystems out there, but none for 
>> FreeBSD (yet).  What those allow is multiple machines talking to 
>> shared storage with read/write access.  Very handy, but since you 
>> only need read-only access, I think your problem is much simpler, 
>> and you can get away with a lot less.
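On the export side, keeping the frontends locked out is just a read-only 
line in /etc/exports on the server, along these lines (the addresses and 
path are illustrative):

```
# /etc/exports on the NFS server: export the image read-only
# to the frontends' network
/webroot/cgi -ro -network 10.0.0.0 -mask 255.255.255.0
```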
>>> 3. Use a network distributed filesystem - CODA? AFS?
>>> ----------------------------------------------------
>>> If each frontend were to access the filesystem as a read-only network 
>>> mount,
>>> but have a local copy to work with in the case of disconnected 
>>> operation,
>>> then the SPOF of an NFS server would be eliminated.
>>> However, I have no experience with CODA, and although it's been in 
>>> the tree
>>> since 2002, the READMEs don't inspire confidence:
>>>    "It is mostly working, but hasn't been run long enough to be sure 
>>> all the
>>>    bugs are sorted out. ... This code is not SMP ready"
>>> Also, a local cache is no good if the data you want during disconnected
>>> operation is not in the cache at that time, which I think means this 
>>> idea is
>>> not actually a very good one.
>> There is also a port for Coda.  I've been reading about this, and 
>> it's an interesting filesystem, but I'm just not sure of its 
>> usefulness yet.
>>> 4. Mount filesystems read-only
>>> ------------------------------
>>> On each front-end I could store /webroot/cgi on a filesystem mounted
>>> read-only to prevent tampering (as long as the sysadmin doesn't 
>>> remount it
>>> read-write of course). That would work reasonably well, except that 
>>> being
>>> mounted read-only I couldn't use rsync to update it!
>>> It might also work with geom_mirror and ggated/ggatec, except for the 
>>> issue
>>> I raised before about changing blocks on a filesystem under the nose 
>>> of a
>>> client who is actively reading from it.
>> I suppose you could mount r/w only when doing the rsync, then switch 
>> back to read-only once complete.  You should be able to do this 
>> online, without taking the filesystem offline.
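Something like this (untested sketch; the path and the master's rsync 
source are made-up names):

```shell
# Flip the filesystem read-write just long enough to sync, then lock
# it down again.  mount -u updates the flags of an already-mounted
# filesystem in place, so no unmount is needed.
update_webroot() {
    fs=$1; src=$2
    mount -u -o rw "$fs" || return 1        # upgrade to read-write
    rsync -a --delete "$src" "$fs/"         # pull changes from the master
    mount -u -o ro "$fs"                    # downgrade back to read-only
}

# usage: update_webroot /webroot/cgi master:/webroot/cgi/
```

The downgrade back to read-only will fail if anything still has a file 
open for writing there, so it's worth checking the exit status.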
>>> 5. Using a filesystem which really is read-only
>>> -----------------------------------------------
>>> Better tamper-protection could be had by keeping data in a filesystem
>>> structure which doesn't support any updates at all - such as cd9660 or
>>> geom_uzip.
>>> The issue here is how to roll out a new version of the data. I could 
>>> push
>>> out a new filesystem image into a second partition, but it would then be
>>> necessary to unmount the old filesystem and remount the new on the same
>>> place, and you can't really unmount a filesystem which is in use. So 
>>> this
>>> would require a reboot.
>>> I was thinking that some symlink trickery might help:
>>>     /webroot/cgi -> /webroot/cgi1
>>>     /webroot/cgi1     # filesystem A mounted here
>>>     /webroot/cgi2     # filesystem B mounted here
>>> It should be possible to unmount /webroot/cgi2, dd in a new image, 
>>> remount
>>> it, and change the symlink to point to /webroot/cgi2. After a little 
>>> while,
>>> hopefully all the applications will stop using files in 
>>> /webroot/cgi1, so
>>> this one can be unmounted and a new one put in its place on the next 
>>> update.
>>> However this is not guaranteed, especially if there are long-lived 
>>> processes
>>> using binary images in this partition. You'd still have to stop and 
>>> restart
>>> all those processes.
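The flip itself is one command once the new image is dd'd in and mounted 
(a sketch; note FreeBSD's ln(1) spells the flag -h but accepts -n for 
compatibility with other implementations):

```shell
# Repoint the /webroot/cgi symlink at the freshly mounted image.
# Without -n, ln would follow the existing symlink and create the
# new link *inside* the old directory instead of replacing the link.
flip_link() {
    target=$1; link=$2
    ln -sfn "$target" "$link"
}

# usage, after mounting the new image on /webroot/cgi2:
#   flip_link cgi2 /webroot/cgi
```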
>>> If reboots were acceptable, then the filesystem image could also be 
>>> stored
>>> in ramdisk pulled in via pxeboot. This makes sense especially for 
>>> geom_uzip
>>> where the data is pre-compressed. However I would still prefer to avoid
>>> frequent reboots if at all possible. Also, whilst a ramdisk might be 
>>> OK for
>>> the root filesystem, a typical CGI environment (with perl, php, ruby,
>>> python, and loads of libraries) would probably be too large anyway.
>>> 6. Journaling filesystem replication
>>> ------------------------------------
>>> If the data were stored on a journaling filesystem on the master box, 
>>> and
>>> the journal logs were distributed out to the slaves, then they would all
>>> have identical filesystem copies and only a minimal amount of data would
>>> need to be pushed out to each machine on each change. (This would be 
>>> rather
>>> like NetApp and its SnapMirror system).  However I'm not aware 
>>> of any
>>> journaling filesystem for FreeBSD, let alone whether it would support
>>> filesystem replication in this way.
>> There is a project underway for UFSJ (UFS journaling).  Maybe once 
>> it is complete and the bugs are ironed out, one could implement a 
>> journal distribution piece to send the journal updates to multiple 
>> hosts and achieve what you are thinking.  However, that would only 
>> distribute the metadata, not the actual data.
> Have a look at DragonFly BSD for this.  They are working on a 
> journaling filesystem that will do just that.

Do you have a link to some information on this?  I've been looking at 
DragonFly, but I'm having trouble finding good information on what is 
already working, what is planned, etc.


Eric Anderson        Sr. Systems Administrator        Centaur Technology
Anything that works is better than anything that doesn't.

More information about the freebsd-cluster mailing list