Consistent inodes between distinct machines

Attila Nagy bra at fsn.hu
Sat May 3 19:53:44 UTC 2008


On 2008.05.03. 20:51, Bernd Walter wrote:
> On Sat, May 03, 2008 at 08:09:25PM +0200, Attila Nagy wrote:
>   
>> Hello,
>>
>> On 2008.05.03. 14:50, Bernd Walter wrote:
>>     
>>> On Fri, May 02, 2008 at 03:40:11PM -0500, Eric Anderson wrote:
>>>  
>>>       
>>>> On Apr 17, 2008, at 3:43 AM, Attila Nagy wrote:
>>>>         
>>> Nevertheless I think that the UFS/NFS combo is not very good for this
>>> problem.
>>>  
>>>       
>> I don't think so. I need a stable system, and UFS/NFS in FreeBSD is in 
>> that state.
>>     
>
> ZFS is pretty stable as well, although it has some points you need
> to care about and tune.
>   
I have (well, had: I switched one back to UFS) two machines with ZFS, 
one i386 and one amd64. Both kept crashing or freezing, so I don't 
consider ZFS pretty stable ATM. :(

> Haven't thought about this.
> Of course this is a real problem.
> Have you tried the following:
> Setup Server A with all required ZFS filesystems.
> Replicate everything to Server B using dd.
> Then the filesystem ID should be the same on both systems.
> This will not work for newly created filesystems however, and you may
> need to take extra care not to accidentally swap disks between the
> machines, since they have the same disk IDs as well.
> I admit - not very perfect :(
>   
Haven't tried that (but thought of it), because I would need a bunch of 
new filesystems for snapshotting and synchronizing, and I wouldn't like 
to dd tens of gigabytes every time to all of the NFS servers over the 
network.
>> Yes, that's why I thought of this in the first place. But there is 
>> another problem, which hits us today (with the loopbacked image mount) 
>> as well: you have to unmount the image and restart the NFS server (it 
>> can panic the machine otherwise), so we have to flip the active state 
>> from one machine to the other during the sync.
>>     
>
> Of course you have to do this - a readonly mount means no writing, but
> it still caches metadata and does not expect the underlying media to
> change contents, so to stay in sync you have to remount.
>   
I am very well aware of that.
If it worked, I would choose a geom_gate solution with one RW machine 
and many RO ones, with a mirror formed from them.
Of course that's still not perfect, so ZFS's mirroring would be a better 
fit (due to incremental updates).
But sadly, it's not possible (AFAIK, with "standard" methods) to run 
systems like that.
>   
>> The exact process looks like this:
>> - rsync the image to the inactive server
>> - when it's done, remount the image and restart the nfsd
>>     
>
> You also have to sync the image to a different file, since you can't
> pollute the original file with new content while it is mounted.
>   
I have been doing this for years without any ill effects. Of course I 
don't access the filesystem while it's being synced. I'm just too lazy 
to umount it, but you are right, that's the correct way.
> But with proper (IIRC default) options rsync already writes a new
> file and then exchanges it with the old one.
>   
Yes, I use in-place syncing, because I don't have that much space available.
>   
>> Currently I'm experimenting with a silly kernel patch, which replaces 
>> the following arc4random()s with a constant value:
>> ./ffs/ffs_alloc.c:              ip->i_gen = arc4random() / 2 + 1;
>> ./ffs/ffs_alloc.c:              prefcg = arc4random() % fs->fs_ncg;
>> ./ffs/ffs_alloc.c:                      dp2->di_gen = arc4random() / 2 + 1;
>> ./ffs/ffs_vfsops.c:             ip->i_gen = arc4random() / 2 + 1;
>>
>> It seems that this works when I don't use soft updates on the volumes. 
>>     
>
> But it is very fragile and it is there for a good reason.
>   
For a normal filesystem, yes.
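The reason I care about i_gen at all is that the file handle UFS hands 
to nfsd is essentially the filesystem ID plus the inode number plus this 
generation number, so all of them have to come out identical on every 
server that is supposed to answer for the same handle. From memory the 
UFS part of the handle looks roughly like the sketch below (field names 
and types may be slightly off, check the UFS headers for the real 
struct ufid):

#include <stdint.h>

/*
 * Sketch of the UFS NFS file handle payload (struct ufid in the real
 * headers) -- written down from memory, so take the details with a
 * grain of salt.  nfsd wraps this in the opaque handle it gives to
 * clients, which is why inode and generation numbers must match on
 * every server that should serve the same clients.
 */
struct ufid_sketch {
	uint16_t ufid_len;	/* length of this structure */
	uint16_t ufid_pad;	/* padding / alignment */
	uint32_t ufid_ino;	/* inode number */
	uint32_t ufid_gen;	/* generation number (ip->i_gen) */
};

If the generation numbers differ between the two servers, the clients' 
handles simply go stale after a failover, even when the inode numbers 
line up.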
> Namely to distribute the allocated inodes over the media, and since
> AFAIK at least small files have their data allocated near the inode,
> you influence data distribution as well.
> This will very likely lead to lower speed after some usage.
>   
Because these are mostly RO volumes (only RW while updating, which is a 
slow process anyway), used for serving NFS clients, I don't think it 
will matter that much. But I'll see.
Currently this is the best I could come up with.
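Just to make it concrete, the patch is of this shape (an illustrative 
diff only, not the exact change I'm running; the constants are 
arbitrary, the point is merely that they are deterministic, and the 
ffs_vfsops.c call site gets the same treatment):

--- sys/ufs/ffs/ffs_alloc.c.orig
+++ sys/ufs/ffs/ffs_alloc.c
@@
-		ip->i_gen = arc4random() / 2 + 1;
+		ip->i_gen = 1;		/* fixed generation for new inodes */
@@
-		prefcg = arc4random() % fs->fs_ncg;
+		prefcg = 0;		/* always start from cylinder group 0 */
@@
-			dp2->di_gen = arc4random() / 2 + 1;
+			dp2->di_gen = 1;	/* keep the on-disk gen deterministic too */

Of course this only helps as long as both copies are populated in 
exactly the same order, and as mentioned above it only seems to hold 
with soft updates turned off.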

> Honestly, I wouldn't trust that very much.
> Say you use two disk stations with fibre channel, which are connected to
> two hosts.
> Run the disk stations off different power supply rails.
> Then use a solidly constructed single server and have an identical
> machine as a cold (or maybe already booted) standby.
> Use the disk stations to mirror - one half on each station.
> If the host dies you can easily take the service over to the other
> machine by just mounting the disks.
> If you do this with ZFS it even takes care that the original host will
> not automatically mount them, since the host-id for the pool has been
> changed to that of the other host.
> It is not a hot standby like your solution, but talking about service
> failures I would assume this will outperform any hackish solution.
> I see so many people trying to do freaky failover with additional
> complexity and additional failure points, instead of just increasing
> the quality of their hardware.
>
>   
The above servers provide NFS to FreeBSD and Linux netboot clients (the 
clients are at many sites, running the real services behind load 
balancers, BGP anycast routing, whatever you like). The NFS servers here 
exist for rapid deployment (putting some new machines into server pool 
X), centralised management (the configuration and OS changes only have 
to be made in one place), etc.

So I'm not trying to build a highly available general-purpose cluster 
(with NFS), but a highly available NFS server for netbooted clients.
And commercial NASes aren't any better (at least that's what I've seen 
so far); most of them are not shared-nothing systems with affordable, 
reliable multi-site replication capabilities.

