kern/150143: [patch][tmpfs] Source directory vnode can disappear before locking it in tmpfs_rename

Thu Sep 9 05:42:21 UTC 2010

> Date: Wed, 8 Sep 2010 02:04:33 +0300
> From: Gleb Kurtsou <gleb.kurtsou at gmail.com>
> To: Kirk McKusick <mckusick at mckusick.com>
> Cc: Ivan Voras <ivoras at freebsd.org>, freebsd-fs at freebsd.org
> Subject: Re: kern/150143: [patch][tmpfs] Source directory vnode can
>  disappear before locking it in tmpfs_rename
> 
> Hello Kirk,
> 
> I was working on improving the namecache this summer, and I have to
> admit rename was the biggest problem of all, and it still remains.

Rename is very complex and hard to get right. I have made at least
five attempts to implement it and I am not convinced that it is yet right.

> There are several common approaches taken by filesystems.
> 
> UFS locks all vnodes involved in the rename, repeatedly unlocking,
> retrying the vnode locks and checking for races. tmpfs does something
> similar (although its vnode locking is incorrect; I'm going to fix it a
> bit later).
> 
> Some others (like ext2fs and msdosfs, if I'm not mistaken) keep locking
> to a minimum. It seems to work, but honestly I don't see why it can't
> race.
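For readers following along, the "lock, back off, retry" pattern described above can be sketched in userspace C. This is a hypothetical illustration, not the UFS or tmpfs code: `struct fake_vnode`, `lock_pair()` and `unlock_pair()` are invented names, and POSIX mutexes stand in for vnode locks. Because locks are dropped while backing off, a real rename must redo its lookups and re-check for races once both locks are held; the sketch leaves that step out.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-in for a vnode: only its lock is modeled. */
struct fake_vnode {
	pthread_mutex_t lock;
};

/*
 * Lock two nodes without a global lock order: block on one, only *try*
 * the other, and if that fails drop everything and start over from the
 * contended node. A thread never blocks while holding a lock, so two
 * callers passing the nodes in opposite order cannot deadlock, although
 * under heavy contention they may retry several times.
 * Returns the number of back-offs taken.
 */
static int
lock_pair(struct fake_vnode *first, struct fake_vnode *second)
{
	int retries = 0;

	if (first == second) {
		pthread_mutex_lock(&first->lock);
		return 0;
	}
	for (;;) {
		pthread_mutex_lock(&first->lock);
		if (pthread_mutex_trylock(&second->lock) == 0)
			return retries;
		/* Contended: release, then block on the busy node next time. */
		pthread_mutex_unlock(&first->lock);
		retries++;
		sched_yield();
		struct fake_vnode *tmp = first;
		first = second;
		second = tmp;
	}
}

static void
unlock_pair(struct fake_vnode *a, struct fake_vnode *b)
{
	pthread_mutex_unlock(&a->lock);
	if (b != a)
		pthread_mutex_unlock(&b->lock);
}
```

Anything observed before `lock_pair()` returned (a looked-up entry, a parent pointer) may be stale by then, which is exactly the race checking the UFS approach has to repeat.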

Ext2fs (and most others) has a filesystem-wide lock that is held
whenever a rename is in progress, which means that only one rename at
a time can take place. That greatly reduces the set of possible
races that one has to deal with. While this will obviously limit
rename-intensive applications, I don't know of any practical
examples where this serialization matters. I am about ready to
throw in the towel and use this approach for UFS rename.
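A minimal sketch of the filesystem-wide approach, assuming a single mutex per mount that every rename holds for its whole duration. `struct fake_mount`, `fs_rename()` and the worker helpers are hypothetical names, and the real ext2fs code differs; the point is only that the check and update steps of two renames can never interleave.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical per-filesystem state; only the rename lock is modeled. */
struct fake_mount {
	pthread_mutex_t rename_lock;	/* serializes every rename on this fs */
	int renames_done;		/* protected by rename_lock */
};

/* Placeholder for the real work: lookups, ancestor checks, dir updates. */
static int
do_rename_locked(struct fake_mount *mp, const char *from, const char *to)
{
	(void)from;
	(void)to;
	mp->renames_done++;
	return 0;
}

/* Every rename on the filesystem funnels through one lock. */
static int
fs_rename(struct fake_mount *mp, const char *from, const char *to)
{
	int error;

	pthread_mutex_lock(&mp->rename_lock);
	error = do_rename_locked(mp, from, to);
	pthread_mutex_unlock(&mp->rename_lock);
	return error;
}

struct worker_arg {
	struct fake_mount *mp;
	int count;
};

static void *
rename_worker(void *p)
{
	struct worker_arg *arg = p;

	for (int i = 0; i < arg->count; i++)
		fs_rename(arg->mp, "a/b", "a/d/e/b");
	return NULL;
}

/* Runs nthreads workers concurrently; returns renames_done afterwards. */
static int
run_parallel_renames(struct fake_mount *mp, int nthreads, int per_thread)
{
	pthread_t tids[16];
	struct worker_arg arg = { mp, per_thread };

	assert(nthreads <= 16);
	for (int i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, rename_worker, &arg);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	return mp->renames_done;
}
```

Because every rename holds `rename_lock`, the plain counter `renames_done` still comes out exact; the price is that renames on one filesystem never run in parallel.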

> ZFS is somewhat unique in this respect. It uses name locking: it keeps a
> per-directory table of locked file names, i.e. names that can't change
> while they are in the table. That way the destination file can't be
> added during the rename, the source file can't disappear, etc.
> 
> What do you think about name locking approach taken by ZFS? Are there
> any drawbacks you are aware of?
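The name-locking idea can be sketched as a per-directory list of locked names guarded by a mutex and a condition variable. This is a hypothetical illustration, not the ZFS implementation: `struct fake_dir`, `dir_name_lock()` and `dir_name_unlock()` are invented names. A rename would lock the source name in the source directory and the target name in the destination directory before touching either entry.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* One entry per name currently locked in a directory. */
struct name_lock {
	struct name_lock *next;
	char name[];			/* the locked name, NUL-terminated */
};

/* Hypothetical directory: only the name-lock table is modeled. */
struct fake_dir {
	pthread_mutex_t mtx;		/* protects 'locked' */
	pthread_cond_t cv;		/* broadcast when a name is released */
	struct name_lock *locked;	/* names that may not change right now */
};

static struct name_lock *
find_locked(struct fake_dir *dp, const char *name)
{
	for (struct name_lock *nl = dp->locked; nl != NULL; nl = nl->next)
		if (strcmp(nl->name, name) == 0)
			return nl;
	return NULL;
}

/*
 * Lock 'name' in 'dp'. If it is already locked, sleep until it is released
 * (wait != 0) or fail with EBUSY (wait == 0).
 */
static int
dir_name_lock(struct fake_dir *dp, const char *name, int wait)
{
	size_t len = strlen(name) + 1;
	struct name_lock *nl;

	pthread_mutex_lock(&dp->mtx);
	while (find_locked(dp, name) != NULL) {
		if (!wait) {
			pthread_mutex_unlock(&dp->mtx);
			return EBUSY;
		}
		pthread_cond_wait(&dp->cv, &dp->mtx);
	}
	nl = malloc(sizeof(*nl) + len);
	if (nl == NULL) {
		pthread_mutex_unlock(&dp->mtx);
		return ENOMEM;
	}
	memcpy(nl->name, name, len);
	nl->next = dp->locked;
	dp->locked = nl;
	pthread_mutex_unlock(&dp->mtx);
	return 0;
}

/* Remove 'name' from the table and wake any waiters. */
static void
dir_name_unlock(struct fake_dir *dp, const char *name)
{
	struct name_lock **pp, *nl;

	pthread_mutex_lock(&dp->mtx);
	for (pp = &dp->locked; (nl = *pp) != NULL; pp = &nl->next) {
		if (strcmp(nl->name, name) == 0) {
			*pp = nl->next;
			free(nl);
			break;
		}
	}
	pthread_cond_broadcast(&dp->cv);
	pthread_mutex_unlock(&dp->mtx);
}
```

Note that such a table only pins names within one directory; by itself it does nothing to stop another rename from moving a parent directory, which is the concern raised in the reply that follows.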

A parallel rename can still move one of your parents in such a way that
you can end up lopping off a branch of the tree:

		a
	      /  \
	     b    d
	    /      \
	   c        e

rename(b, a/d/e) (moving b under e) & rename(d, a/b/c) (moving d under c)
could end up with b->c->d->e->b in a loop divorced from the tree rooted
at a. The current UFS locking will catch this, but a scheme that merely
tracks names may not. Note that serializing these two renames ensures
that this cannot happen, as the second rename will recognize that it is
about to do something bad because the first one will have finished
before it starts. I have not studied the ZFS solution, so it may in fact
catch this possible race.
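The race can be reproduced deterministically in a small in-memory model. The sketch below uses hypothetical names and is only loosely in the spirit of the ancestor walk UFS performs in ufs_checkpath(); it splits rename into a check phase, which refuses to move a directory underneath itself, and an update phase. Run serially, the second rename fails with EINVAL; if both checks run before either update, both pass and b, c, d and e end up in a cycle no longer reachable from a.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Minimal directory tree: each directory knows only its parent. */
struct node {
	const char *name;
	struct node *parent;		/* NULL for the root */
};

/* Returns 1 if 'anc' is 'dir' itself or one of its ancestors. */
static int
is_ancestor(const struct node *anc, const struct node *dir)
{
	for (const struct node *n = dir; n != NULL; n = n->parent)
		if (n == anc)
			return 1;
	return 0;
}

/* Check phase: moving src under newparent must not put src inside itself. */
static int
rename_check(const struct node *src, const struct node *newparent)
{
	return is_ancestor(src, newparent) ? EINVAL : 0;
}

/* Update phase: reparent src. */
static void
rename_apply(struct node *src, struct node *newparent)
{
	src->parent = newparent;
}

/* Serialized rename: nothing can run between the check and the update. */
static int
rename_serial(struct node *src, struct node *newparent)
{
	int error = rename_check(src, newparent);

	if (error == 0)
		rename_apply(src, newparent);
	return error;
}

/* Bounded walk up from n; returns 1 if it reaches root r (cycle-safe). */
static int
reaches_root(const struct node *n, const struct node *r, int limit)
{
	for (int i = 0; i <= limit && n != NULL; i++, n = n->parent)
		if (n == r)
			return 1;
	return 0;
}

/*
 * Builds a/b/c and a/d/e, then does rename(b -> under e) and
 * rename(d -> under c). serialized != 0 runs them one after the other;
 * serialized == 0 runs both checks before either update (the race).
 * Returns 1 if c is still reachable from a afterwards.
 */
static int
two_renames(int serialized)
{
	struct node a = { "a", NULL };
	struct node b = { "b", &a }, c = { "c", &b };
	struct node d = { "d", &a }, e = { "e", &d };

	if (serialized) {
		(void)rename_serial(&b, &e);	/* b moves under e */
		(void)rename_serial(&d, &c);	/* refused: d is now above c */
	} else {
		int err1 = rename_check(&b, &e);	/* passes */
		int err2 = rename_check(&d, &c);	/* also passes */
		if (err1 == 0)
			rename_apply(&b, &e);
		if (err2 == 0)
			rename_apply(&d, &c);
	}
	return reaches_root(&c, &a, 8);
}
```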

> I was thinking of trying to unify rename locking: either make the UFS
> approach standard, i.e. lock all vnodes outside of rename, or use name
> locking similar to ZFS. The UFS way may not fit well into the existing
> VOP API (it needs extra vnode lookups to check for races); besides,
> vnode locking order remains an important issue. ZFS-style locks may be
> interesting in that they would allow reducing the scope of vnode locks,
> especially if merged with the ongoing work on rangelocks (just a guess).
> 
> Thanks,
> Gleb.

I do think that coming up with a common rename solution would be good.
If the ZFS code catches the known races, then that would be a good one
on which to standardize. Further scrutiny of the current UFS code may
show that we have indeed found all the races. But I fear that the only
implementable solution is to single-thread rename per filesystem.

I have copied Jeff Roberson on this email as he is likely to have
some insight on an optimal solution.

	Kirk McKusick