Race in NFS lookup can result in stale namecache entries

Thu Jan 19 15:50:49 UTC 2012

On Wednesday, January 18, 2012 6:52:57 pm Rick Macklem wrote:
> John Baldwin wrote:
> > I recently encountered a problem at work with a very stale name cache
> > entry. A directory was renamed on one NFS client and a new directory
> > was created with the same name. On another NFS client, both the old
> > and new pathnames resolved to the old filehandle and stayed that way
> > for days. It was only fixed by touching the parent directory which
> > forced the "wrong" NFS client to flush name cache entries for the
> > directory and repopulate it via LOOKUPs. I eventually figured out the
> > race condition that triggered this and was able to reproduce it. (I
> > had to hack up the NFS client to do some explicit sleeps to order the
> > steps right to trigger the race however. It seems to be very rare in
> > practice.) The root cause for the stale entry being trusted is that
> > each per-vnode nfsnode structure has a single 'n_ctime' timestamp used
> > to validate positive name cache entries. However, if there are
> > multiple entries for a single vnode, they were all sharing a single
> > timestamp. Assume you have three threads spread across two NFS
> > clients (R1 on the client doing the directory rename, and T1 and T2 on
> > the "victim" NFS client), and assume that thread S1 represents the NFS
> > server and the order it completes requests. Also, assume that $D
> > represents a parent directory where the rename occurs and that the
> > original directory is named "foo". Finally, assume that F1 is the
> > original directory's filehandle, and F2 is the new filehandle.
> > Time increases as the graph goes down:
> > 
> > R1 T1 T2 S1
> > ------------- ------------- ------------- ---------------
> > LOOKUP "$D/foo"
> > (1)
> > REPLY (1) "foo" F1
> > start reply
> > processing
> > up to
> > extracting
> > post-op attrs
> > RENAME "$D/foo"
> > "$D/baz" (2)
> > REPLY (2)
> > GETATTR $D
> > during lookup
> > due to expiry
> > (3)
> > REPLY (3)
> > flush $D name
> > cache entries
> > due to updated
> > timestamp
> > LOOKUP "$D/baz"
> > (4)
> > REPLY (4) "baz" F1
> > process reply,
> > including
> > post-op attrs
> > that set F1's
> > cached attrs
> > to a ctime
> > post RENAME
> > 
> > resume reply finish reply
> > processing processing
> > including including
> > setting F1's setting F1's
> > n_ctime and n_ctime and
> > adding cache adding cache
> > entry entry
> > 
> > At the end of this, the "victim" NFS client now has two name cache
> > entries for "$D/foo" and "$D/baz" that point to the F1 filehandle.
> > The n_ctime used to validate these name cache hits in nfs_lookup() is
> > already updated to post RENAME, so nfs_lookup() will trust these
> > entries until a future change to F1's i-node. Further, "$D"'s local
> > attribute cache already reflects the updated ctime post RENAME, so it
> > will not flush it's name cache entries until a future change to the
> > directory.
> > 
> > The root problem is that the name cache entry for "foo" was added
> > using the wrong ctime. It really should be using the F1 attributes in
> > the post-op attributes from the LOOKUP reply, not from F1's local
> > attribute cache. However, just changing that is not sufficient.
> > There are still races with the calls to cache_enter() and updating
> > n_ctime.
> > 
> > What I concluded is that it would really be far simpler and more
> > obvious if the cached timestamps were stored in the namecache entry
> > directly rather than having multiple name cache entries validated by
> > shared state in the nfsnode. This does mean allowing the name cache
> > to hold some filesystem-specific state. However, I felt this was much
> > cleaner than adding a lot more complexity to nfs_lookup(). Also, this
> > turns out to be fairly non-invasive to implement since nfs_lookup()
> > calls cache_lookup() directly, but other filesystems only call it
> > indirectly via vfs_cache_lookup(). I considered letting filesystems
> > store a void * cookie in the name cache entry and having them provide
> > a destructor, etc. However, that would require extra allocations for
> > NFS lookups. Instead, I just adjusted the name cache API to
> > explicitly allow the filesystem to store a single timestamp in a name
> > cache entry by adding a new 'cache_enter_time()' that accepts a struct
> > timespec that is copied into the entry. 'cache_enter_time()' also
> > saves the current value of 'ticks' in the entry. 'cache_lookup()' is
> > modified to add two new arguments used to return the timespec and
> > ticks value used for a namecache entry when a hit in the cache occurs.
> > 
> > One wrinkle with this is that the name cache does not create actual
> > entries for ".", and thus it would not store any timestamps for those
> > lookups. To fix this I changed the NFS client to explicitly fast-path
> > lookups of "." by always returning the current directory as setup by
> > cache_lookup() and never bothering to do a LOOKUP or check for stale
> > attributes in that case.
> > 
> > The current patch against 8 is at
> > http://www.FreeBSD.org/~jhb/patches/nfs_lookup.patch
> > 
> > It includes ABI and API compat shims so that it is suitable for
> > merging to stable branches. For HEAD I would likely retire the
> > cache_lookup_times() name and just change all the callers of
> > cache_lookup() (there are only a handful, and nwfs and smbfs might
> > benefit from this functionality anyway).
> >
> It sounds good to me, although I haven`t yet looked at the patch
> or thought about it much.
> 
> However, (and I think you`re already aware of this) given time clock
> resolution etc, as soon as multiple clients start manipulating the
> contents of a directory concurrently there is going to be a possibility
> of having a stale name cache entry. I think you`ve already mentioned this,
> but having a timeout on positive name cache entries like we did for
> negative name cache entries, will at least limit the effect of these.
> 
> For negative name cache entries, the little test I did showed that name
> cache hit was almost as good for a 30-60sec timeout as an infinite timeout.
> I suspect something similar might be true for positive name cache entries
> and it will be easy to do some measurements once it is coded.
> 
> If you would like, I can code up a positive name cache timeout similar
> to what you did for the negative name cache entries or would you prefer
> to do so?

I already have a patch for that (though it is not relevant to this change).
It will be easy to update it though once this change is made (actually, it
will be simpler since it can re-use ncticks rather than adding a new 
n_ctime_ticks field to the nfsnode).  I had done it by adding a new 
'nametimeo' mount option that defaulted to 60 seconds.

-- 
John Baldwin