Race in NFS lookup can result in stale namecache entries

Wed Jan 18 23:52:59 UTC 2012

John Baldwin wrote:
> I recently encountered a problem at work with a very stale name cache
> entry. A directory was renamed on one NFS client and a new directory
> was created with the same name. On another NFS client, both the old
> and new pathnames resolved to the old filehandle and stayed that way
> for days. It was only fixed by touching the parent directory which
> forced the "wrong" NFS client to flush name cache entries for the
> directory and repopulate it via LOOKUPs. I eventually figured out the
> race condition that triggered this and was able to reproduce it. (I
> had to hack up the NFS client to do some explicit sleeps to order the
> steps right to trigger the race however. It seems to be very rare in
> practice.) The root cause for the stale entry being trusted is that
> each per-vnode nfsnode structure has a single 'n_ctime' timestamp used
> to validate positive name cache entries. However, if there are
> multiple entries for a single vnode, they were all sharing a single
> timestamp. Assume you have three threads spread across two NFS
> clients (R1 on the client doing the directory rename, and T1 and T2 on
> the "victim" NFS client), and assume that thread S1 represents the NFS
> server and the order it completes requests. Also, assume that $D
> represents a parent directory where the rename occurs and that the
> original directory is named "foo". Finally, assume that F1 is the
> original directory's filehandle, and F2 is the new filehandle.
> Time increases as the graph goes down:
> 
> R1 T1 T2 S1
> ------------- ------------- ------------- ---------------
> LOOKUP "$D/foo"
> (1)
> REPLY (1) "foo" F1
> start reply
> processing
> up to
> extracting
> post-op attrs
> RENAME "$D/foo"
> "$D/baz" (2)
> REPLY (2)
> GETATTR $D
> during lookup
> due to expiry
> (3)
> REPLY (3)
> flush $D name
> cache entries
> due to updated
> timestamp
> LOOKUP "$D/baz"
> (4)
> REPLY (4) "baz" F1
> process reply,
> including
> post-op attrs
> that set F1's
> cached attrs
> to a ctime
> post RENAME
> 
> resume reply finish reply
> processing processing
> including including
> setting F1's setting F1's
> n_ctime and n_ctime and
> adding cache adding cache
> entry entry
> 
> At the end of this, the "victim" NFS client now has two name cache
> entries for "$D/foo" and "$D/baz" that point to the F1 filehandle.
> The n_ctime used to validate these name cache hits in nfs_lookup() is
> already updated to post RENAME, so nfs_lookup() will trust these
> entries until a future change to F1's i-node. Further, "$D"'s local
> attribute cache already reflects the updated ctime post RENAME, so it
> will not flush it's name cache entries until a future change to the
> directory.
> 
> The root problem is that the name cache entry for "foo" was added
> using the wrong ctime. It really should be using the F1 attributes in
> the post-op attributes from the LOOKUP reply, not from F1's local
> attribute cache. However, just changing that is not sufficient.
> There are still races with the calls to cache_enter() and updating
> n_ctime.
> 
> What I concluded is that it would really be far simpler and more
> obvious if the cached timestamps were stored in the namecache entry
> directly rather than having multiple name cache entries validated by
> shared state in the nfsnode. This does mean allowing the name cache
> to hold some filesystem-specific state. However, I felt this was much
> cleaner than adding a lot more complexity to nfs_lookup(). Also, this
> turns out to be fairly non-invasive to implement since nfs_lookup()
> calls cache_lookup() directly, but other filesystems only call it
> indirectly via vfs_cache_lookup(). I considered letting filesystems
> store a void * cookie in the name cache entry and having them provide
> a destructor, etc. However, that would require extra allocations for
> NFS lookups. Instead, I just adjusted the name cache API to
> explicitly allow the filesystem to store a single timestamp in a name
> cache entry by adding a new 'cache_enter_time()' that accepts a struct
> timespec that is copied into the entry. 'cache_enter_time()' also
> saves the current value of 'ticks' in the entry. 'cache_lookup()' is
> modified to add two new arguments used to return the timespec and
> ticks value used for a namecache entry when a hit in the cache occurs.
> 
> One wrinkle with this is that the name cache does not create actual
> entries for ".", and thus it would not store any timestamps for those
> lookups. To fix this I changed the NFS client to explicitly fast-path
> lookups of "." by always returning the current directory as setup by
> cache_lookup() and never bothering to do a LOOKUP or check for stale
> attributes in that case.
> 
> The current patch against 8 is at
> http://www.FreeBSD.org/~jhb/patches/nfs_lookup.patch
> 
> It includes ABI and API compat shims so that it is suitable for
> merging to stable branches. For HEAD I would likely retire the
> cache_lookup_times() name and just change all the callers of
> cache_lookup() (there are only a handful, and nwfs and smbfs might
> benefit from this functionality anyway).
>
It sounds good to me, although I haven`t yet looked at the patch
or thought about it much.

However, (and I think you`re already aware of this) given time clock
resolution etc, as soon as multiple clients start manipulating the
contents of a directory concurrently there is going to be a possibility
of having a stale name cache entry. I think you`ve already mentioned this,
but having a timeout on positive name cache entries like we did for
negative name cache entries, will at least limit the effect of these.

For negative name cache entries, the little test I did showed that name
cache hit was almost as good for a 30-60sec timeout as an infinite timeout.
I suspect something similar might be true for positive name cache entries
and it will be easy to do some measurements once it is coded.

If you would like, I can code up a positive name cache timeout similar
to what you did for the negative name cache entries or would you prefer
to do so?

rick