Race in NFS lookup can result in stale namecache entries

John Baldwin jhb at FreeBSD.org
Sat Jan 21 22:12:28 UTC 2012


On 1/21/12 3:12 AM, Kostik Belousov wrote:
> On Thu, Jan 19, 2012 at 11:17:28AM -0500, John Baldwin wrote:
>> On Thursday, January 19, 2012 11:01:56 am Kostik Belousov wrote:
>>> On Thu, Jan 19, 2012 at 10:26:09AM -0500, John Baldwin wrote:
>>>> On Thursday, January 19, 2012 9:06:13 am Kostik Belousov wrote:
>>>>> On Wed, Jan 18, 2012 at 05:07:21PM -0500, John Baldwin wrote:
>>>>> ...
>>>>>> What I concluded is that it would really be far simpler and more
>>>>>> obvious if the cached timestamps were stored in the namecache entry
>>>>>> directly rather than having multiple name cache entries validated by
>>>>>> shared state in the nfsnode.  This does mean allowing the name cache
>>>>>> to hold some filesystem-specific state.  However, I felt this was much
>>>>>> cleaner than adding a lot more complexity to nfs_lookup().  Also, this
>>>>>> turns out to be fairly non-invasive to implement since nfs_lookup()
>>>>>> calls cache_lookup() directly, but other filesystems only call it
>>>>>> indirectly via vfs_cache_lookup().  I considered letting filesystems
>>>>>> store a void * cookie in the name cache entry and having them provide
>>>>>> a destructor, etc.  However, that would require extra allocations for
>>>>>> NFS lookups.  Instead, I just adjusted the name cache API to
>>>>>> explicitly allow the filesystem to store a single timestamp in a name
>>>>>> cache entry by adding a new 'cache_enter_time()' that accepts a struct
>>>>>> timespec that is copied into the entry.  'cache_enter_time()' also
>>>>>> saves the current value of 'ticks' in the entry.  'cache_lookup()' is
>>>>>> modified to add two new arguments used to return the timespec and
>>>>>> ticks value used for a namecache entry when a hit in the cache occurs.
>>>>>>
>>>>>> One wrinkle with this is that the name cache does not create actual
>>>>>> entries for ".", and thus it would not store any timestamps for those
>>>>>> lookups.  To fix this I changed the NFS client to explicitly fast-path
>>>>>> lookups of "." by always returning the current directory as set up by
>>>>>> cache_lookup() and never bothering to do a LOOKUP or check for stale
>>>>>> attributes in that case.
>>>>>>
>>>>>> The current patch against 8 is at
>>>>>> http://www.FreeBSD.org/~jhb/patches/nfs_lookup.patch
>>>>> ...
>>>>>
>>>>> So now you add 8*2+4 bytes to each namecache entry on amd64 unconditionally.
>>>>> The current size of the invariant part of struct namecache on amd64 is
>>>>> 72 bytes, so adding 20 bytes looks slightly excessive. I am not sure about
>>>>> the typical distribution of the namecache nc_name length, so it is not
>>>>> obvious whether the change affects memory usage significantly.
>>>>>
>>>>> A flag could be added to nc_flags to indicate the presence of a timestamp.
>>>>> The timestamps would be conditionally placed after nc_nlen; we could
>>>>> probably use a union to ease the access. Then, the direct dereferences of
>>>>> nc_name would need to be converted to some inline function.
>>>>>
>>>>> I can do this after your patch is committed, if you consider the memory
>>>>> usage saving worth it.
>>>>
>>>> Hmm, if the memory usage really is worrying then I could move to using the
>>>> void * cookie method instead.
>>>
>>> I think the current approach is better than a cookie, which again would be
>>> used only for NFS. With the cookie, you still have 8 bytes for each ncp.
>>> With a union, you do not have the overhead for !NFS.
>>>
>>> The default setup allows for ~300000 vnodes on a not too powerful amd64
>>> machine; with ncsizefactor 2 and 8 bytes for the cookie, that is 4.5MB.
>>> For 20 bytes per ncp, we get 12MB of overhead.
>>
>> Ok.  If you want to tackle the union bits I'm happy to let you do so.  That
>> will at least break up the changes a bit.
>
> Below is my take. The first version of the patch added both small and large
> zones with ts, but later I decided that large does not make sense.
> If wanted, it can be restored easily.

This looks good to me.  I think you are fine with always using the _ts 
structure for the large case.
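
For readers following the thread, here is a rough userland sketch of the
flag-plus-accessor layout discussed above, together with the overhead
arithmetic.  It is not the actual patch.  The names NCF_TS, nc_hdr,
nc_plain, nc_ts, nc_get_name(), nc_alloc() and nc_overhead() are all
hypothetical, and the header is cut down to two fields for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NCF_TS	0x01	/* hypothetical flag: entry carries a timestamp */

/* Fields shared by both variants (stand-in for the real invariant part). */
struct nc_hdr {
	unsigned char	nc_flag;
	unsigned char	nc_nlen;
};

/* Entry without a timestamp: the name directly follows the header. */
struct nc_plain {
	struct nc_hdr	nc_hdr;
	char		nc_name[];
};

/* Entry with a timestamp: timespec and ticks sit before the name. */
struct nc_ts {
	struct nc_hdr	nc_hdr;
	struct timespec	nc_time;
	int		nc_ticks;
	char		nc_name[];
};

/* Accessor that stands in for direct nc_name dereferences. */
static inline char *
nc_get_name(struct nc_hdr *h)
{
	assert(h != NULL);
	if (h->nc_flag & NCF_TS)
		return (((struct nc_ts *)h)->nc_name);
	return (((struct nc_plain *)h)->nc_name);
}

/* Allocate an entry; ts == NULL means "no timestamp needed" (!NFS). */
static struct nc_hdr *
nc_alloc(const char *name, const struct timespec *ts, int ticks)
{
	size_t len = strlen(name);
	struct nc_hdr *h;

	assert(len <= 255);	/* nc_nlen is a single byte in this sketch */
	if (ts != NULL) {
		struct nc_ts *n = malloc(sizeof(*n) + len + 1);

		if (n == NULL)
			return (NULL);
		n->nc_time = *ts;
		n->nc_ticks = ticks;
		h = &n->nc_hdr;
		h->nc_flag = NCF_TS;
	} else {
		struct nc_plain *n = malloc(sizeof(*n) + len + 1);

		if (n == NULL)
			return (NULL);
		h = &n->nc_hdr;
		h->nc_flag = 0;
	}
	h->nc_nlen = (unsigned char)len;
	memcpy(nc_get_name(h), name, len + 1);
	return (h);
}

/* Extra bytes if every entry grows by per_entry bytes. */
static unsigned long
nc_overhead(unsigned long nvnodes, unsigned long ncsizefactor,
    unsigned long per_entry)
{
	return (nvnodes * ncsizefactor * per_entry);
}
```

With the numbers quoted above, nc_overhead(300000, 2, 8) gives 4800000
bytes and nc_overhead(300000, 2, 20) gives 12000000 bytes.  Only entries
allocated with a timestamp pay for the extra fields, which is the point
of the flag.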

-- 
John Baldwin

