open() and ESTALE error

Sun Jun 29 18:23:40 PDT 2003

On 22 Jun, Andrey Alekseyev wrote:
> Don,
> 
>> When a file is renamed on the server, its file handle remains valid.
> 
> Actually I was wrong in my assumption on how names are purged from the
> namecache. And I didn't mean an operation with a file opened on the client.
> And it actually happens that this sequence of commands will get you ENOENT
> (not ESTALE) on the client because of a new lookup in #4:
> 
> 1. server: echo "1111" > 1
> 2. client: cat 1
> 3. server: mv 1 2
> 4. client: cat 1  <--- ENOENT here

That's what it is supposed to do, but my testing would seem to indicate
that step 4 could return the file contents for an extended period of
time after the file was renamed on the server.

> Name cache can be purged by nfs_lookup(), if the latter finds that the
> capability numbers doesn't match. In this case, nfs_lookup() will send a
> new "lookup" RPC request to the server. Name cache can also be purged from
> getnewvnode() and vclean(). Which code does that for the above scenario
> it's quite obscure to me. Yes, my knowledge is limited :)

The vpid == newvp->v_id test in nfs_lookup() just detects if the vnode
that the cache entry pointed to was recycled for another use while it
was on the free list.  It doesn't detect whether the inode on the server
was recycled.

When I was thinking about this problem, the solution I came up with was
a lot like the
	if (!VOP_GETATTR(newvp, &vattr, cnp->cn_cred, td)
                            && vattr.va_ctime.tv_sec == VTONFS(newvp)->n_ctime)
code fragment, but I would have done the ctime check on both the target
and the parent directory and only ignored the cache entry if both ctimes
had been updated.  Checking only the target should be more conservative,
though it would be slower because there would be more cases where the
client would have to do the RPC call.

If the file on the server associated with the cached entry on the client
is renamed on the server, its file handle will remain valid, but its
ctime will be updated, so VOP_GETATTR() will succeed, but the ctime
check should be activated and the cache entry purged.

If the file on the server is unlinked or another file mv'ed on top of
it, its file handle should no longer be valid, so the VOP_GETATTR() call
should fail, which should cause the cache entry to be purged and a new
lookup RPC should be done.

What I find interesting is that in order for for open() to fail with the
ESTALE error, the cache entry must be used, which means that this
VOP_GETATTR() call must be succeeding, but for some reason another VOP
call after namei() returns is failing with ESTALE.

>> Here's the output of the script:
>> 
>> #!/bin/sh -v
>> rm -f file1 file2
>> ssh -n mousie rm -f file1 file2
>> echo foo > file1
>> echo bar > file2
>> ssh -n mousie cat file1
>> foo
>> ssh -n mousie cat file2
>> bar
>> tail -f file1 &
>> sleep 1
>> foo
>> cat file1
>> foo
>> cat file2
>> bar
>> ssh -n mousie 'mv file1 tmpfile; mv file2 file1; mv tmpfile file2'
>> cat file1
>> bar
>> cat file2
>> foo
>> echo baz >> file2
>> sleep 1
>> baz
>> kill $!
>> Terminated
>> ssh -n mousie cat file1
>> bar
>> ssh -n mousie cat file2
>> foo
>> baz
>> 
>> Notice that immediately after the files are swapped on the server, the
>> cat commands on the client are able to immediately detect that the files
>> have been interchanged and they open the correct files.  The tail
>> command shows that the original handle for file1 remains valid after the
>> rename operations and when more data is written to file2 after the
>> interchange, the data is appended to the file that was formerly file1.
> 
> By the way, what were the values of acregmin/acregmax/acdirmin/acdirmax and
> also the value of vfs.nfs.access_cache_timeout in your tests?

I'm using the the default values for
acregmin/acregmax/acdirmin/acdirmax.

% sysctl vfs.nfs.access_cache_timeout 
vfs.nfs.access_cache_timeout: 2

> I believe, the results of your test patterns heavily depend on the NFS
> attributes cache tunables (which happen to affect all cycles of NFS
> operation) and on the command execution timing as well. Moreover, I'm
> suspect that all this is badly linked with the type and sequence of
> operations on both the server and the client. Recall, I was about to fix
> just *one* common scenario :)

Some of my test cases waited for 120 seconds after the rename on the
server before attempting access from the client, which should be enough
time for the attribute cache to time out.

> With different values of acmin/acmax and access_cache_timeout, and manual
> operations I was able to achieve the result you consider as "proper" above
> and also, the "wrong" effect that you described below.
> 
>> And its output:
>> 
>> #!/bin/sh -v
>> rm -f file1 file2
>> ssh -n mousie rm -f file1 file2
>> echo foo > file1
>> echo bar > file2
>> ssh -n mousie cat file1
>> foo
>> ssh -n mousie cat file2
>> bar
>> sleep 1
>> cat file1
>> foo
>> cat file2
>> bar
>> ssh -n mousie 'mv file1 file2'
>> cat file2
>> foo
>> cat file1
>> cat: file1: No such file or directory
>> 
>> Even though file2 was unlinked and replaced by file1 on the server, the
>> client immediately notices the change and is able to open the proper
>> file.
> 
> My tests always eventually produce ESTALE for file2 here. However, I suspect
> their must be configurations where I won't get ESTALE.
> 
>> Conclusion: relying on seeing an ESTALE error to retry is insufficient.
>> Depending on how files are manipulated, open() may successfully return a
>> descriptor for the wrong file and even enable the contents of that file
>> to be overwritten.  The namei()/lookup() code is broken and that's what
>> needs to be fixed.
> 
> I don't think it's namei()/lookup() which is broken. I'm afraid, the name
> and attribute caching logic is somewhat far from ideal. Namecache routines
> seem to work fine, they just do actual parsing/lookup of a pathname. Other
> functions manipulate with the cached names basing on their understanding
> of the cache validity (both namecache and cached dir/file attributes).

I think the main problem is namei()/lookup().  They shouldn't be
returning a vnode that is associated with a file handle that points to a
different or non-existent file on the server if the name to handle
association has been invalid for a long period of time.  While it's not
possible to totally enforce coherency, we should be able to do a lot
better.

> I've also done a number of tcpdump's for different test patterns and I
> believe, what happens with the cached vnode may depend on the results of
> the "access" RPC request to the server.

That may be an important clue.  The access cache may be properly
working, but the attribute cache timeout may be broken.

> As I said before, I was not going to fix all the NFS inefficiencies related
> to heavily shared file environments. However, I still believe that
> open-retry-on-ESTALE *may* help people to avoid certain erratic conditions.
> At least, I think that having this functionality switchable with an
> additional sysctl variable, *may* help lots of people in the black art of
> tuning NFS <attribute> caching. As there are no exact descriptions on how
> all of this behaves, people usually have to experiment with their own
> certain environments.
> 
> Also, I agree it's not the "fix" of everything. And I didn't even mention
> I want this to be integrated in the source :)

I don't think we can totally fix the problem, but I would like to see
the source fixed so that people don't have to patch their applications
or their kernel for common usage patterns.

> Actually, I know that it works for what I've been fixing locally and just
> asked for technical comments about possible "vnode leakage" and nameidata
> initialization which nobody provided yet ;-P

I think you're probably ok on the vnode side, but one problem might be
the flags in the struct nameidata.  The lookup code tends to fiddle with
them.  I was also concerned about leaking the cn_pnbuf buffer, but it
looks like it may not get allocated or may get freed in the error case,
since kern_open() don't call NDFREE(&nd, NDF_ONLY_PNBUF) if namei()
fails.

> I appreciate *very much* all of the answers, though. Definitely, a food for
> thought, but I'm a little bit tired of this issue already :)
> 
> Thanks again for your efforts.