open() and ESTALE error

Sun Jun 22 04:51:23 PDT 2003

Don,

> When a file is renamed on the server, its file handle remains valid.

Actually I was wrong in my assumption on how names are purged from the
namecache. And I didn't mean an operation with a file opened on the client.
And it actually happens that this sequence of commands will get you ENOENT
(not ESTALE) on the client because of a new lookup in #4:

1. server: echo "1111" > 1
2. client: cat 1
3. server: mv 1 2
4. client: cat 1  <--- ENOENT here

Name cache can be purged by nfs_lookup(), if the latter finds that the
capability numbers doesn't match. In this case, nfs_lookup() will send a
new "lookup" RPC request to the server. Name cache can also be purged from
getnewvnode() and vclean(). Which code does that for the above scenario
it's quite obscure to me. Yes, my knowledge is limited :)

> Here's the output of the script:
> 
> #!/bin/sh -v
> rm -f file1 file2
> ssh -n mousie rm -f file1 file2
> echo foo > file1
> echo bar > file2
> ssh -n mousie cat file1
> foo
> ssh -n mousie cat file2
> bar
> tail -f file1 &
> sleep 1
> foo
> cat file1
> foo
> cat file2
> bar
> ssh -n mousie 'mv file1 tmpfile; mv file2 file1; mv tmpfile file2'
> cat file1
> bar
> cat file2
> foo
> echo baz >> file2
> sleep 1
> baz
> kill $!
> Terminated
> ssh -n mousie cat file1
> bar
> ssh -n mousie cat file2
> foo
> baz
> 
> Notice that immediately after the files are swapped on the server, the
> cat commands on the client are able to immediately detect that the files
> have been interchanged and they open the correct files.  The tail
> command shows that the original handle for file1 remains valid after the
> rename operations and when more data is written to file2 after the
> interchange, the data is appended to the file that was formerly file1.

By the way, what were the values of acregmin/acregmax/acdirmin/acdirmax and
also the value of vfs.nfs.access_cache_timeout in your tests?

I believe, the results of your test patterns heavily depend on the NFS
attributes cache tunables (which happen to affect all cycles of NFS
operation) and on the command execution timing as well. Moreover, I'm
suspect that all this is badly linked with the type and sequence of
operations on both the server and the client. Recall, I was about to fix
just *one* common scenario :)

With different values of acmin/acmax and access_cache_timeout, and manual
operations I was able to achieve the result you consider as "proper" above
and also, the "wrong" effect that you described below.

> And its output:
> 
> #!/bin/sh -v
> rm -f file1 file2
> ssh -n mousie rm -f file1 file2
> echo foo > file1
> echo bar > file2
> ssh -n mousie cat file1
> foo
> ssh -n mousie cat file2
> bar
> sleep 1
> cat file1
> foo
> cat file2
> bar
> ssh -n mousie 'mv file1 file2'
> cat file2
> foo
> cat file1
> cat: file1: No such file or directory
> 
> Even though file2 was unlinked and replaced by file1 on the server, the
> client immediately notices the change and is able to open the proper
> file.

My tests always eventually produce ESTALE for file2 here. However, I suspect
their must be configurations where I won't get ESTALE.

> Conclusion: relying on seeing an ESTALE error to retry is insufficient.
> Depending on how files are manipulated, open() may successfully return a
> descriptor for the wrong file and even enable the contents of that file
> to be overwritten.  The namei()/lookup() code is broken and that's what
> needs to be fixed.

I don't think it's namei()/lookup() which is broken. I'm afraid, the name
and attribute caching logic is somewhat far from ideal. Namecache routines
seem to work fine, they just do actual parsing/lookup of a pathname. Other
functions manipulate with the cached names basing on their understanding
of the cache validity (both namecache and cached dir/file attributes).

I've also done a number of tcpdump's for different test patterns and I
believe, what happens with the cached vnode may depend on the results of
the "access" RPC request to the server.

As I said before, I was not going to fix all the NFS inefficiencies related
to heavily shared file environments. However, I still believe that
open-retry-on-ESTALE *may* help people to avoid certain erratic conditions.
At least, I think that having this functionality switchable with an
additional sysctl variable, *may* help lots of people in the black art of
tuning NFS <attribute> caching. As there are no exact descriptions on how
all of this behaves, people usually have to experiment with their own
certain environments.

Also, I agree it's not the "fix" of everything. And I didn't even mention
I want this to be integrated in the source :)

Actually, I know that it works for what I've been fixing locally and just
asked for technical comments about possible "vnode leakage" and nameidata
initialization which nobody provided yet ;-P

I appreciate *very much* all of the answers, though. Definitely, a food for
thought, but I'm a little bit tired of this issue already :)

Thanks again for your efforts.