ESTALE after cwd deleted by same NFS client

Wed Dec 21 00:28:10 UTC 2016

Don Lewis wrote:

On 19 Dec, Rick Macklem wrote:
> Colin Percival wrote:
>
>>On 12/16/16 12:14, Colin Percival wrote:
>>> making this change in nfs_lookup
>>>> --- sys/fs/nfsclient/nfs_clvnops.c      (revision 310132)
>>>> +++ sys/fs/nfsclient/nfs_clvnops.c      (working copy)
>>>> @@ -1144,7 +1144,7 @@
>>>>                         *vpp = NULLVP;
>>>>                 }
>>>>
>>>> -               if (error != ENOENT) {
>>>> +               if (error != ENOENT && error != ESTALE) {
>>>>                         if (NFS_ISV4(dvp))
>>>>                                 error = nfscl_maperr(td, error, (uid_t)0,
>>>>                                     (gid_t)0);
>>>
>>> fixes the case I described above (for some definition of "fixes" -- I'm not
>>> sure if this is the correct way of handling ESTALE here?) but I'm still seeing
>>> ESTALEs from buildworld's cleandir so I think there must be some other place
>>> where something odd is happening.
>>
>>Further information: In addition to the "lookup relative to a directory which
>>has been deleted out from underneath us" case which causes ESTALE to land in
>>nfs_lookup, the cleandir step of buildworld results in ESTALE being returned
>>by nfsrpc_getattr into nfs_getattr (landing ultimately in getcwd), and ESTALE
>>being returned by nfsrpc_accessrpc into nfs34_access_otw (landing ultimately
>>in stat and lstat).
>>
>>In UFS there are checks for effnlink == 0 which result in e.g. ufs_lookup
>>returning ENOENT; would it make sense to add NREMOVED to struct nfsnode.n_flag
>>and check this in the appropriate nfs_* calls?
> To be honest, I can't think of a reason why userland would ever want to see ESTALE?
> The function you see above "nfscl_maperr()" could easily map all ESTALEs to ENOENTs?
> - The question is: "Would returning ENOENT for stat(2) and access(2) actually make the
>    buildworld happy?
>    if Yes
>           - then mapping ESTALE->ENOENT makes sense for most/all VOP_xxx() calls.
>    else
>            - I don't see any point in returning a different error and there might be some
>              code out there somewhere that depends on seeing ESTALE (I doubt it, but???).
>
> The real problem here is that the directory has a reference count on it when it is
> rmdir'd. POSIX file systems keep the data until the reference count goes to 0, but
> NFS isn't POSIX.
> --> The cheat for regular files is "sillyrename". This could be done for directories,
>       but there are multiple comments in the code (not put there by me) that say
>       "no sillyrename for directories".
>      #1 Does this imply something breaks when you do sillyrename for dirs?
>      OR
>       #2  Does it mean no one has bothered to implement it?
> Since implementing it would have been pretty easy, I have to suspect #1, which
> means I would be reluctant to do it, at least by default.
> --> Maybe I'll send you a patch that does sillyrename for dirs which you can test.
>       If it fixes buildworld, then it could be considered for head as a non-default option?
>
>Sillyrename makes sense for files because you can continue to read and
>write to a file after it is unlinked, until it is closed and it totally
>vanishes.  It doesn't make sense for directories since you can't rmdir()
>a directory until it is empty.  Once it is rmdir()ed, then you can't
>create any new entries in the directory because that would mean that you
>are creating a disconnnected subtree in the filesystem.
Good points. I just tried this ("rm -r" while another process's home dir is in the subtree)
and the process can still "ls" (gets no entries), but can't "cd .." because that doesn't
exist and no process can "cd" into the subtree.
--> I can't think of an easy way to make this happen for the NFS client.

My hunch is that the failure in "buildworld" will just have to be a "sorry, but an NFS
client isn't POSIX compliant, so this doesn't work" case.
Fyi, the only way to safely delete all entries in an NFS mounted dir is:
     opendir()
     while (readdir() != EOF) {
          unlink()  the first dir entry acquired by the readdir()
          closedir()
          opendir()
     }
So that it is always doing a readdir()/unlink() of the first entry in the directory
until the directory is empty.
This has always been the case, so I am surprised that "rm -r subdir" usually works.;-)
- This is because there is no guarantee that directory offset cookies won't change when
   an entry in the directory is removed.
   - It happens that UFS like file systems did have this property for a long time.
     (This changed for FreeBSD when r252438 was committed to head, which skipped
       over the d_ino == 0 entries when copying out directories. However, UFS still
       retains the property that the directory offset is monotonically increasing and
       the server can skip ahead to the next entry safely.)
- I have no idea whether or not ZFS directory offsets change when an entry is removed,
  but I do know that ZFS directory offsets are not monotonically increasing values.

The handling of directories has always been a weak spot for NFS, rick