getdirentries cookies usage outside of UFS

Mon Apr 14 14:26:43 UTC 2014

Bryan Drewery wrote:
> Recently I was working on a compat syscall for sys_getdirentries() that
> converts between our dirent and the FreeBSD dirent struct. We had never
> tried using this on TMPFS and when we did ran into weird issues (hence
> my recent commits to TMPFS to clarify some of the getdirentries() code).
> We were not using cookies, so I referenced the Linux compat module
> (linux_file.c getdents_common())
>
> I ran across this code:
> >     /*
> >      * When using cookies, the vfs has the option of reading from
> >      * a different offset than that supplied (UFS truncates the
> >      * offset to a block boundary to make sure that it never reads
> >      * partway through a directory entry, even if the directory
> >      * has been compacted).
> >      */
> >     while (len > 0 && ncookies > 0 && *cookiep <= off) {
> >             bdp =3D (struct dirent *) inp;
> >             len -=3D bdp->d_reclen;
> >             inp +=3D bdp->d_reclen;
> >             cookiep++;
> >             ncookies--;
> >     }
>
>
> At first it looked innocuous but then it occurred to me it was the root
> of the issue I was having as it was eating my cookies based on their
> value, despite tmpfs cookies being random hash values that have no
> sequential relation. So I looked at how NFS was handling the same code
> and found this lovely hack from r216691:
>
> > not_zfs =3D strcmp(vp->v_mount->mnt_vfc->vfc_name, "zfs");
> >...
> >  while (cpos < cend && ncookies > 0 &&
> >      (dp->d_fileno =3D=3D 0 || dp->d_type =3D=3D DT_WHT ||
> >       (not_zfs !=3D 0 && ((u_quad_t)(*cookiep)) <=3D toff))) {
> >          cpos +=3D dp->d_reclen;
> >          dp =3D (struct dirent *)cpos;
> >          cookiep++;
> >          ncookies--;
> >  }
>
> I ended up doing the opposite, only running the code if getting dirents
> from "ufs".
>
> So there's multiple issue here.
>
> 1. NFS is broken on TMPFS. I can see why it's gone so long unnoticed,
> why would you do that? Still probably worth fixing.
>
Well, since exporting a volatile file system over NFS definitely breaks
the protocol, the correct fix should probably be to disable TMPFS so that
it cannot be exported at all. However, someone probably likes to do this
anyhow, so I guess it should be left as is or fixed...

> 2. Linux and SVR4 getdirentries() are both broken on TMPFS/ZFS. I am
> surprised Linux+ZFS has not been noticed by now. I am aware the SVR4 is
> full of other bugs too. I ran across many just reviewing the
> getdirentries code alone.
>
> Do any other file systems besides UFS do this offset/cookie
> truncation/rewind? If UFS is the only one it may be acceptable to change
> this zfs check to !ufs and add it to the other modules. If we don't like
> that, or there are potentially other file systems doing this too, how
> about adding a flag to somewhere to indicate the file system has
> monotonically increasing offsets and needs this rewind support. I'm not
> sure where that is best done, struct vfsconf?
>
At a glance, I don't think UFS returns directory entries starting at a block
boundary any more, either. (It reads from a block boundary, but skips the
ones before the startoffset, if I read ufs_readdir() correctly.)
As such, I'm not sure this is needed for UFS?

However, it is going to be difficult to determine if this loop is no
longer necessary for any file system.

If if can't be determined that this is no longer necessary at all, a
flag indicating a file system uses non-monotonically increasing directory
offsets seems the cleanest fix to me.
However, this change might be difficult to MFC. If is was a "non-monotonically
increasing dir offset" flag that is set, old code would still function the
same without knowledge of the flag, so long as the NFS server case was
changed to "not_zfs and flag no set" and the "not_zfs" test was still done.
(Of course, then the "not_zfs" case will probably be in the sources for decades
 to come, because no one can be sure it is safe to get rid of;-)

It would be nice if it could be determined if any file system still backs
up to a block boundary, but I can't answer that.

I've cc'd Kirk and Kostik, in case either of them know or have opinions on this?

rick
ps: You might also take a look for uio_offset < 0 tests, since a -ve offset
    used to be meaningful. (Sorry, I can't remember when/how this was used, but
    if the high order bit of a directory offset cookie is set, there might be
    surprises;-) Maybe this isn't a problem for dirs, if I recall it correctly?

> --=20
> Regards,
> Bryan Drewery