seekdir/readdir patch .. Call for Review.

Rick Macklem rmacklem at uoguelph.ca
Tue May 5 23:33:30 UTC 2015


Julian Elischer wrote:
> On 5/3/15 10:33 PM, Jilles Tjoelker wrote:
> > On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov
> > wrote:
> >> On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote:
> >>> if you are interested in readdir(3), seekdir(3) and telldir(3)
> >>> then
> >>> you should look at
> >>>     https://reviews.freebsd.org/D2410
> >>> this patches around a problem in seekdir() that breaks Samba.
> >>> Seekdir(3) will not work as expected when files prior to the
> >>> point of
> >>> interest in directory have been deleted since the directory was
> >>> opened.
> >>> Windows clients using Samba cause both these things to happen,
> >>> causing
> >>> the next readdir(3) after the bad seekdir(3) to skip some entries
> >>> and
> >>> return the wrong file.
> >>> Samba only needs to step back a single directory entry in the
> >>> case
> >>> where it reads an entry and then discovers it can't fit it into
> >>> the
> >>> buffer it is sending to the windows client. It turns out we can
> >>> reliably cater to Samba's requirement because the "last returned
> >>> element" is always still in memory, so with a little care, we can
> >>> set our filepointer back to it safely. (once)
> >>> seekdir and readdir (and telldir()) need a complete rewrite along
> >>> with
> >>> getdirentries() but that is more than a small edit like this.
> >> Can you explain your expectations from the whole readdir() vs.
> >> parallel
> >> directory modifications interaction ?  From what I understood so
> >> far,
> >> there is unlocked modification of the container and parallel
> >> iterator
> >> over the same container.  IMO, in such situation, whatever tweaks
> >> you
> >> apply to the iterator, it is still cannot be made reliable.
> >> Before making single-purpose changes to the libc readdir and
> >> seekdir
> >> code, or to the kernel code, it would be useful to state exact
> >> behaviour
> >> of the dirent machinery we want to see. No, 'make samba works in
> >> my
> >> situation' does not sound good enough.
> > Consider the subsequence of entries that existed at opendir() time
> > and
> > were not removed until now. This subsequence is clearly defined and
> > does
> > not have concurrency problems. The order of this subsequence must
> > remain
> > unchanged and seekdir() must be correct with respect to this
> > subsequence.
> >
> > Additionally, two other kinds of entries may be returned. New
> > entries
> > may be inserted anywhere in between the entries of the subsequence,
> > and
> > removed entries may be returned as if they were still part of the
> > subsequence (so that not every readdir() needs a system call).
> >
> > A simple implementation for UFS-style directories is to store the
> > offset
> > in the directory (all bits of it, not masking off the lower 9
> > bits).
> > This needs d_off or similar in struct dirent. The kernel
> > getdirentries()
> > then needs a similar loop as the old libc seekdir() to go from the
> > start
> > of the 512-byte directory block to the desired entry (since an
> > entry may
> > not exist at the stored offset within the directory block).
> At least it needs some more information in struct dirent than there
> is
> now..
> A cookie is the current fashion..   that assumes however that the
> filesystems
> are capable of converting backwards from cookie to 'location'. ZFS
> claims to
> be able to do so..
My current plan for a patch is...
- d_cookie would be the "physical" file system position of the next
  directory entry
- a ngetdirentries() would take a "physical" cookie as a value/result
  argument. It would indicate where to start and would return the cookie
  for the next entry after what is returned. (It could probably be stuffed
  in uio_offset, but I think it might be clearer to make it a separate arg.)
  --> It would pass this physical cookie down to the file system's
      VOP_READDIR(). (For UFS it would be the byte offset in the on-disk
      directory. For ZFS, I believe it is an index for the entry. For
      NFS, it is the cookie that is sent to the server. For others, I
      don't yet know.)
- dd_seek, dd_loc, loc_seek and loc_loc would be replaced by dd_cookie and
  loc_cookie. (For arches where sizeof(long) == 8, I think telldir() could
  just return the cookie and forget about the loc_XX structures?)
This would get rid of the loc_seek, loc_loc bogosity that no longer makes
much sense, since the byte offset in what is returned by getdirentries()
has little to do with the "physical" directory position.

I have already done the kernel stuff for some file systems and the libc
changes actually simplify things compared to what is there now.

rick

> Thee other thing to do would be to store some sort
> of strong
> hash of the name and inode# in each telldir entry..
> we would seek to the saved seek location and seek forward computing
> or
> looking
> for a matching hash. I woudl also add that the man pages talk about
> the
> readdir blocksize a bit and mention the file blocksize (and stat)
> which is often
> way dfferent from 512 bytes.. usually 16k or so these days.
> I found setting the read size to the same as the fs blocksize seemd
> to
> work well.
> >
> > This means that a UFS-style directory cannot be compacted (existing
> > entries moved from higher to lower offsets to fill holes) while it
> > is
> > open for reading. An NFS exported directory is always open for
> > reading.
> yes so a UFS filesystem that is exported could never do garbage
> collection.
> >
> > This also means that duplicate entries can only be returned if that
> > particular filename was deleted and created again.
> >
> > Without kernel support, it is hard to get telldir/seekdir
> > completely
> > reliable. The current libc implementation is wrong since the
> > "holes"
> > within the block just disappear and change the offsets of the
> > following
> > entries; the kernel cannot fix this using entries with d_fileno = 0
> > since it cannot know, in the general case, how long the deleted
> > entry
> > was in the filesystem-independent dirent format.
> 
> yes it's the filesystem that knows.. we USED to return empty entries
> in the dirent list but that was removed recently think.
> >   My previous idea of
> > storing one d_fileno during telldir() is wrong since it will fail
> > if
> > that entry is deleted.
> >
> > If you do not care about memory usage (which probably is already
> > excessive with the current libc implementation), you could store at
> > telldir() time the offset of the current block returned by
> > getdirentries() and the d_fileno of all entries already returned in
> > the
> > current block.
> >
> > The D2410 patch can conceptually work for what Samba needs,
> > stepping
> > back one directory entry. I will comment on it.
> >
> 
> _______________________________________________
> freebsd-current at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to
> "freebsd-current-unsubscribe at freebsd.org"
> 


More information about the freebsd-current mailing list