seekdir/readdir patch .. Call for Review.

Julian Elischer julian at freebsd.org
Wed May 6 03:54:58 UTC 2015


On 5/6/15 7:33 AM, Rick Macklem wrote:
> Julian Elischer wrote:
>> On 5/3/15 10:33 PM, Jilles Tjoelker wrote:
>>> On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov
>>> wrote:
>>>> On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote:
>>>>> if you are interested in readdir(3), seekdir(3) and telldir(3)
>>>>> then
>>>>> you should look at
>>>>>      https://reviews.freebsd.org/D2410
>>>>> this patches around a problem in seekdir() that breaks Samba.
>>>>> Seekdir(3) will not work as expected when files prior to the
>>>>> point of
>>>>> interest in directory have been deleted since the directory was
>>>>> opened.
>>>>> Windows clients using Samba cause both these things to happen,
>>>>> causing
>>>>> the next readdir(3) after the bad seekdir(3) to skip some entries
>>>>> and
>>>>> return the wrong file.
>>>>> Samba only needs to step back a single directory entry in the
>>>>> case
>>>>> where it reads an entry and then discovers it can't fit it into
>>>>> the
>>>>> buffer it is sending to the windows client. It turns out we can
>>>>> reliably cater to Samba's requirement because the "last returned
>>>>> element" is always still in memory, so with a little care, we can
>>>>> set our filepointer back to it safely. (once)
>>>>> seekdir and readdir (and telldir()) need a complete rewrite along
>>>>> with
>>>>> getdirentries() but that is more than a small edit like this.
>>>> Can you explain your expectations from the whole readdir() vs.
>>>> parallel
>>>> directory modifications interaction ?  From what I understood so
>>>> far,
>>>> there is unlocked modification of the container and parallel
>>>> iterator
>>>> over the same container.  IMO, in such situation, whatever tweaks
>>>> you
>>>> apply to the iterator, it is still cannot be made reliable.
>>>> Before making single-purpose changes to the libc readdir and
>>>> seekdir
>>>> code, or to the kernel code, it would be useful to state exact
>>>> behaviour
>>>> of the dirent machinery we want to see. No, 'make samba works in
>>>> my
>>>> situation' does not sound good enough.
>>> Consider the subsequence of entries that existed at opendir() time
>>> and
>>> were not removed until now. This subsequence is clearly defined and
>>> does
>>> not have concurrency problems. The order of this subsequence must
>>> remain
>>> unchanged and seekdir() must be correct with respect to this
>>> subsequence.
>>>
>>> Additionally, two other kinds of entries may be returned. New
>>> entries
>>> may be inserted anywhere in between the entries of the subsequence,
>>> and
>>> removed entries may be returned as if they were still part of the
>>> subsequence (so that not every readdir() needs a system call).
>>>
>>> A simple implementation for UFS-style directories is to store the
>>> offset
>>> in the directory (all bits of it, not masking off the lower 9
>>> bits).
>>> This needs d_off or similar in struct dirent. The kernel
>>> getdirentries()
>>> then needs a similar loop as the old libc seekdir() to go from the
>>> start
>>> of the 512-byte directory block to the desired entry (since an
>>> entry may
>>> not exist at the stored offset within the directory block).
>> At least it needs some more information in struct dirent than there
>> is
>> now..
>> A cookie is the current fashion..   that assumes however that the
>> filesystems
>> are capable of converting backwards from cookie to 'location'. ZFS
>> claims to
>> be able to do so..
> My current plan for a patch is...
> - d_cookie would be the "physical" file system position of the next
>    directory entry
> - a ngetdirentries() would take a "physical" cookie as a value/result
>    argument. It would indicate where to start and would return the cookie
>    for the next entry after what is returned. (It could probably be stuffed
>    in uio_offset, but I think it might be clearer to make it a separate arg.)
>    --> It would pass this physical cookie down to the file system's
>        VOP_READDIR(). (For UFS it would be the byte offset in the on-disk
>        directory. For ZFS, I believe it is an index for the entry. For
>        NFS, it is the cookie that is sent to the server. For others, I
>        don't yet know.)
> - dd_seek, dd_loc, loc_seek and loc_loc would be replaced by dd_cookie and
>    loc_cookie. (For arches where sizeof(long) == 8, I think telldir() could
>    just return the cookie and forget about the loc_XX structures?)
> This would get rid of the loc_seek, loc_loc bogosity that no longer makes
> much sense, since the byte offset in what is returned by getdirentries()
> has little to do with the "physical" directory position.
>
> I have already done the kernel stuff for some file systems and the libc
> changes actually simplify things compared to what is there now.

zfs already has a cookie production facility as part of the VFS 
readdir (or is it  dir-read?)
method.

>
> rick
>
>> Thee other thing to do would be to store some sort
>> of strong
>> hash of the name and inode# in each telldir entry..
>> we would seek to the saved seek location and seek forward computing
>> or
>> looking
>> for a matching hash. I woudl also add that the man pages talk about
>> the
>> readdir blocksize a bit and mention the file blocksize (and stat)
>> which is often
>> way dfferent from 512 bytes.. usually 16k or so these days.
>> I found setting the read size to the same as the fs blocksize seemd
>> to
>> work well.
>>> This means that a UFS-style directory cannot be compacted (existing
>>> entries moved from higher to lower offsets to fill holes) while it
>>> is
>>> open for reading. An NFS exported directory is always open for
>>> reading.
>> yes so a UFS filesystem that is exported could never do garbage
>> collection.
>>> This also means that duplicate entries can only be returned if that
>>> particular filename was deleted and created again.
>>>
>>> Without kernel support, it is hard to get telldir/seekdir
>>> completely
>>> reliable. The current libc implementation is wrong since the
>>> "holes"
>>> within the block just disappear and change the offsets of the
>>> following
>>> entries; the kernel cannot fix this using entries with d_fileno = 0
>>> since it cannot know, in the general case, how long the deleted
>>> entry
>>> was in the filesystem-independent dirent format.
>> yes it's the filesystem that knows.. we USED to return empty entries
>> in the dirent list but that was removed recently think.
>>>    My previous idea of
>>> storing one d_fileno during telldir() is wrong since it will fail
>>> if
>>> that entry is deleted.
>>>
>>> If you do not care about memory usage (which probably is already
>>> excessive with the current libc implementation), you could store at
>>> telldir() time the offset of the current block returned by
>>> getdirentries() and the d_fileno of all entries already returned in
>>> the
>>> current block.
>>>
>>> The D2410 patch can conceptually work for what Samba needs,
>>> stepping
>>> back one directory entry. I will comment on it.
>>>
>> _______________________________________________
>> freebsd-current at freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-current
>> To unsubscribe, send any mail to
>> "freebsd-current-unsubscribe at freebsd.org"
>>
>



More information about the freebsd-current mailing list