Re: SEEK_HOLE at EOF

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Fri, 05 Apr 2024 14:23:20 UTC
On Fri, Apr 5, 2024 at 7:13 AM alan somers <asomers@gmail.com> wrote:
>
> On Fri, Apr 5, 2024 at 7:54 AM Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
> >
> > --------
> > Alan Somers writes:
> > > On Thu, Apr 4, 2024 at 11:43 PM Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
> >
> > > > Just two minor quibbles:
> > > >
> > > > If the file position is EOF, then you /are/ "beyond the end of the file"
> > > > because a read(2) would not be able to return any data.
> > >
> > > Do you distinguish between "at EOF" and "beyond EOF"?
As a bit of an aside, NFSv4.2 does differentiate between "at EOF"
and "beyond EOF" for its Seek operation.
The fun part is that Linux did not implement what is in the RFC and shipped
to many users before the "bug" was noticed (and still does not conform to the
RFC afaik). As such, there are now two ways to do it: the RFC way or the Linux
way. The sysctl vfs.nfsd.linux42server selects between them.

> > >  And does it not
> > > trouble you that calling SEEK_HOLE from the beginning of the "virtual
> > > hole at EOF" will return ENXIO, even though calling SEEK_HOLE from the
> > > beginning of any real hole will return the current offset?
> >
> > EOF is where the file ends and there's no "hole" there, because there
> > is no more file on the other side of that "hole".
> >
> > When you stand on a cliff, the ocean is not "a hole in the landscape",
> > it's where the landscape ends.
>
> Except there is a hole at EOF, a virtual hole.  The draft spec
> specifically says "all seekable files shall have a virtual hole
> starting at the current size of the file".
I think that they used the term "virtual" to indicate this is not a real hole
and I think it was a good idea, since it allows file systems that do not
support holes to support SEEK_DATA.

However, I still believe that conforming to the Austin Group draft is
preferable.
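
For anyone who wants to see which of the two outcomes discussed in this
thread a given kernel and file system produce, here is a minimal probe. It
is only a sketch: the function name seek_hole_at_eof is made up for
illustration, and it assumes a platform where Python exposes os.SEEK_HOLE.
It reports what happens and makes no claim about which answer is correct.

```python
import errno
import os


def seek_hole_at_eof(fd):
    """Call SEEK_HOLE with the offset exactly at EOF.

    Returns the offset lseek() reports, or None if it fails with ENXIO.
    Any other error is re-raised.
    """
    size = os.fstat(fd).st_size
    try:
        return os.lseek(fd, size, os.SEEK_HOLE)
    except OSError as e:
        if e.errno == errno.ENXIO:
            return None
        raise
```

A return value equal to the file size means the virtual hole was reported
at EOF; None means the call returned ENXIO.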

rick

>
> >
> > > > And returning ENXIO is more informative than returning the size of the
> > > > file, since it atomically tells you that there are no more holes.
> > >
> > > Ahh, that's a good point.  It's the first point I've heard in favor of
> > > this option.  Are you aware of any applications that need to know
> > > that?
> >
> > No, but that should not get in the way of good syscall architecture :-)
> >
> > It might be useful for archivers which try to be smart about sparse files.
>
> I imagine that most archivers would work like this:
> ofs = 0
> loop {
>     let start = lseek(fd, ofs, SEEK_DATA);
>     if ENXIO {
>         // No more data regions
>         break
>     }
>     // search from start: SEEK_HOLE from inside a hole returns the offset itself
>     let end = lseek(fd, start, SEEK_HOLE);
>     // thanks to the virtual hole, we should never have ENXIO here
>     assert!(!ENXIO)
>     copy(fd, start, end - start, ...)
>     ofs = end
> }
> truncate(output_file, fd.fsize)
>
> Since archivers really only care about data regions, not holes, I
> don't think that they would usually call SEEK_HOLE at EOF.
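
A minimal runnable sketch of the loop described above, written in Python
rather than the Rust-like pseudocode. The names data_extents and sparse_copy
are hypothetical, and it assumes a platform where os.SEEK_DATA and
os.SEEK_HOLE are available. It looks for the hole starting from the data
offset it has just found, so each region has a non-negative length.

```python
import errno
import os


def data_extents(fd):
    """Yield (start, end) byte offsets for each data region of fd."""
    ofs = 0
    while True:
        try:
            start = os.lseek(fd, ofs, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:
                return  # no more data regions
            raise
        # start lies inside data, so this finds a real hole or the
        # virtual hole at the current size of the file.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end
        ofs = end


def sparse_copy(src_fd, dst_fd, chunk=1 << 20):
    """Copy only the data regions, then set the output length."""
    for start, end in data_extents(src_fd):
        pos = start
        while pos < end:
            buf = os.pread(src_fd, min(chunk, end - pos), pos)
            if not buf:
                break  # source shrank while we were copying
            view = memoryview(buf)
            while view:
                n = os.pwrite(dst_fd, view, pos)
                view = view[n:]
                pos += n
    # Extend (or trim) the output so a trailing hole is preserved.
    os.ftruncate(dst_fd, os.fstat(src_fd).st_size)
```

Note that this loop never calls SEEK_HOLE at EOF: the only call made with
the offset at EOF is SEEK_DATA, which ends the loop with ENXIO.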
>
> >
> > --
> > Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> > phk@FreeBSD.ORG         | TCP/IP since RFC 956
> > FreeBSD committer       | BSD since 4.3-tahoe
> > Never attribute to malice what can adequately be explained by incompetence.
>