Re: Sparse file support in FreeBSD NFSv4.2 server
Date: Mon, 12 May 2025 14:38:28 UTC
On Mon, May 12, 2025 at 2:40 AM Aurélien Couderc
<aurelien.couderc2002@gmail.com> wrote:
>
> On Fri, May 2, 2025 at 1:39 AM Rick Macklem <rick.macklem@gmail.com> wrote:
> >
> > On Thu, May 1, 2025 at 3:30 PM Aurélien Couderc
> > <aurelien.couderc2002@gmail.com> wrote:
> > >
> > > Could you please implement sparse file support in the FreeBSD NFS server?
> > >
> > > Typical users are databases, High Performance Computing, Big Data,
> > > distributed systems and so on.
> > >
> > > Required are:
> > > - NFSv4.2 operation "SEEK", to list sections with data, and sections with holes
> > It is already there.
>
> Thank you
>
> >
> > > - NFSv4.2 operation "ALLOCATE", to allocate disk space
> > Will never happen for ZFS because it is basically impossible. I am not a ZFS
> > guy, but that is what I have been told. UFS can do it, so it can be enabled if
> > all your exports are UFS file systems.
>
> Solaris has fcntl(F_ALLOCSP,...), so this should work on ZFS.
Well, I'm not a ZFS guy, but here is what I understood from the ZFS
folk w.r.t. this:
- When you write data to a file, new blocks are allocated for the data bytes,
even if there is already old data written to those bytes. As such, it is
"impossible" to guarantee that a write will not reply ENOSPC/EDQUOT.
One responder did think it was possible, but listed several major changes
that would be required to make this possible on ZFS. (So "impossible" might
really be "too difficult to ever be implemented".)
- Right now you can "cheat" and set vfs.nfsd.enable_v42allocate=1, which
enables the ALLOCATE operation. The sysctl is meant for servers that
export only UFS file systems, but will "work, although technically broken"
for ZFS exports.
I have no idea what Solaris might be doing. (Note that OpenZFS is separate
from what Solaris is doing w.r.t. ZFS these days.)
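For reference, the usual way an application asks for preallocation is
posix_fallocate(2). A minimal sketch follows. preallocate_file() is a
hypothetical helper, and the expectation that the call turns into ALLOCATE
on an NFSv4.2 mount (when the server has it enabled) is an assumption here,
not something verified for every client.

```c
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create "path" if needed and ask the file system to
 * reserve "len" bytes starting at offset 0.
 * Returns 0 on success, -1 if the file cannot be opened, or the (positive)
 * error number reported by posix_fallocate().
 */
int
preallocate_file(const char *path, off_t len)
{
	int error, fd;

	fd = open(path, O_CREAT | O_WRONLY, 0600);
	if (fd == -1)
		return (-1);
	/* posix_fallocate() returns the error number; it does not set errno. */
	error = posix_fallocate(fd, 0, len);
	close(fd);
	return (error);
}
```

Even when this returns 0, a copy-on-write file system can still fail a later
overwrite of those bytes with ENOSPC/EDQUOT, which is the problem described
above.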
>
> >
> > > - NFSv4.2 operation "DEALLOCATE", to free/deallocate disk space, aka
> > > punching a hole (which does not alter a file's size)
> > It is already there.
>
> Merci
>
> >
> > > - NFSv4.2 operation "READ_PLUS", to read from sparse files. The
> > > returned data are a list of unions, each union either containing the
> > > block data, or the size of the hole
> > > - NFSv4.2 operation "WRITE_SAME", to fill (and thus allocate) files
> > > with repeated patterns (typical fill with zeros, or "empty pattern"
> > > for databases)
> > These are difficult to implement efficiently and, as such, do not provide
> > any real advantage over ordinary READ/WRITE. (The simpler case of
> > "writing zeros" can be done efficiently for some hardware, but that is
> > not what the NFSv4.2 RFC defines.)
>
> I disagree in the READ_PLUS case, as it benefits performance greatly,
> e.g. on a 2500baseT ethernet you can read zeros from holes at around
> 15000MBit :)
Not sure what client you used for these measurements, but here is what
the FreeBSD client currently does:
- A read(2) syscall calls VOP_READ(). The NFS client VOP_READ()
allocates buffer cache block(s), each of a maximum of vfs.maxbcachebuf
bytes (default 64K, can be increased to 1Mbyte). It then issues a READ RPC
for the 64K->1Mbyte.
Doing a whole bunch of 64K READ_PLUS operations is not going to help much.
--> This can be avoided by opening with O_DIRECT, but what applications
will want to do that? (There is also the limit on RPC reply size, which is
currently set at a little over 1Mbyte.)
The application can use SEEK_DATA/SEEK_HOLE to find the data and then
do read(2) syscalls to read the data areas, avoiding reading of lots of bytes
of zeros.
To do it another way would require a new read_plus(2) syscall, which is
not done by any OS that I am aware of (and not spec'd by POSIX as far
as I am aware).
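
A minimal sketch of the SEEK_DATA/SEEK_HOLE approach, assuming the file system
(or the NFSv4.2 mount, via SEEK) reports holes. read_data_regions() is a
made-up name, and the 64K buffer is just an illustrative choice. On a file
system that does not track holes, the whole file is reported as one data
region, so this degrades to an ordinary full read.

```c
#define _GNU_SOURCE	/* exposes SEEK_DATA/SEEK_HOLE on Linux/glibc */
#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: read only the data regions of the file at "path",
 * skipping holes. Returns the number of bytes read from data regions,
 * or -1 on error.
 */
off_t
read_data_regions(const char *path)
{
	char buf[65536];
	off_t pos = 0, end, total = 0;
	size_t want;
	ssize_t n;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return (-1);
	for (;;) {
		/* Find the next byte of data at or after pos. */
		pos = lseek(fd, pos, SEEK_DATA);
		if (pos == -1) {
			/* ENXIO: no more data between pos and EOF. */
			if (errno != ENXIO)
				total = -1;
			break;
		}
		/* Find the end of this data region (EOF counts as a hole). */
		end = lseek(fd, pos, SEEK_HOLE);
		if (end == -1) {
			total = -1;
			break;
		}
		while (pos < end) {
			want = sizeof(buf);
			if ((off_t)want > end - pos)
				want = (size_t)(end - pos);
			n = pread(fd, buf, want, pos);
			if (n < 0) {
				total = -1;
				goto out;
			}
			if (n == 0)	/* file shrank underneath us */
				goto out;
			/* A real application would consume buf[0..n) here. */
			pos += n;
			total += n;
		}
	}
out:
	close(fd);
	return (total);
}
```

The holes are never read, so no zero-filled READ RPCs are needed for them.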
>
> But more seriously, it gives applications and the operating system the
> "hint" that there is no data, and can act accordingly.
Unfortunately such a hint does not exist in FreeBSD. Comparing st_size
with st_blocks (va_size with va_bytes in the kernel struct vattr) only works
when files are not being compressed. (ZFS often does compression.)
Such a hint would, I think, be useful, but I have not explored how this
might be done in FreeBSD (or for ZFS to be more specific).
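
For what it is worth, the st_size/st_blocks comparison looks roughly like the
sketch below. It is only a heuristic: maybe_sparse() is a made-up name, the
512-byte unit for st_blocks is an assumption (it holds on FreeBSD and Linux),
and a compressed ZFS file will be reported as "sparse" even when it has no
holes at all.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical heuristic: returns 1 if fewer bytes are allocated on disk
 * than the file's logical size, 0 if not, -1 if stat(2) fails.
 * A result of 1 does NOT prove there are holes; compression gives the
 * same answer.
 */
int
maybe_sparse(const char *path)
{
	struct stat sb;

	if (stat(path, &sb) == -1)
		return (-1);
	/* Assumption: st_blocks counts 512-byte units. */
	return ((off_t)sb.st_blocks * 512 < sb.st_size);
}
```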
>
> > --> The only advantage they provide without hardware assist is fewer
> > bytes on-the-wire, but modern networks have lots of bandwidth and
> > the overhead of doing it in software nullifies this advantage, imho.
>
> No, callers of READ_PLUS can also act on the information that there is
> no data in this place.
>
> >
> > There was a discussion on linux-nfs@vger.kernel.org which came to
> > essentially the same conclusion, although they did some implementation
> > of READ_PLUS anyhow.
>
> Sigh, and the Linux people again came to a wrong conclusion, and now
> stick to their "opinion". But that does not make them "right".
See above. I think you'll find that the Linux client is very similar, except it
uses pages instead of buffer cache blocks.
rick
>
> Aurélien