Re: RFC: Should copy_file_range(2) return after a few seconds?

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Mon, 10 Nov 2025 09:09:36 UTC
On Mon, Nov 10, 2025 at 12:58 AM Bakul Shah <bakul@iitbombay.org> wrote:
>
> On Nov 9, 2025, at 12:52 AM, Rick Macklem <rick.macklem@gmail.com> wrote:
> >
> > On Sat, Nov 8, 2025 at 11:14 PM Ronald Klop <ronald-lists@klop.ws> wrote:
> >>
> >> Why is this locking needed?
> >> AFAIK Unix has advisory locking, so if you read a file somebody else is writing the result is your own problem. It is up to the applications to adhere to the locking.
> >> Is this a lock different than file locking from user space?
> > Yes. A rangelock is used for a byte range during a read(2) or
> > write(2) to ensure that they are serialized.  This is a POSIX
> > requirement. (See this post by kib@ in the original email
> > discussion. https://lists.freebsd.org/archives/freebsd-fs/2025-October/004704.html)
> >
> > Since there is no POSIX standard for copy_file_range(), it could
> > be argued that range locking isn't required for copy_file_range(),
> > but that makes it inconsistent with read(2)/write(2) behaviour.
> > (I, personally, am more comfortable with a return after N sec
> > than removing the range locking, but that's just my opinion.)
>
> Traditionally reads/writes on Unix were atomic but that is not the
> case for NFS, right? That is, while I am reading a file over NFS
> someone else can modify it from another host (if they have write
> permission). That is, AFAIK, the POSIX atomicity requirement for
> read / write is broken by NFS except for another reader/writer on
> the same host.
Yes. NFS is not a POSIX compliant file system (and cannot be,
given various aspects of the protocol).
The client can only attempt to approximate POSIX semantics.

>
> Another issue is that a kernel lock that is held for a very very
> long time is asking for trouble. Ideally one spends as little time
> as possible in the supervisor state and any optimization hacks
> that push logic into the kernel should strive to not hold locks
> for very long so that things don't grind to a complete halt.
>
> That is, copy_file_range() use in cat(1) seems excessive. The only
> reason for its use seems to be for improving performance. Why not
> break it up in smaller chunks?
As I noted, for ZFS with block cloning enabled (now the default), the
entire copy happens quickly no matter how large the file, since it
does a copy on write.

To break it up into smaller chunks defeats a lot of this, although
it still might be able to do block cloning so long as the offsets and
lengths are exact multiples of the recordsize (default of 128K these
days, but can vary per file, so an application cannot easily know
what the correct read/write size is to make it work).
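
To make the alignment constraint concrete, here is a rough sketch of how a
caller could size each chunk so that it ends on a record boundary. The
helper name next_chunk_len() and its parameters are hypothetical (nothing
like this exists in libc or ZFS), and the recordsize passed in is only a
guess by the application, for the reason given above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (an assumption for illustration, not a libc or ZFS
 * interface).  Returns the length of the next chunk to copy, chosen so the
 * chunk ends on a multiple of "recsize" whenever that is possible.
 *   off       - current file offset
 *   remaining - bytes left to copy
 *   recsize   - assumed recordsize (e.g. 131072); may be wrong for the file
 *   maxlen    - upper bound on a single chunk, must be >= recsize
 */
uint64_t
next_chunk_len(uint64_t off, uint64_t remaining, uint64_t recsize,
    uint64_t maxlen)
{
	uint64_t len;

	assert(recsize > 0 && maxlen >= recsize);
	if (remaining == 0)
		return (0);

	/* Unaligned start: copy only up to the next record boundary. */
	if (off % recsize != 0) {
		len = recsize - off % recsize;
		return (len < remaining ? len : remaining);
	}

	/* Aligned start: use a whole number of records, capped by maxlen. */
	len = maxlen - maxlen % recsize;

	/* The final chunk may end with a partial record at end of file. */
	return (len < remaining ? len : remaining);
}
```

A caller would pass the returned length to copy_file_range(2) in a loop and
advance the offset by the number of bytes the call actually reports as
copied. Whether a given chunk is really cloned is still up to the file
system, so this only illustrates the arithmetic.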

I did not do the "cat" commit, but I think it is a good idea, rick

> That way you still get the benefit
> of reducing syscall overhead (which pales in comparison to any
> network reads in any case) + the same skipping over holes. Small
> reads/writes are what we did before this syscall raised its head!