Re: RFC: Should copy_file_range(2) return after a few seconds?

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Mon, 10 Nov 2025 10:10:05 UTC
On Mon, Nov 10, 2025 at 12:58 AM Bakul Shah <bakul@iitbombay.org> wrote:
>
> On Nov 9, 2025, at 12:52 AM, Rick Macklem <rick.macklem@gmail.com> wrote:
> >
> > On Sat, Nov 8, 2025 at 11:14 PM Ronald Klop <ronald-lists@klop.ws> wrote:
> >>
> >> Why is this locking needed?
> >> AFAIK Unix has advisory locking, so if you read a file somebody else is writing the result is your own problem. It is up to the applications to adhere to the locking.
> >> Is this a lock different than file locking from user space?
> > Yes. A rangelock is used for a byte range during a read(2) or
> > write(2) to ensure that they are serialized.  This is a POSIX
> > requirement. (See this post by kib@ in the original email
> > discussion. https://lists.freebsd.org/archives/freebsd-fs/2025-October/004704.html)
> >
> > Since there is no POSIX standard for copy_file_range(), it could
> > be argued that range locking isn't required for copy_file_range(),
> > but that makes it inconsistent with read(2)/write(2) behaviour.
> > (I, personally, am more comfortable with a return after N sec
> > than removing the range locking, but that's just my opinion.)
>
> Traditionally reads/writes on Unix were atomic but that is not the
> case for NFS, right? That is, while I am reading a file over NFS
> someone else can modify it from another host (if they have write
> permission). That is, AFAIK, the POSIX atomicity requirement for
> read / write is broken by NFS except for another reader/writer on
> the same host.
>
> Another issue is that a kernel lock that is held for a very very
> long time is asking for trouble. Ideally one spends as little time
> as possible in the supervisor state and any optimization hacks
> that push logic into the kernel should strive to not hold locks
> for very long so that things don't grind to a complete halt.
>
> That is, copy_file_range() use in cat(1) seems excessive. The only
> reason for its use seems to be for improving performance. Why not
> break it up in smaller chunks?
Btw, the time limit I proposed does break it up into smaller chunks.
The difference is that it specifies a chunk size that can be copied in
K seconds instead of a chunk size of N Kbytes. (The problem with
using N Kbytes is there is no way to know what N should be.)
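
To illustrate the difference, here is a rough user-space sketch of a
time-bounded copy. It is only an analogy, not the kernel code and not
anything from the actual patch: copy_for_at_most(), copy_all(), the
budget_s/block/clock parameters and the 64 Kbyte block size are all
made-up names and values. The idea is that each call copies as much as
it can until the time budget runs out and then returns a partial count,
so the caller simply loops, as it already must for a short return.

```python
import io
import time


def copy_for_at_most(src, dst, budget_s, block=64 * 1024, clock=time.monotonic):
    """Copy from src to dst until EOF or until about budget_s seconds pass.

    Returns (bytes_copied, done); done is True only when EOF was reached.
    Assumes buffered binary file objects, whose write() takes the whole buffer.
    """
    deadline = clock() + budget_s
    copied = 0
    while True:
        buf = src.read(block)
        if not buf:
            return copied, True
        dst.write(buf)
        copied += len(buf)
        # The deadline is checked only between blocks, so one call can
        # overrun the budget by at most the time one block takes.
        if clock() >= deadline:
            return copied, False


def copy_all(src, dst, budget_s, **kw):
    """Caller-side loop: keep calling until the copy reports EOF."""
    total = 0
    calls = 0
    while True:
        n, done = copy_for_at_most(src, dst, budget_s, **kw)
        total += n
        calls += 1
        if done:
            return total, calls


if __name__ == "__main__":
    data = b"x" * (1024 * 1024)
    out = io.BytesIO()
    print(copy_all(io.BytesIO(data), out, budget_s=1.0))
```

On fast storage one call moves a lot of data and on slow storage it
moves little, but in both cases the time spent per call stays roughly
bounded, which a fixed N Kbytes cannot guarantee.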

rick

> That way you still get the benefit
> of reducing syscall overhead (which pales in comparison to any
> network reads in any case) + the same skipping over holes. Small
> reads/writes is what we did before this syscall raised its head!