Re: RFC: Should copy_file_range(2) return after a few seconds?

From: Peter 'PMc' Much <pmc_at_citylink.dinoex.sub.org>
Date: Tue, 11 Nov 2025 14:50:27 UTC
On Mon, Nov 10, 2025 at 12:58:17AM -0800, Bakul Shah wrote:
! On Nov 9, 2025, at 12:52 AM, Rick Macklem <rick.macklem@gmail.com> wrote:
! > 
! > On Sat, Nov 8, 2025 at 11:14 PM Ronald Klop <ronald-lists@klop.ws> wrote:
! >> 
! >> Why is this locking needed?
! >> AFAIK Unix has advisory locking, so if you read a file somebody else is writing the result is your own problem. It is up to the applications to adhere to the locking.
! >> Is this a lock different than file locking from user space?
! > Yes. A rangelock is used for a byte range during a read(2) or
! > write(2) to ensure that they are serialized.  This is a POSIX
! > requirement. (See this post by kib@ in the original email
! > discussion. https://lists.freebsd.org/archives/freebsd-fs/2025-October/004704.html)
! > 
! > Since there is no POSIX standard for copy_file_range(), it could
! > be argued that range locking isn't required for copy_file_range(),
! > but that makes it inconsistent with read(2)/write(2) behaviour.
! > (I, personally, am more comfortable with a return after N sec
! > than removing the range locking, but that's just my opinion.)
! 
! Traditionally reads/writes on Unix were atomic but that is not the
! case for NFS, right? That is, while I am reading a file over NFS
! someone else can modify it from another host (if they have write
! permission). That is, AFAIK, the POSIX atomicity requirement for
! read/write is broken by NFS except for another reader/writer on
! the same host.
! 
! Another issue is that a kernel lock that is held for a very very
! long time is asking for trouble. Ideally one spends as little time
! as possible in the supervisor state and any optimization hacks
! that push logic into the kernel should strive to not hold locks
! for very long so that things don't grind to a complete halt.
! 
! That is, copy_file_range() use in cat(1) seems excessive. The only
! reason for its use seems to be for improving performance.

Not really. ZFS can do what might best be described as
"block-level hardlinks": if you copy an entire block, it is not
copied at all, and, quite interestingly, it doesn't need extra
diskspace either - the block is just recorded as a duplicate.

So, how big is a ZFS block? That depends: the default recordsize
was 128k, but 1M is often desired nowadays, and people are already
talking about 16M (not sure whether that makes any sense, but they
will probably come for 256M sooner or later :/ ).
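Incidentally, the effect is easy to observe on an OpenZFS 2.2+ pool
with the block_cloning feature enabled. A sketch - the pool name
"tank" and the file names are made up for illustration; the bclone*
properties are the ones documented in zpoolprops(7):

```shell
# Hypothetical demo pool "tank"; requires feature@block_cloning enabled.
dd if=/dev/random of=/tank/a bs=1m count=128   # 128 MB of data
cp /tank/a /tank/b    # cp(1) goes through copy_file_range(2); blocks get cloned
# bcloneused/bclonesaved show the space shared by cloned blocks --
# the copy itself consumed (almost) no additional disk space
zpool get bcloneused,bclonesaved,bcloneratio tank
```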

I could imagine some kind of max-copy-file-rangelock setting with a
default of 16M, tunable up or down if really needed. But that is an
operator's viewpoint; I do not know whether this is feasible from
the design/implementation side.
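For comparison, bounding the request size is something one can
already do from userland and still skip over holes. A sketch with
dd(1) - the 16M chunk size and the /tmp file names are just my
illustration, not anything cat(1) or the kernel actually does:

```shell
# Build a sparse test file: 4 MB of data, a hole up to 32 MB, a short tail.
dd if=/dev/urandom of=/tmp/src bs=1048576 count=4 2>/dev/null
truncate -s 33554432 /tmp/src            # hole from 4M to 32M
printf end >> /tmp/src                   # non-zero tail after the hole

# Copy in bounded 16 MB requests instead of one huge one, so no
# single syscall ties up a byte range for long; conv=sparse seeks
# over the zero blocks instead of writing them.
dd if=/tmp/src of=/tmp/dst bs=16777216 conv=sparse 2>/dev/null

cmp /tmp/src /tmp/dst && echo "copies match"
```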

! Why not
! break it up into smaller chunks? That way you still get the benefit
! of reducing syscall overhead (which pales in comparison to any
! network reads in any case) + the same skipping over holes. Small
! reads/writes are what we did before this syscall raised its head!

Out of curiosity I looked at the practical size limits for
read()/write() syscalls - and they do work with a 1GB blocksize, if
you really want it. So if you run
$ dd if=/dev/random of=./XX bs=1073741824
and, in parallel,
$ cat ./XX
the cat will wait until the first block is completely
written, and then deliver exactly 1GB.


cheers,
PMc