Re: RFC: Should copy_file_range(2) return after a few seconds?
- In reply to: Peter 'PMc' Much: "Re: RFC: Should copy_file_range(2) return after a few seconds?"
Date: Tue, 11 Nov 2025 22:09:50 UTC
On Tue, Nov 11, 2025 at 6:54 AM Peter 'PMc' Much <pmc@citylink.dinoex.sub.org> wrote:
>
> On Mon, Nov 10, 2025 at 12:58:17AM -0800, Bakul Shah wrote:
> ! On Nov 9, 2025, at 12:52 AM, Rick Macklem <rick.macklem@gmail.com> wrote:
> ! >
> ! > On Sat, Nov 8, 2025 at 11:14 PM Ronald Klop <ronald-lists@klop.ws> wrote:
> ! >>
> ! >> Why is this locking needed?
> ! >> AFAIK Unix has advisory locking, so if you read a file somebody else is
> ! >> writing, the result is your own problem. It is up to the applications to
> ! >> adhere to the locking.
> ! >> Is this a lock different than file locking from user space?
> ! > Yes. A rangelock is used for a byte range during a read(2) or
> ! > write(2) to ensure that they are serialized. This is a POSIX
> ! > requirement. (See this post by kib@ in the original email
> ! > discussion. https://lists.freebsd.org/archives/freebsd-fs/2025-October/004704.html)
> ! >
> ! > Since there is no POSIX standard for copy_file_range(), it could
> ! > be argued that range locking isn't required for copy_file_range(),
> ! > but that makes it inconsistent with read(2)/write(2) behaviour.
> ! > (I, personally, am more comfortable with a return after N sec
> ! > than removing the range locking, but that's just my opinion.)
> !
> ! Traditionally reads/writes on Unix were atomic, but that is not the
> ! case for NFS, right? That is, while I am reading a file over NFS,
> ! someone else can modify it from another host (if they have write
> ! permission). That is, AFAIK, the POSIX atomicity requirement for
> ! read/write is broken by NFS except for another reader/writer on
> ! the same host.
> !
> ! Another issue is that a kernel lock that is held for a very, very
> ! long time is asking for trouble. Ideally one spends as little time
> ! as possible in the supervisor state, and any optimization hacks
> ! that push logic into the kernel should strive not to hold locks
> ! for very long, so that things don't grind to a complete halt.
> !
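As an aside on Ronald's question above: the advisory locking he mentions is a separate mechanism from the implicit kernel rangelock. A minimal Python sketch (the filename is made up for illustration, and lockf() here maps to fcntl(2) byte-range locking):

```python
# Sketch: userspace *advisory* byte-range locking, as opposed to the
# implicit kernel rangelock taken by read(2)/write(2) themselves.
# Only cooperating processes that also call lockf() honor this lock;
# a plain read()/write() from an uncooperative process is not blocked.
import fcntl
import os

path = "demo.dat"  # hypothetical file, for illustration only
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, 8192)

# Advisory exclusive lock on the first 4 KiB only (non-blocking).
fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 4096, 0, os.SEEK_SET)
print("advisory lock held on bytes 0-4095")

# Release and clean up.
fcntl.lockf(fd, fcntl.LOCK_UN, 4096, 0, os.SEEK_SET)
os.close(fd)
os.unlink(path)
```

The point of the contrast: applications must opt in to advisory locks, whereas the rangelock under discussion is taken by the kernel on every read(2)/write(2) (and, on FreeBSD, copy_file_range(2)) whether or not the application asked for it.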
> ! That is, copy_file_range() use in cat(1) seems excessive. The only
> ! reason for its use seems to be for improving performance.
>
> Not really. ZFS can do what might be best described as
> "block-level hardlinks": if you copy an entire block, it is not copied
> at all, and, also quite interesting, it doesn't need disk space.
> It is just listed as a duplicate.
>
> So, how big is a ZFS block? That depends - the default was 128k, but
> now often desired is 1M, and people are already talking about 16M (not
> sure if that makes any sense, but probably they will come for 256M
> sooner or later :/ ).
>
> I could imagine some kind of max-copy-file-rangelock setting with a
> default of 16M, tunable for better or worse if really needed.
> But that is an operator's viewpoint; I do not know if this is
> feasible from a design/implementation standpoint.
>
> ! Why not
> ! break it up into smaller chunks? That way you still get the benefit
> ! of reducing syscall overhead (which pales in comparison to any
> ! network reads in any case) + the same skipping over holes. Small
> ! reads/writes is what we did before this syscall raised its head!
>
> Out of curiosity I tried to look at practical size parameters for
> read()/write() syscalls - and that runs with 1GB blocksize, if you
> really want it. So if you do
> $ dd if=/dev/random of=./XX bs=1073741824
> $ cat ./XX
> the cat will wait until the first block is completely
> written, and then deliver exactly 1GB.

Yes. A large read(2) will exhibit the same property (rangelocked, so
other reads are blocked). It's just that read(2) doesn't normally use
really large blocks (as others noted, 16Mbytes is probably the largest
useful size now or in the near future), whereas using a large size for
copy_file_range(2) is useful.

I suppose you could argue for using 16Mbytes instead of SSIZE_MAX,
but I think that will still end up with increased overheads for ZFS,
since it will require many syscalls instead of one.
I am not at home, but when I get home in a couple of weeks I'll see
how long it takes to copy a large file on ZFS with block cloning
enabled, on my very slow hardware with no separate ZIL device.

rick

> >
> cheers,
> PMc