RFC: copy_file_range(3)

Fri Oct 2 15:47:39 UTC 2020

[stuff snipped]
Rick Macklem wrote:
>Chris Stephan wrote:
>> New to the list and late to the discussion. I think increasing the len could
>> degrade performance on slower or legacy systems. On the other hand, just
>> picking a len and cutting it off at a hard max seems crude, even with a
>> tunable. My admittedly naive question: could there be a sysctl tunable that
>> sets an estimated timeframe instead of a size? As you said, 100M is roughly
>> a sec on modern hardware. IOPS are already tracked. The inverse of the avg
>> IOPS for the file system in question could be used against this tunable to
>> derive the estimated size limit of the next read/write. This would allow the
>> max len within the syscall to both honor a timeframe before a signal would
>> be handled and maximize efficiency across a large breadth of systems varying
>> in performance. I'm sure it is more complicated than I suggest... just
>> tossing my 2c in.
>Yes. Using time will work for the generic copy case and I think that's a good idea.
>Then we can leave the file system specific cases up to the implementors.
>(For NFSv4.2, once you send the RPC to the server, the client has no control over
> how long it takes to reply. The current sysctl that sets a size is still about all the
> NFSv4.2 code can do.)
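
The time-based idea quoted above can be sketched in a few lines. This is only
an illustration, not code from any patch: the name len_for_time_budget, its
parameters, and the use of a measured throughput in bytes/sec (rather than the
inverse of the average IOPS) are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the FreeBSD tree): derive a copy length
 * from a time budget and an observed throughput, clamped to
 * [min_len, max_len].
 */
static uint64_t
len_for_time_budget(uint64_t bytes_per_sec, uint64_t budget_ms,
    uint64_t min_len, uint64_t max_len)
{
	uint64_t len;

	assert(min_len <= max_len);
	/* Split the multiply so it does not overflow for realistic inputs. */
	len = (bytes_per_sec / 1000) * budget_ms +
	    (bytes_per_sec % 1000) * budget_ms / 1000;
	if (len < min_len)
		len = min_len;
	if (len > max_len)
		len = max_len;
	return (len);
}
```

With a throughput of 100 Mbytes/sec and a 1 sec budget this yields a len of
100 Mbytes, which matches the rule of thumb mentioned in the quoted text.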
When I looked at a wireshark packet trace, it turned out that the Copy RPC
happened quickly and it was the subsequent Commit RPC that could take
1sec or more.
As such, setting a time limit on Copy was not useful.
Testing shows that 16Mbytes/Copy is small enough to keep the Commit RPC
well below 1sec even on really slow server hardware (Pentium 4 with IDE disk).
There was also no appreciable performance improvement for Copy sizes
greater than 16Mbytes for the testing I did.
As such, I think the vfs.nfs.maxcopyrange sysctl with a default of 16Mbytes
is all that can be done for NFSv4.2.

For local file systems, a patch that detects pending signals is in progress.

rick

Thanks for the suggestion, rick

Chris

Sent from FreeBSD
________________________________
From: owner-freebsd-hackers at freebsd.org <owner-freebsd-hackers at freebsd.org> on behalf of Rick Macklem <rmacklem at uoguelph.ca>
Sent: Sunday, September 20, 2020 11:28:21 PM
To: Alan Somers <asomers at freebsd.org>
Cc: FreeBSD Hackers <freebsd-hackers at freebsd.org>
Subject: Re: RFC: copy_file_range(3)

[I have only indented your most recent comments]
Alan Somers wrote:
On Sun, Sep 20, 2020 at 5:14 PM Rick Macklem <rmacklem at uoguelph.ca> wrote:
Alan Somers wrote:
>On Sun, Sep 20, 2020 at 9:58 AM Rick Macklem <rmacklem at uoguelph.ca> wrote:
>>Alan Somers wrote:
>>>copy_file_range(2) is nifty, but it has a few sharp edges:
>>>1) Certain file systems don't support it, necessitating a write/read based
>>>fallback
>>>2) It doesn't handle sparse files as well as SEEK_HOLE/SEEK_DATA
>>>3) It's slightly tricky to both efficiently deal with holes and also
>>>promptly respond to signals
>>>
>>>These problems aren't terribly hard, but it seems to me like most
>>>applications that use copy_file_range would share the exact same
>>>solutions.  In particular, I'm thinking about cp(1), dd(1), and
>>>install(8).  Those three could benefit from sharing a userland wrapper that
>>>handles the above problems.
>>>
>>>Should we add such a wrapper to libc?  If so, what should it be called, and
>>>should it be public or just private to /usr/src ?
>>There has been a discussion on src-committers which I suggested should
>>be taken to a public mailing list.
>>
>>The basic question is...
>>Whether or not the copy_file_range(2) syscall should be compatible with
>>the Linux one.
>>When I did the syscall, I tried to make it Linux-compatible, arguing that
>>Linux is now a de-facto standard.
>>The Linux syscall only works on regular files, which is why Alan's patch for
>>cp required a "fallback to the old way" for VCHR files like /dev/null.
>>
>>He is considering a wrapper in libc to provide FreeBSD specific semantics,
>>which I have no problem with, so long as the naming and man page make
>>it clear that it is not compatible with the Linux syscall.
>>(Personally, I'd prefer a wrapper in libc to making the actual syscall non-Linux
>> compatible, but that is just mho.)
>>
>>Hopefully this helps clarify what Alan is asking, rick
>>
>>I don't think the two questions are equivalent.  I think that copy_file_range(2)
>>ought to work on character devices.  Separately, even if it does, I think a userland
>>wrapper would still be useful.  It would still be able to handle sparse files more
>>efficiently than the kernel-based vn_generic_copy_file_range.
I saw this also stated in your #2 above, but wonder why you think a wrapper
would handle holes more efficiently.
vn_generic_copy_file_range() does look for holes via SEEK_DATA/SEEK_HOLE
just like a wrapper would and retains them as far as possible. It also looks
for blocks of all zero bytes for file systems that do not support SEEK_DATA/
SEEK_HOLE (like NFS versions prior to 4.2) and creates holes for these in
the output file.
--> The only cases that I am aware of where the holes are not retained are:
     - When the min holesize for the output file is larger than that of the
       input file.
     - When the hole straddles the byte range specified for the syscall.
       (Or when the hole straddles two copy_file_range(2) syscalls, if you
        prefer.)
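
The all-zero check mentioned above (for file systems without
SEEK_DATA/SEEK_HOLE) can be illustrated with a minimal sketch. This is not the
code in vn_generic_copy_file_range(); block_is_zero is a hypothetical name, and
a real implementation would likely compare a word at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: report whether a buffer holds only zero bytes, so
 * the caller can skip the write and leave a hole in the output file.
 */
static bool
block_is_zero(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	size_t i;

	assert(buf != NULL || len == 0);
	for (i = 0; i < len; i++) {
		if (p[i] != 0)
			return (false);
	}
	return (true);
}
```

A copy loop using this would advance the output offset past a zero block
instead of writing it, and set the final file size once at the end.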

If you are copying the entire file and do not care how long the syscall
takes (which also implies how long it will take for a termination signal
like <ctrl>C to be handled), the most efficient usage is to specify
a "len" argument equal to UINT64_MAX.
--> This will usually copy the whole file in one gulp, although it is not
       guaranteed to copy everything, given the Linux semantics definition
       of it (an NFSv4.2 server can simply choose to copy less, for example).
       --> This allows the kernel to use whatever block size works efficiently
             and does not require an allocation of a large userspace buffer for
             the data, nor that the data be copied to/from userspace.

The problems with doing the whole file in one gulp are:
- A large file can take quite a while and any signal won't be processed until
  the gulp is done.
  --> If you wrote a program that allocated a 100Gbyte buffer and then
        copied a file using read(2)/write(2) with a size of 100Gbytes in a loop,
        you'd end up with the same result.
- As kib@ noted, if the input file never reports EOF (as /dev/zero does),
  then the "one gulp" wouldn't end until storage is exhausted on the
  output file's device and <ctrl>C wouldn't stop it (since it is one big
  syscall).
  --> As such, I suggested that, if the syscall is extended to allow VCHR,
        the "len" argument be clipped at "K Mbytes" for that case, to
        avoid filling the storage device before being able to <ctrl>C out
        of it.
I suppose the answer for #3 is...
- smaller "len" allows for quicker response to signals
but
- smaller "len" results in less efficient use of the syscall.
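
The trade-off above can be made concrete with a sketch of a userland loop that
calls copy_file_range(2) with a bounded "len" and checks for a stop request
between calls. Everything here is illustrative and hedged: copy_chunked,
copy_fallback, copy_path and the stop_requested flag are hypothetical names (a
SIGINT handler installed by the caller would set the flag), and falling back
to read(2)/write(2) on any error other than EINTR is a simplification.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* exposes copy_file_range(2) on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical flag; a signal handler set up by the caller would set it. */
static volatile sig_atomic_t stop_requested;

/* Plain read(2)/write(2) loop, used when copy_file_range(2) is refused. */
static int
copy_fallback(int infd, int outfd)
{
	char buf[65536];
	ssize_t n, w, off;

	while (stop_requested == 0) {
		n = read(infd, buf, sizeof(buf));
		if (n == 0)
			return (0);		/* EOF */
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return (-1);
		}
		for (off = 0; off < n; ) {
			w = write(outfd, buf + off, (size_t)(n - off));
			if (w < 0) {
				if (errno == EINTR)
					continue;
				return (-1);
			}
			off += w;
		}
	}
	errno = EINTR;
	return (-1);
}

/*
 * Copy from the current offset of infd to EOF, at most "chunk" bytes per
 * syscall.  A smaller chunk means the stop flag is checked more often, at
 * the cost of more syscalls.  Returns 0 on success, -1 otherwise.
 */
static int
copy_chunked(int infd, int outfd, size_t chunk)
{
	ssize_t n;

	assert(chunk > 0);
	while (stop_requested == 0) {
		n = copy_file_range(infd, NULL, outfd, NULL, chunk, 0);
		if (n == 0)
			return (0);		/* EOF on the input */
		if (n < 0 && errno != EINTR) {
			/* e.g. EINVAL for a VCHR input: do it the old way. */
			return (copy_fallback(infd, outfd));
		}
	}
	errno = EINTR;
	return (-1);
}

/* Convenience wrapper: copy the file at src to dst. */
static int
copy_path(const char *src, const char *dst, size_t chunk)
{
	int infd, outfd, error;

	if ((infd = open(src, O_RDONLY)) < 0)
		return (-1);
	outfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (outfd < 0) {
		close(infd);
		return (-1);
	}
	error = copy_chunked(infd, outfd, chunk);
	close(infd);
	if (close(outfd) != 0)
		error = -1;
	return (error);
}
```

A caller would pick "chunk" for its own use case, for example
copy_path("in", "out", 100 * 1024 * 1024) when signal latency of about a
second is acceptable.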

Your patch for "cp" seemed fine, but used a small "len" and, as such,
made the use of copy_file_range(2) less efficient.

All I see the wrapper doing is handling the VCHR case (if the syscall remains
as it is now and returns EINVAL to be compatible with Linux) and making
some rather arbitrary choice w.r.t. how big "len" should be.
--> Choosing an appropriate "len" might better be left to the specific use
      case, I think?

In summary, it's mostly whether VCHR gets handled by the syscall or a
wrapper?

> 1) In order to quickly respond to a signal, a program must use a modest len with
> copy_file_range
Does this matter? Or put another way, is a 1-2sec delay in response to <ctrl>C
an issue for "cp"?
When kib@ reviewed the syscall, he did not see the delay in signal handling
as a significant problem, noting that it is no different than a large process
dumping core.

> 2) If a hole is larger than len, that will cause vn_generic_copy_file_range to
> truncate the output file to the middle of the hole.  Then, in the next invocation,
> truncate it again to a larger size.
> 3) The result is a file that is not as sparse as the original.
>
> For example, on UFS:
> $ truncate -s 1g sparsefile
> $ cp sparsefile sparsefile2
> $ du -sh sparsefile*
>  96K sparsefile
> 32M sparsefile2
If you care about maintaining sparseness, a "len" of 100Mbytes or more would
be a reasonable choice. Since "cp" has never maintained sparseness, I didn't
suggest such a size when I reviewed your patch for "cp".
--> I/O subsystem performance varies widely, but I think 100Mbytes will limit
      the delay in signal handling to about 1sec. Isn't that quick enough?

> My idea for a userland wrapper would solve this problem by using
> SEEK_HOLE/SEEK_DATA to copy holes in their entirety, and use copy_file_range for
> everything else with a modest len.  Alternatively, we could eliminate the need for
> the wrapper by enabling copy_file_range for every file system, and making
> vn_generic_copy_file_range interruptible, so copy_file_range can be called with
> large len without penalizing signal handling performance.
The problem with doing this is it largely defeats the purpose of copy_file_range().
1 - What about file systems that do not support SEEK_DATA/SEEK_HOLE?
     (All NFS mounts except NFSv4.2 ones against servers that support the
      NFSv4.2 Seek operation are in this category.)
2 - For NFSv4.2 with servers that support Seek, the copy of an entire file
     can be done via a few (or only one) RPC if you make "len" large and
     don't use Seek.
     If you combine using Seek with len == 2Mbytes, then you do a lot more RPCs
     with associated overheads and RPC RTT delays. You still avoid moving all
     the data across the wire, but you do lose a lot of the performance advantage.
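
To put rough numbers on point 2, the following sketch counts the minimum
number of copy calls needed for a file. It is a lower bound under stated
assumptions: every call copies its full "len", and any extra Seek RPCs are
ignored. The name copy_calls_needed is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: minimum number of copy calls for file_size bytes
 * when each call moves at most len bytes (ceiling division).
 */
static uint64_t
copy_calls_needed(uint64_t file_size, uint64_t len)
{
	assert(len > 0);
	return (file_size / len + (file_size % len != 0 ? 1 : 0));
}
```

Taking 1 Gbyte as 2^30 bytes, this gives 512 calls at len == 2Mbytes versus
11 calls at len == 100Mbytes, before counting any Seek operations.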

I could have made copy_file_range(2) a lot simpler if the generic code didn't
try and maintain holes, but I wanted it to work well for file systems that did
not support SEEK_DATA/SEEK_HOLE.

I'd suggest you try patching "cp" to use a 100Mbyte "len" for copy_file_range()
and test that.
You should find the sparseness is mostly maintained and that you can <ctrl>C
out of a large file copy without undue delay.
Then try it over NFS mounts (both v4.2 and v3) for the same large sparse file.

You can also code up a patched "cp" using SEEK_DATA/SEEK_HOLE and see
how they compare.
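
For the SEEK_DATA/SEEK_HOLE variant, the core of such a patch would be a loop
that walks the data extents of the input file. The sketch below shows only
that walk, under stated assumptions: next_data_extent and data_bytes are
hypothetical names, and treating the rest of the file as one data extent when
lseek(2) reports EINVAL is just one possible way to handle file systems
without Seek support.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* exposes SEEK_DATA/SEEK_HOLE on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Find the next data extent at or after "from".  Returns 1 and fills in
 * [*start, *end) if one exists, 0 if only a hole remains, -1 on error.
 * Note that lseek(2) moves the file offset, so a caller would copy with
 * explicit offsets.
 */
static int
next_data_extent(int fd, off_t from, off_t *start, off_t *end)
{
	struct stat sb;
	off_t data, hole;

	assert(start != NULL && end != NULL);
	data = lseek(fd, from, SEEK_DATA);
	if (data < 0) {
		if (errno == ENXIO)
			return (0);	/* no more data before EOF */
		if (errno != EINVAL)
			return (-1);
		/* Assumed fallback: no Seek support, so all of it is data. */
		if (fstat(fd, &sb) != 0)
			return (-1);
		if (from >= sb.st_size)
			return (0);
		*start = from;
		*end = sb.st_size;
		return (1);
	}
	hole = lseek(fd, data, SEEK_HOLE);
	if (hole < 0)
		return (-1);
	*start = data;
	*end = hole;
	return (1);
}

/* Example use: total bytes held in data extents, or -1 on error. */
static off_t
data_bytes(const char *path)
{
	off_t pos = 0, start, end, total = 0;
	int fd, r;

	if ((fd = open(path, O_RDONLY)) < 0)
		return (-1);
	while ((r = next_data_extent(fd, pos, &start, &end)) == 1) {
		total += end - start;
		pos = end;
	}
	close(fd);
	return (r < 0 ? -1 : total);
}
```

A patched "cp" would copy each [start, end) range and then set the output
file's size once at the end, so that a trailing hole is preserved.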

rick

-Alan
_______________________________________________
freebsd-hackers at freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscribe at freebsd.org"