RFC: should the copy_file_range() syscall be atomic?

Thu Jun 13 22:32:42 UTC 2019

Konstantin Belousov wrote:
>On Thu, Jun 13, 2019 at 09:44:01PM +0000, Rick Macklem wrote:
>> When I first wrote the copy_file_range() syscall, the updating of the file was atomic,
>> because I locked both vnodes while doing it.
>> kib@ quickly pointed out that this could introduce a LOR/deadlock because two
>> userland threads could concurrently do the syscall with the file arguments reversed.
>>
>> It turns out that locking two VREG vnodes concurrently to do this isn't easy and
>> would require the implementation of non-blocking versions of:
>>   vn_rangelock_rlock() and vn_rangelock_wlock()
>>   - I am not sure how difficult doing this is, but I'll admit I'd rather not do this.
>I will help you with this.  It should be relatively easy, even if requires
>some code re-shuffling.
Ok, I'll wait a while to see if others think this should be done. Then if they do, I'll
take a first stab at it and then pass it along to you.

>>
>> Also, having the vnodes locked precludes use of VOP_IOCTL(..FIOSEEKDATA/
>> FIOSEEKHOLE..) calls to find holes in the byte range of the file being copied from.
>If you lock ranges, you still can do ioctl.
Ok, so you are proposing to lock the byte ranges, but lock/unlock the vnodes
and do the vn_start_write()/vn_finished_write() for each block read/written so the FIOSEEK* can be done in the loop?

That sounds fine with me, once the concurrent range locks can be acquired safely.

> But in fact it might be better
>to extract wrapper for FIOSEEK* into kernel function, and use it in the
>ioctl handler and in the copy syscall.
It's just a VOP_IOCTL() call, so I don't see a need to wrap it?
(The only trick is that the vnode must be unlocked when you do the call.)

>>
>> Without the vnodes locked, it is possible for other threads to write to either of the
>> files concurrently with the copy_file_range(), resulting in an indeterminate results.
>> (cp(1) has never guaranteed atomic copying of files, so is it needed in this syscall?)
>> In summary, doing the syscall non-atomically has the advantages of:
>> - vn_rdwr() takes care of the vnode locking issues, so any changes w.r.t. locking
>>   wouldn't require changes to this syscall's code.
>> - VOP_IOCTL() could be used to find holes.
>> - No new rangelock primitives need to be added to the system.
>> - If there were some combination of file system/storage unit where copying
>>   non-overlapping byte ranges concurrently could result in better performance,
>>   then that could be done. (An atomic syscall would serialize them.)
>>
>> The advantage of an atomic syscall would be consistency with write(2) and read(2)
>> behaviour.
>>
>> The following comments are copied from phabricator:
>> kib@ - So you relock range for each chunk ? This defeats the purpose of the range locking. Should copy_file_range() be atomic WRT other reads and writes ?
>>
>> asomers@ - That has an unfortunate side effect: copy_file_range is no longer atomic if you drop the vnode locks and range locks in the middle. It would be possible for two copy_file_range operations to proceed concurrently, leaving the file in a state where each of the operations was partially successful. A better solution would be to add rangelock_trywlock and rangelock_tryrlock. I don't think it would be very hard (testing them, however, could be).
>>
>> I don't see anything in the Linux man page w.r.t. atomicity, so I am now asking what
>> others think?
>> (I'll admit I'm biased towards non-atomic, since I have already coded it and can use
>>  the VOP_IOCTL() calls to find the holes in the input file, but...)
>
>AFAIK, linux does not provide atomicity for file reads and writes at all,
>even for normal reads and writes.  I do not want to read Linux code to
>confirm this.
Heh, heh. I don't want to read Linux code to verify this either;-)
I end up reading their NFS code once in a while, but that's enough for me.

However, it does mean that no one will expect atomic behaviour if Linux doesn't
provide it.

One amusing property of this syscall is that it fails if offset+len exceeds EOF.
The example in the Linux man page first stat(2)s the input file to find out its
size and then copies that much using copy_file_range(2) in a loop.
Of course, if the size changes between the stat(2) and the copy_file_range(2)
calls, all bets are off.
Also, the Linux man page says it can return having copied less than requested,
although it doesn't explain when/why. It does state that exceeding EOF returns
EINVAL, so it isn't that it hits EOF on the input file.