RFC: should the copy_file_range() syscall be atomic?

Thu Jun 13 22:25:49 UTC 2019

On Thu, Jun 13, 2019 at 4:08 PM Konstantin Belousov <kostikbel at gmail.com> wrote:
>
> On Thu, Jun 13, 2019 at 09:44:01PM +0000, Rick Macklem wrote:
> > When I first wrote the copy_file_range() syscall, the updating of the file was atomic,
> > because I locked both vnodes while doing it.
> > kib@ quickly pointed out that this could introduce a LOR/deadlock because two
> > userland threads could concurrently do the syscall with the file arguments reversed.
> >
> > It turns out that locking two VREG vnodes concurrently to do this isn't easy and
> > would require the implementation of non-blocking versions of:
> >   vn_rangelock_rlock() and vn_rangelock_wlock()
> >   - I am not sure how difficult doing this is, but I'll admit I'd rather not do this.
> I will help you with this.  It should be relatively easy, even if requires
> some code re-shuffling.
>
> >
> > Also, having the vnodes locked precludes use of VOP_IOCTL(..FIOSEEKDATA/
> > FIOSEEKHOLE..) calls to find holes in the byte range of the file being copied from.
> If you lock ranges, you still can do ioctl.  But in fact it might be better
> to extract wrapper for FIOSEEK* into kernel function, and use it in the
> ioctl handler and in the copy syscall.
>
> >
> > Without the vnodes locked, it is possible for other threads to write to either of the
> > files concurrently with the copy_file_range(), resulting in an indeterminate results.
> > (cp(1) has never guaranteed atomic copying of files, so is it needed in this syscall?)
> > In summary, doing the syscall non-atomically has the advantages of:
> > - vn_rdwr() takes care of the vnode locking issues, so any changes w.r.t. locking
> >   wouldn't require changes to this syscall's code.
> > - VOP_IOCTL() could be used to find holes.
> > - No new rangelock primitives need to be added to the system.
> > - If there were some combination of file system/storage unit where copying
> >   non-overlapping byte ranges concurrently could result in better performance,
> >   then that could be done. (An atomic syscall would serialize them.)
> >
> > The advantage of an atomic syscall would be consistency with write(2) and read(2)
> > behaviour.
> >
> > The following comments are copied from phabricator:
> > kib@ - So you relock range for each chunk ? This defeats the purpose of the range locking. Should copy_file_range() be atomic WRT other reads and writes ?
> >
> > asomers@ - That has an unfortunate side effect: copy_file_range is no longer atomic if you drop the vnode locks and range locks in the middle. It would be possible for two copy_file_range operations to proceed concurrently, leaving the file in a state where each of the operations was partially successful. A better solution would be to add rangelock_trywlock and rangelock_tryrlock. I don't think it would be very hard (testing them, however, could be).
> >
> > I don't see anything in the Linux man page w.r.t. atomicity, so I am now asking what
> > others think?
> > (I'll admit I'm biased towards non-atomic, since I have already coded it and can use
> >  the VOP_IOCTL() calls to find the holes in the input file, but...)
>
> AFAIK, linux does not provide atomicity for file reads and writes at all,
> even for normal reads and writes.  I do not want to read Linux code to
> confirm this.

Really?  I find that hard to believe.  But sure enough, here's the
proof: (they claim it's fixed now)
http://man7.org/linux/man-pages/man2/write.2.html .  As for
copy_file_range, if it's not implemented by the file system, Linux
falls back to an implementation based on splice(2).  I can't find
anything that says whether splice has atomic writes, and in a quick
scan of the code I can't find any sign of locking.  However, glibc
also contains a userland implementation of copy_file_range.  It works
8KB at a time and is most definitely NOT atomic.

-Alan