Re: livelock in vfs_bio_getpages with vn_io_fault_uiomove

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Fri, 01 Oct 2021 04:04:06 UTC
On Thu, Sep 30, 2021 at 09:59:11PM -0600, Alan Somers wrote:
> On Thu, Sep 30, 2021 at 9:53 PM Konstantin Belousov <kostikbel@gmail.com> wrote:
> >
> > On Thu, Sep 30, 2021 at 09:33:24PM -0600, Alan Somers wrote:
> > > On Thu, Sep 30, 2021 at 9:08 PM Konstantin Belousov <kostikbel@gmail.com> wrote:
> > > >
> > > > On Thu, Sep 30, 2021 at 03:11:02PM -0600, Alan Somers wrote:
> > > > > I'm trying to adapt fusefs to use vn_io_fault_uiomove to fix the
> > > > > deadlock described in the comments above vn_io_fault_doio [^1].  I can
> > > > > reproduce the deadlock readily enough, and the fix seems simple enough
> > > > > in other filesystems [^2][^3].  But when I try to apply the same fix
> > > > > to fusefs, the deadlock changes into a livelock.  vfs_bio_getpages
> > > > > loops infinitely because it reaches the "redo = true" state.  But on
> > > > > looping, it never attempts to read from fusefs again.  Instead,
> > > > > breadn_flags returns 0 without ever calling bufstrategy, from which I
> > > > > infer that getblkx returned a block with B_CACHE set.  Despite that,
> > > > > at least one of the requested pages in vfs_bio_getpages fails the
> > > > > vm_page_all_valid(ma[i]) check.  Debugging further is wandering
> > > > > outside my areas of expertise.  Could somebody please give me a tip?
> > > > > What is supposed to mark those pages as valid?  Are there any other
> > > > > undocumented conditions needed to use vn_io_fault_uiomove that msdosfs
> > > > > and nfscl just happened to already meet?
> > > > >
> > > > > Grateful for any help,
> > > > > -Alan
> > > > >
> > > > > [^1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=238340
> > > > > [^2] https://github.com/freebsd/freebsd-src/commit/2aa3944510b50cbe6999344985a5a9c3208063b2
> > > > > [^3] https://github.com/freebsd/freebsd-src/commit/ddfc47fdc98460b757c6d1dbe4562a0a339f228b
> > > >
> > > > Perhaps first you need to confirm that vfs_bio_getpages() sees a buffer
> > > > with B_CACHE set but with some pages still not fully valid (except the
> > > > last page, which is allowed to have an invalid tail at EOF).
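
[To confirm that, an untested sketch of a check that could be dropped
into vfs_bio_getpages() or the fusefs completion path; the buf and
vm_page accessors are the real interface, the placement is up to you:]

	if ((bp->b_flags & B_CACHE) != 0) {
		int i;

		for (i = 0; i < bp->b_npages; i++) {
			if (!vm_page_all_valid(bp->b_pages[i]))
				printf("bp %p: page %d of %d not fully valid\n",
				    bp, i, bp->b_npages);
		}
	}
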
> > > >
> > > > When a buffer strategy read returns, vfs_vmio_iodone() updates the
> > > > pages' validity bitmaps according to the b_resid field of the buffer.
> > > > Look there; it may be that you did not set it properly.  In practice,
> > > > both b_resid and b_bcount must be set correctly after the io.
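
[To make that contract concrete, an untested sketch of a completion
path; the function name and the "copied" byte count are hypothetical,
while the buf fields and bufdone() are the real interface:]

	static void
	fuse_io_biodone(struct buf *bp, ssize_t copied)
	{
		/*
		 * b_bcount is the size of the request; b_resid is the
		 * part of it that was NOT transferred.  vfs_vmio_iodone()
		 * marks buffer pages valid based on b_resid.
		 */
		if (copied < 0) {
			bp->b_ioflags |= BIO_ERROR;
			bp->b_error = EIO;
			bp->b_resid = bp->b_bcount;
		} else {
			bp->b_resid = bp->b_bcount - copied;
		}
		bufdone(bp);
	}
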
> > > >
> > > > In fact, after writing out all that, I realized that I am confused
> > > > by your question.  vn_io_fault_uiomove() needs to be used from
> > > > VOP_READ/VOP_WRITE.  If the filesystem utilizes the buffer cache,
> > > > then there is a VOP_STRATEGY() implementation that fulfills the
> > > > buffer cache requests for buffer reads and writes.  VOP_READ and
> > > > VOP_STRATEGY simply occur at very different layers of the io stack.
> > > > Typically, VOP_READ() does bread(), which might trigger
> > > > VOP_STRATEGY() to fill the buffer, and then it performs
> > > > vn_io_fault_uiomove() to move data from the locked buffer to
> > > > userspace.
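
[The usual shape, as an untested sketch; "bsize" is an assumed block
size, and the locking and the loop over blocks are elided:]

	static int
	xxx_read(struct vop_read_args *ap)
	{
		struct uio *uio = ap->a_uio;
		struct buf *bp;
		const long bsize = 4096;	/* assumed block size */
		daddr_t lbn = uio->uio_offset / bsize;
		long on = uio->uio_offset % bsize;
		long n = MIN(bsize - on, uio->uio_resid);
		int error;

		/* May recurse into VOP_STRATEGY() to fill the buffer. */
		error = bread(ap->a_vp, lbn, bsize, NOCRED, &bp);
		if (error != 0)
			return (error);
		/*
		 * Move data from the locked buffer to userspace.  With
		 * TDP_UIOHELD set, this copies through the held pages
		 * and cannot fault while the buffer lock is held.
		 */
		error = vn_io_fault_uiomove(bp->b_data + on, n, uio);
		brelse(bp);
		return (error);
	}
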
> > > >
> > > > It does not make sense that your addition of vn_io_fault breaks
> > > > something in VOP_STRATEGY().
> > >
> > > Ahh, that last piece of information is useful.  In fusefs, both
> > > VOP_READ and VOP_STRATEGY can end up going through the same path,
> > > depending on cache settings, O_DIRECT, etc.  And during a VOP_STRATEGY
> > > read, fusefs needs to use uiomove not to move data into the user's
> > > buffer, but to move data from the fuse daemon into the kernel.  But
> > > that's not the cause of the livelock, because whether I use uiomove or
> > > vn_io_fault_uiomove during VOP_STRATEGY I still get the same livelock.
> > > I'll check b_bcount and b_resid tomorrow.
> >
> > But the uiomove() call from VOP_STRATEGY() that copies user data into
> > the kernel buffer is not prepared for vn_io_fault.  I suspect that
> > what happens there is the following:
> > - the top level of the syscall, like read(2), does vn_read()
> > - vn_read() checks conditions and goes through vn_io_fault(), which
> >   prepares prefaulted held pages _for the userspace buffer from
> >   read(2)_ before calling VOP_READ()
> > - your VOP_READ() calls into VOP_STRATEGY(), which tries to copy data
> >   into the kernel.  If you use vn_io_fault_uiomove() at this point, it
> >   wrongly consumes the held pages for an unrelated userspace buffer
> >
> > Even if there is some additional bug with b_resid, I suspect that the
> > end result is a mess anyway.  You should not use vn_io_fault for
> > recursive accesses to userspace, only for VOP_READ/VOP_WRITE-level
> > accesses.
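
[In other words: the daemon-to-kernel copy inside the fusefs strategy
path should stay plain uiomove().  A fragment, with the surrounding
fusefs details and variable names left hypothetical:]

	/*
	 * This uio describes the fuse daemon's I/O, not the original
	 * read(2) caller's buffer, so the pages held in curthread->td_ma
	 * do not correspond to it.  Plain uiomove() is the right
	 * primitive here.
	 */
	error = uiomove(bp->b_data, xfersize, uiop);
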
> 
> How would vn_io_fault_uiomove() "wrongly consume held pages"?  Are they
> attached to the uio?  Because the call that copies from userspace
> (/dev/fuse) into the kernel doesn't have access to the "struct buf".
It is explained in the comment you referenced.

Pages from the userspace buffer passed to read(2) or write(2) are held
before calling into VOP_READ/VOP_WRITE.  They are stored in the array
of pages pointed to by curthread->td_ma.  vn_io_fault_uiomove() uses
that array instead of the uio's userspace addresses when TDP_UIOHELD is
set in curthread->td_pflags.
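
Roughly, the entry of vn_io_fault_uiomove() in sys/kern/vfs_vnops.c
looks like this (simplified, copy loop elided):

	int
	vn_io_fault_uiomove(char *data, int xfersize, struct uio *uio)
	{
		struct thread *td = curthread;

		/*
		 * Without held pages, or for a kernel-space uio, this
		 * degenerates to plain uiomove().
		 */
		if ((td->td_pflags & TDP_UIOHELD) == 0 ||
		    uio->uio_segflg != UIO_USERSPACE)
			return (uiomove(data, xfersize, uio));
		/*
		 * Otherwise the copy goes through the pages in
		 * td->td_ma, which belong to the buffer of the original
		 * read(2)/write(2) caller.  A recursive call with an
		 * unrelated userspace uio therefore consumes pages that
		 * do not match that uio's addresses.
		 */
		/* ... copy via td->td_ma ... */
	}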