From nobody Sat Oct 02 00:43:47 2021 X-Original-To: freebsd-hackers@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id E7E3317DDD48 for ; Sat, 2 Oct 2021 00:43:56 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4HLpBh4YhXz3rCb; Sat, 2 Oct 2021 00:43:56 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.16.1/8.16.1) with ESMTPS id 1920hlsM012279 (version=TLSv1.3 cipher=TLS_AES_256_GCM_SHA384 bits=256 verify=NO); Sat, 2 Oct 2021 03:43:50 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua 1920hlsM012279 Received: (from kostik@localhost) by tom.home (8.16.1/8.16.1/Submit) id 1920hlAQ012278; Sat, 2 Oct 2021 03:43:47 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Sat, 2 Oct 2021 03:43:47 +0300 From: Konstantin Belousov To: Alan Somers Cc: FreeBSD Hackers Subject: Re: livelock in vfs_bio_getpages with vn_io_fault_uiomove Message-ID: References: List-Id: Technical discussions relating to FreeBSD List-Archive: https://lists.freebsd.org/archives/freebsd-hackers List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-hackers@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FORGED_GMAIL_RCVD,FREEMAIL_FROM, NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.5 X-Spam-Checker-Version: SpamAssassin 3.4.5 (2021-03-20) on tom.home X-Rspamd-Queue-Id: 4HLpBh4YhXz3rCb X-Spamd-Bar: ---- Authentication-Results: mx1.freebsd.org; none X-Spamd-Result: default: False [-4.00 / 15.00]; REPLY(-4.00)[] X-ThisMailContainsUnwantedMimeParts: N On Fri, Oct 01, 2021 at 05:13:10PM -0600, Alan Somers wrote: > On Thu, Sep 30, 2021 at 10:04 PM Konstantin Belousov > wrote: > > > > On Thu, Sep 30, 2021 at 09:59:11PM -0600, Alan Somers wrote: > > > On Thu, Sep 30, 2021 at 9:53 PM Konstantin Belousov wrote: > > > > > > > > On Thu, Sep 30, 2021 at 09:33:24PM -0600, Alan Somers wrote: > > > > > On Thu, Sep 30, 2021 at 9:08 PM Konstantin Belousov wrote: > > > > > > > > > > > > On Thu, Sep 30, 2021 at 03:11:02PM -0600, Alan Somers wrote: > > > > > > > I'm trying to adapt fusefs to use vn_io_fault_uiomove to fix the > > > > > > > deadlock described in the comments above vn_io_fault_doio [^1]. I can > > > > > > > reproduce the deadlock readily enough, and the fix seems simple enough > > > > > > > in other filesystems [^2][^3]. But when I try to apply the same fix > > > > > > > to fusefs, the deadlock changes into a livelock. vfs_bio_getpages > > > > > > > loops infinitely because it reaches the "redo = true" state. But on > > > > > > > looping, it never attempts to read from fusefs again. Instead, > > > > > > > breadn_flags returns 0 without ever calling bufstrategy, from which I > > > > > > > infer that getblkx returned a block with B_CACHE set. Despite that, > > > > > > > at least one of the requested pages in vfs_bio_getpages fails the > > > > > > > vm_page_all_valid(ma[i]) check. Debugging further is wandering > > > > > > > outside my areas of expertise. Could somebody please give me a tip? > > > > > > > What is supposed to mark those pages as valid? Are there any other > > > > > > > undocumented conditions needed to use vn_io_fault_uiomove that msdosfs > > > > > > > and nfscl just happened to already meet? > > > > > > > > > > > > > > Grateful for any help, > > > > > > > -Alan > > > > > > > > > > > > > > [^1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=238340 > > > > > > > [^2] https://github.com/freebsd/freebsd-src/commit/2aa3944510b50cbe6999344985a5a9c3208063b2 > > > > > > > [^3] https://github.com/freebsd/freebsd-src/commit/ddfc47fdc98460b757c6d1dbe4562a0a339f228b > > > > > > > > > > > > Perhaps first you need to confirm that vfs_bio_getpages() sees a buffer > > > > > > with B_CACHE set but some pages still not fully valid (except the last > > > > > > page that is allowed to have invalid tail at EOF). > > > > > > > > > > > > When buffer strategy read returns, vfs_vmio_iodone() updates the pages > > > > > > validity bitmap according to the b_resid field of the buffer. Look there, > > > > > > might be you did not set it properly. Practically, both b_resid and b_bcount > > > > > > must be correctly set after io. > > > > > > > > > > > > In fact, after writing out all that, I realized that I am confused > > > > > > by your question. vn_io_fault_uiomove() needs to be used from > > > > > > VOP_READ/VOP_WRITE. If filesystem utilizes buffer cache, then there is > > > > > > a VOP_STRATEGY() implementation that fullfils the buffer cache requests > > > > > > for buffers reads and writes. VOP_READ and VOP_STRATEGY simply occurs > > > > > > at very different layers of the io stack. Typically, VOP_READ() does > > > > > > bread() which might trigger VOP_STRATEGY() to get the buffer, and then > > > > > > it performs vn_io_fault() to move data from locked buffer to userspace. > > > > > > > > > > > > The fact that your addition of vn_io_fault breaks something in VOP_STRATEGY() > > > > > > does not make sense. > > > > > > > > > > Ahh, that last piece of information is useful. In fusefs, both > > > > > VOP_READ and VOP_STRATEGY can end up going through the same path, > > > > > depending on cache settings, O_DIRECT, etc. And during a VOP_STRATEGY > > > > > read, fusefs needs to use uiomove in order to move data not into the > > > > > user's buffer, but from the fuse daemon into the kernel. But that's > > > > > not the cause of the livelock, because whether I use uiomove or > > > > > vn_io_fault_uiomove during VOP_STRATEGY I still get the same livelock. > > > > > I'll check b_bcount and b_resid tomorrow. > > > > > > > > But uiomove call from VOP_STRATEGY() to copy user data into kernel buffer > > > > is not prepared for vn_io_fault. I suspect that what happens there is > > > > the following: > > > > - top level of syscall, like read(2), does vn_read() > > > > - vn_read() checks conditions and goes through vn_io_fault, calling VOP_READ() > > > > there it prepares prefaulted held pages _for userspace buffer from read(2)_ > > > > - your VOP_READ() calls into VOP_STRATEGY() that tries to copy data into > > > > kernel. If you use vn_io_fault_uiomove() at this point, it wrongly > > > > consumes held pages for unrelated userspace buffer > > > > > > > > Even if there is some additional bug with b_resid, I suspect that the > > > > end result is the mess anyway. You should not use vn_io_fault for recursive > > > > accesses to userspace, only for VOP_READ/VOP_WRITE level accesses. > > > > > > How would vn_io_fault_uiomove "wrong consumes held pages"? Are they > > > attached to the uio? Because the call that copies from userspace > > > (/dev/fuse) into the kernel doesn't have access to the "struct buf". > > It is explained in the comment you referenced. > > > > Pages from the userspace buffer passed to read(2) or write(2) are held > > before calling into VOP_READ/VOP_WRITE. They are stored in the array > > of pages pointed to by curthread->td_ma. vn_io_fault_uiomove() uses that > > array instead of uio if TDP_UIOHELD is set. > > So I've been looking at the read operations all of this time, but the > problematic buffer was actually the BIO_WRITE one. It has B_CACHE but > none of its page are valid. Apparently, VOP_WRITE is supposed to call > vfs_bio_clrbuf() after allocating a new buffer to set those valid > bits? But it never mattered before. I do not understand your question. If B_CACHE is set, all pages in the buffer should be valid, i.e. m->valid bitset should be all 1's _and_ the actual page content should be valid. Just populating the bitset without populating page content means that the user data is corrupted. VOP_WRITE() needs to instantiate the buffer which is written to, and there are two possibilities: - the write request covers the whole buffer. In this case there is no need to read the old content. If the write fails for any reason, e.g. due to uiomove() reporting EFAULT, and there were not B_CACHE set, the buffer is typically invalidated by setting B_INVAL | B_RELBUF | B_NOCACHE and the brelse()ing it. If the buffer had the valid content before uiomove(), it is typically worth to try to save the data, but everything should be valid unless it is invalidated. - if the write request does not cover the whole buffer, it must be read. For instance, FFS specifies BA_CLRBUF to UFS_BALLOC() for partial write, indicating that returned buffer must be read. On uiomove() fault, FFS zeroes non-validated parts of the buffer, de-facto considering unreadable user buffer as zero-filled.