From nobody Fri Oct 01 23:13:10 2021 X-Original-To: freebsd-hackers@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id AFA9B17D3EAF for ; Fri, 1 Oct 2021 23:13:28 +0000 (UTC) (envelope-from asomers@gmail.com) Received: from mail-ot1-f43.google.com (mail-ot1-f43.google.com [209.85.210.43]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1O1" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4HLmBJ4WFtz3hJK for ; Fri, 1 Oct 2021 23:13:28 +0000 (UTC) (envelope-from asomers@gmail.com) Received: by mail-ot1-f43.google.com with SMTP id o59-20020a9d2241000000b0054745f28c69so13342518ota.13 for ; Fri, 01 Oct 2021 16:13:28 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=ssTP6kWyca/2ZWVZclQTG2yHle5HCQIlvKADi/6gfek=; b=calQARxCzu2KLDi5se7JEEfTwRWDtu+/eF0kI/i+OT3sQoM6LoDU4KZayqZ6XK5bCZ coKxGIwyIf0tADXQbTSY6yDOwzecbJM01e1Bz1UnfS70iZKWAdFqBOCwWFtUsK3kZkDS qZGcSnQIbGsrj1JdWNv4J/P2ezztE9qTtgefv2JSQ/pt80106k4T4FpKA7Ama4SoT8Uu ug1lmdOG9Uz1f3tkfY/lFbiPK39fX4rUnxG/b5K0DfDYlAPTWKVf12jjgim7RaY9DhV9 LogZIqGq9+QZAEIOnBtgO0Kpj/cK7sG0mXji42GfdipizwSK41zZZ0Qv6wV4d1MWkBNe RpOA== X-Gm-Message-State: AOAM532FP+st1EfdaIUBGr1iTWU+mIHz3kX9d3sagWwnzL2KnT0ne//V pA5adXTJN9w2PeZzOzy+6MvRnAk/JoZn7AN76Es9dYsY X-Google-Smtp-Source: ABdhPJyCb9qh5xgEt1IuRK+1lFKvzeAJL6SA7I7civ1fW3/cufnP1iWzSt/6CPsXml3MGRJCymUUEvF+nS0eQ3jBbjA= X-Received: by 2002:a05:6830:1509:: with SMTP id k9mr396086otp.111.1633130002135; Fri, 01 Oct 2021 16:13:22 -0700 (PDT) List-Id: Technical discussions relating to FreeBSD List-Archive: https://lists.freebsd.org/archives/freebsd-hackers List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-hackers@freebsd.org MIME-Version: 1.0 References: In-Reply-To: From: Alan Somers Date: Fri, 1 Oct 2021 17:13:10 -0600 Message-ID: Subject: Re: livelock in vfs_bio_getpages with vn_io_fault_uiomove To: Konstantin Belousov Cc: FreeBSD Hackers Content-Type: text/plain; charset="UTF-8" X-Rspamd-Queue-Id: 4HLmBJ4WFtz3hJK X-Spamd-Bar: ---- Authentication-Results: mx1.freebsd.org; none X-Spamd-Result: default: False [-4.00 / 15.00]; REPLY(-4.00)[] X-ThisMailContainsUnwantedMimeParts: N On Thu, Sep 30, 2021 at 10:04 PM Konstantin Belousov wrote: > > On Thu, Sep 30, 2021 at 09:59:11PM -0600, Alan Somers wrote: > > On Thu, Sep 30, 2021 at 9:53 PM Konstantin Belousov wrote: > > > > > > On Thu, Sep 30, 2021 at 09:33:24PM -0600, Alan Somers wrote: > > > > On Thu, Sep 30, 2021 at 9:08 PM Konstantin Belousov wrote: > > > > > > > > > > On Thu, Sep 30, 2021 at 03:11:02PM -0600, Alan Somers wrote: > > > > > > I'm trying to adapt fusefs to use vn_io_fault_uiomove to fix the > > > > > > deadlock described in the comments above vn_io_fault_doio [^1]. I can > > > > > > reproduce the deadlock readily enough, and the fix seems simple enough > > > > > > in other filesystems [^2][^3]. But when I try to apply the same fix > > > > > > to fusefs, the deadlock changes into a livelock. vfs_bio_getpages > > > > > > loops infinitely because it reaches the "redo = true" state. But on > > > > > > looping, it never attempts to read from fusefs again. Instead, > > > > > > breadn_flags returns 0 without ever calling bufstrategy, from which I > > > > > > infer that getblkx returned a block with B_CACHE set. Despite that, > > > > > > at least one of the requested pages in vfs_bio_getpages fails the > > > > > > vm_page_all_valid(ma[i]) check. Debugging further is wandering > > > > > > outside my areas of expertise. Could somebody please give me a tip? > > > > > > What is supposed to mark those pages as valid? Are there any other > > > > > > undocumented conditions needed to use vn_io_fault_uiomove that msdosfs > > > > > > and nfscl just happened to already meet? > > > > > > > > > > > > Grateful for any help, > > > > > > -Alan > > > > > > > > > > > > [^1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=238340 > > > > > > [^2] https://github.com/freebsd/freebsd-src/commit/2aa3944510b50cbe6999344985a5a9c3208063b2 > > > > > > [^3] https://github.com/freebsd/freebsd-src/commit/ddfc47fdc98460b757c6d1dbe4562a0a339f228b > > > > > > > > > > Perhaps first you need to confirm that vfs_bio_getpages() sees a buffer > > > > > with B_CACHE set but some pages still not fully valid (except the last > > > > > page that is allowed to have invalid tail at EOF). > > > > > > > > > > When buffer strategy read returns, vfs_vmio_iodone() updates the pages > > > > > validity bitmap according to the b_resid field of the buffer. Look there, > > > > > might be you did not set it properly. Practically, both b_resid and b_bcount > > > > > must be correctly set after io. > > > > > > > > > > In fact, after writing out all that, I realized that I am confused > > > > > by your question. vn_io_fault_uiomove() needs to be used from > > > > > VOP_READ/VOP_WRITE. If filesystem utilizes buffer cache, then there is > > > > > a VOP_STRATEGY() implementation that fullfils the buffer cache requests > > > > > for buffers reads and writes. VOP_READ and VOP_STRATEGY simply occurs > > > > > at very different layers of the io stack. Typically, VOP_READ() does > > > > > bread() which might trigger VOP_STRATEGY() to get the buffer, and then > > > > > it performs vn_io_fault() to move data from locked buffer to userspace. > > > > > > > > > > The fact that your addition of vn_io_fault breaks something in VOP_STRATEGY() > > > > > does not make sense. > > > > > > > > Ahh, that last piece of information is useful. In fusefs, both > > > > VOP_READ and VOP_STRATEGY can end up going through the same path, > > > > depending on cache settings, O_DIRECT, etc. And during a VOP_STRATEGY > > > > read, fusefs needs to use uiomove in order to move data not into the > > > > user's buffer, but from the fuse daemon into the kernel. But that's > > > > not the cause of the livelock, because whether I use uiomove or > > > > vn_io_fault_uiomove during VOP_STRATEGY I still get the same livelock. > > > > I'll check b_bcount and b_resid tomorrow. > > > > > > But uiomove call from VOP_STRATEGY() to copy user data into kernel buffer > > > is not prepared for vn_io_fault. I suspect that what happens there is > > > the following: > > > - top level of syscall, like read(2), does vn_read() > > > - vn_read() checks conditions and goes through vn_io_fault, calling VOP_READ() > > > there it prepares prefaulted held pages _for userspace buffer from read(2)_ > > > - your VOP_READ() calls into VOP_STRATEGY() that tries to copy data into > > > kernel. If you use vn_io_fault_uiomove() at this point, it wrongly > > > consumes held pages for unrelated userspace buffer > > > > > > Even if there is some additional bug with b_resid, I suspect that the > > > end result is the mess anyway. You should not use vn_io_fault for recursive > > > accesses to userspace, only for VOP_READ/VOP_WRITE level accesses. > > > > How would vn_io_fault_uiomove "wrong consumes held pages"? Are they > > attached to the uio? Because the call that copies from userspace > > (/dev/fuse) into the kernel doesn't have access to the "struct buf". > It is explained in the comment you referenced. > > Pages from the userspace buffer passed to read(2) or write(2) are held > before calling into VOP_READ/VOP_WRITE. They are stored in the array > of pages pointed to by curthread->td_ma. vn_io_fault_uiomove() uses that > array instead of uio if TDP_UIOHELD is set. So I've been looking at the read operations all of this time, but the problematic buffer was actually the BIO_WRITE one. It has B_CACHE but none of its page are valid. Apparently, VOP_WRITE is supposed to call vfs_bio_clrbuf() after allocating a new buffer to set those valid bits? But it never mattered before.