Re: git: 99cb3dca4773 - main - vnode: Rework vput() to avoid holding the vnode lock after decrementing

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Sat, 15 Nov 2025 04:18:57 UTC
On Sat, Nov 15, 2025 at 3:12 AM John Baldwin <jhb@freebsd.org> wrote:
>
> On 11/11/25 09:59, Mark Johnston wrote:
> > The branch main has been updated by markj:
> >
> > URL: https://cgit.FreeBSD.org/src/commit/?id=99cb3dca4773fe4a16c500f9cb55fcd62cd8d7f3
> >
> > commit 99cb3dca4773fe4a16c500f9cb55fcd62cd8d7f3
> > Author:     Mark Johnston <markj@FreeBSD.org>
> > AuthorDate: 2025-11-11 14:47:06 +0000
> > Commit:     Mark Johnston <markj@FreeBSD.org>
> > CommitDate: 2025-11-11 14:58:59 +0000
> >
> >      vnode: Rework vput() to avoid holding the vnode lock after decrementing
> >
> >      It is not safe to modify the vnode structure after releasing one's
> >      reference.  Modify vput() to avoid this.  Use refcount_release_if_last()
> >      to opportunistically call vput_final() with the vnode lock held since we
> >      need the vnode lock in order to deactivate the vnode, and it's silly to
> >      drop the vnode lock and immediately reacquire it in this common case.
> >
> >      Note that vunref() has a similar flaw.  D52628 aims to fix the problem
> >      more holistically, but this change fixes observable panics in the
> >      meantime.
> >
> >      Reported by:    syzbot+6676b3ff282d590b0fb3@syzkaller.appspotmail.com
> >      Reported by:    syzbot+38e26cf6f959e886f110@syzkaller.appspotmail.com
> >      Reviewed by:    kib
> >      MFC after:      3 days
> >      Differential Revision:  https://reviews.freebsd.org/D52608
> > ---
> >   sys/kern/vfs_subr.c | 7 ++++---
> >   1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
> > index 58975f7ac932..9cf983f6f89d 100644
> > --- a/sys/kern/vfs_subr.c
> > +++ b/sys/kern/vfs_subr.c
> > @@ -3713,11 +3713,12 @@ vput(struct vnode *vp)
> >
> >       ASSERT_VOP_LOCKED(vp, __func__);
> >       ASSERT_VI_UNLOCKED(vp, __func__);
> > -     if (!refcount_release(&vp->v_usecount)) {
> > -             VOP_UNLOCK(vp);
> > +     if (refcount_release_if_last(&vp->v_usecount)) {
> > +             vput_final(vp, VPUT);
> >               return;
> >       }
> > -     vput_final(vp, VPUT);
> > +     VOP_UNLOCK(vp);
> > +     vrele(vp);
> >   }
>
> This commit results in reliable hangs for me when I do a 'make -j 16 buildkernel'
> in a VM where the source directory is mounted over NFS.  Even if there is nothing
> to build so the make is just traversing all the directories.  The hung process is
> stuck on a vnode lock.
>
> Some of the stacks from procstat:
>
>    PID    TID COMM                TDNAME              KSTACK
>   1858 100343 make                -                   mi_switch+0x172 sleepq_switch+0x109 sleeplk+0x110 lockmgr_slock_hard+0x3c3 VOP_LOCK1_APV+0x32 nfs_lock+0x2c vop_sigdefer+0x30 VOP_LOCK1_APV+0x32 _vn_lock+0x53 vget_finish+0xaf cache_lookup+0x2be nfs_lookup+0x4e1 vop_sigdefer+0x30 VOP_LOOKUP_APV+0x57 vfs_lookup+0x5aa namei+0x35d vn_open_cred+0x592 openatfp+0x2b7
> root@head:~ # procstat -kk 1860
>   1860 100360 make                -                   mi_switch+0x172 sleepq_switch+0x109 sleeplk+0x110 lockmgr_slock_hard+0x3c3 VOP_LOCK1_APV+0x32 nfs_lock+0x2c vop_sigdefer+0x30 VOP_LOCK1_APV+0x32 _vn_lock+0x53 vget_finish+0xaf cache_lookup+0x2be nfs_lookup+0x4e1 vop_sigdefer+0x30 VOP_LOOKUP_APV+0x57 vfs_lookup+0x5aa namei+0x35d vn_open_cred+0x592 openatfp+0x2b7
> root@head:~ # procstat -kk 1855
>   1855 100314 make                -                   mi_switch+0x172 sleepq_switch+0x109 sleeplk+0x110 lockmgr_xlock_hard+0x3d0 VOP_LOCK1_APV+0x32 nfs_lock+0x2c vop_sigdefer+0x30 VOP_LOCK1_APV+0x32 _vn_lock+0x53 vput_final+0x269 vput+0xab vfs_lookup+0xa7a namei+0x35d vn_open_cred+0x592 openatfp+0x2b7 sys_open+0x28 amd64_syscall+0x169 fast_syscall_common+0xf8
>
> This last one is kind of interesting as it calls vput_final?
>
> (kgdb) p vp
> $2 = (struct vnode *) 0xfffff8016bb6c898
> (kgdb) p vp->v_lock
> $3 = {lock_object = {lo_name = 0xffffffff81467118 <nfs_vnode_tag> "nfs",
>      lo_flags = 117112832, lo_data = 0, lo_witness = 0xfffff8043fd83900},
>    lk_lock = 39, lk_exslpfail = 0, lk_pri = 44, lk_timo = 6}
> (kgdb) p/x vp->v_lock->lk_lock
> $4 = 0x27
>
> So there is one shared lock, and both shared and exclusive waiters.  Seems like
> a vnode lock has been leaked?  Or rather, the vput_final() in this case is
> called from vrele(), so the vput() for 1855 has hit the new path.  I wonder if
> when it dropped its lock via VOP_UNLOCK() it is no longer eligible to re-lock
> it as it needs some other sharer to also unlock before it can re-acquire the
> lock in vput_final?  LK_UPGRADE attempts don't block if there are shared waiters
> but only one 1 shared lock held, but if you do LK_UNLOCK and then LK_XLOCK that
> will indeed honor the other waiters and not acquire the lock, so a vput that
> held the last shared lock would have "worked" to ugprade before even with
> other waiters, but now it can drop the lock and then get stuck.  I guess the
> question though in this case is that some other thread must have snuck in
> and acquired a shared lock between the VOP_UNLOCK and the vrele() and is
> blocked while holding it.
>
> Aha, so most of the other makes are stuck in VOP_LOOKUP on this same vnode
> (which I believe is "/usr/src/sys/dev" as in 1855 the vput() is inside of
> namei() and this is the state of ndp:
>
> (kgdb) p *ndp
> $9 = {
>    ni_dirp = 0x481afdaa2d7d <error: Cannot access memory at address 0x481afdaa2d7d>, ni_segflg = UIO_USERSPACE, ni_rightsneeded = 0xfffffe00d83d2d90,
>    ni_startdir = 0x0, ni_rootdir = 0xfffff80005e461b8, ni_topdir = 0x0,
>    ni_dirfd = -100, ni_lcf = 0, ni_filecaps = {fc_rights = {cr_rights = {0,
>          0}}, fc_ioctls = 0x0, fc_nioctls = -1, fc_fcntls = 0},
>    ni_vp = 0xfffff801bc007000, ni_dvp = 0xfffff8016bb6c898, ni_resflags = 1,
>    ni_debugflags = 3, ni_loopcnt = 0, ni_pathlen = 1,
>    ni_next = 0xfffff8000a0b0817 "", ni_cnd = {cn_flags = 8936259908,
>      cn_cred = 0xfffff80005e53c00, cn_nameiop = LOOKUP, cn_lkflags = 524288,
>      cn_pnbuf = 0xfffff8000a0b0800 "/usr/src/sys/dev/acpica",
>      cn_nameptr = 0xfffff8000a0b0811 "acpica", cn_namelen = 6},
>    ni_cap_tracker = {tqh_first = 0x0, tqh_last = 0xfffffe00d83d2d48},
>    ni_rbeneath_dpp = 0x0, ni_nctrack_mnt = 0x0, ni_dvp_seqc = 0, ni_vp_seqc = 0}
>
> But pid 1844 is a make blocked on another vnode:
>
> (kgdb) p vp
> $24 = (struct vnode *) 0xfffff801bc007000
>
> And that one appears to be the "acpica" child directory:
>
> (kgdb) p *vp->v_cache_dst.tqh_first
> $27 = {nc_src = {le_next = 0xfffff801bc01b9b8, le_prev = 0xfffff801bc01b898},
>    nc_dst = {tqe_next = 0x0, tqe_prev = 0xfffff801bc007058}, nc_hash = {
>      csle_next = 0x0}, nc_dvp = 0xfffff8016bb6c898, n_un = {
>      nu_vp = 0xfffff801bc007000, nu_neg = {neg_flag = 0 '\000',
>        neg_hit = 112 'p'}}, nc_flag = 12 '\f', nc_nlen = 6 '\006',
>    nc_name = 0xfffff801bc01b962 "acpica"}
>
> And note it's nc_dvp is the vnode 1855 is trying to vput and all the other
> makes are trying to VOP_LOCK in nfs_lookup().
>
> So my guess is that 1844 was able to VOP_LOCK() "/usr/src/sys/dev" in the
> window between VOP_UNLOCK() and the vput_final() invoked from vrele().
>
> Humm, except that the vnode in question shouldn't be in vput_final at all??
>
> Back to 1855 (proc doing vput_final):
>
> (kgdb)
> #12 0xffffffff80c9bc89 in vput_final (vp=0xfffff8016bb6c898,
>      func=func@entry=VRELE) at /usr/src/sys/kern/vfs_subr.c:3628
> 3628                            error = vn_lock(vp, LK_EXCLUSIVE | LK_INTERLOCK);
> (kgdb) p vp
> $6 = (struct vnode *) 0xfffff8016bb6c898
> (kgdb) p vp->v_usecount
> $7 = 14
> (kgdb) p vp->v_holdcnt
> $8 = 7
>
> Or did the other make's manage to acquire new use counts while 1855 was blocked
> on the lock in vput_final?
>
> Anyway, it seems 14 make's are blocked on this vnode (/usr/src/sys/dev).  1844
> is blocked on the "acpica" vnode, and another make process (1856) is blocked on
> "acpica", but via getdirentries() instead of namei():
>
> #4  0xffffffff80b62ec0 in sleeplk (lk=lk@entry=0xfffff801bc007070,
>      flags=flags@entry=2098176, ilk=ilk@entry=0xfffff801bc007098,
>      wmesg=wmesg@entry=0xffffffff81467118 <nfs_vnode_tag> "nfs",
>      pri=<optimized out>, pri@entry=44, timo=timo@entry=6, queue=1)
>      at /usr/src/sys/kern/kern_lock.c:302
> #5  0xffffffff80b60fc3 in lockmgr_slock_hard (lk=0xfffff801bc007070,
>      flags=2098176, ilk=0xfffff801bc007098, file=<optimized out>, line=4333,
>      lwa=0x0) at /usr/src/sys/kern/kern_lock.c:693
> #6  0xffffffff811b97a2 in VOP_LOCK1_APV (
>      vop=0xffffffff81aeeb38 <default_vnodeops>, a=a@entry=0xfffffe00d8445c88)
>      at vnode_if.c:2103
> #7  0xffffffff80a6050c in nfs_lock (ap=ap@entry=0xfffffe00d8445c88)
>      at /usr/src/sys/fs/nfsclient/nfs_clvnops.c:344
> #8  0xffffffff80c83d90 in vop_sigdefer (vop=<optimized out>,
>      a=0xfffffe00d8445c88) at /usr/src/sys/kern/vfs_default.c:1502
> #9  0xffffffff811b97a2 in VOP_LOCK1_APV (
>      vop=0xffffffff81aada28 <newnfs_vnodeops>, a=a@entry=0xfffffe00d8445c88)
>      at vnode_if.c:2103
> #10 0xffffffff80cb6b43 in VOP_LOCK1 (vp=0xfffff801bc007000, flags=2098176,
>      file=0xffffffff81359deb "/usr/src/sys/kern/vfs_syscalls.c", line=4333)
>      at ./vnode_if.h:1316
> #11 _vn_lock (vp=vp@entry=0xfffff801bc007000, flags=flags@entry=2098176,
>      file=<unavailable>, line=<unavailable>, line@entry=4333)
>      at /usr/src/sys/kern/vfs_vnops.c:1970
> #12 0xffffffff80cb2d94 in kern_getdirentries (td=0xfffff80006cb5000,
>      fd=<optimized out>,
>      buf=0x2088dba27000 <error: Cannot access memory at address 0x2088dba27000>, count=4096, basep=basep@entry=0xfffffe00d8445df0, residp=residp@entry=0x0,
>      bufseg=UIO_USERSPACE) at /usr/src/sys/kern/vfs_syscalls.c:4333
> #13 0xffffffff80cb32a9 in sys_getdirentries (td=<unavailable>,
>      uap=0xfffff80006cb5428) at /usr/src/sys/kern/vfs_syscalls.c:4287
>
> The "acpica" lock is held by a writer at least:
>
> (kgdb) p vp
> $9 = (struct vnode *) 0xfffff801bc007000
> (kgdb) p vp->v_lock
> $10 = {lock_object = {lo_name = 0xffffffff81467118 <nfs_vnode_tag> "nfs",
>      lo_flags = 117112832, lo_data = 0, lo_witness = 0xfffff8043fd83900},
>    lk_lock = 18446735277917198210, lk_exslpfail = 0, lk_pri = 44, lk_timo = 6}
> (kgdb) p/x vp->v_lock.lk_lock
> $14 = 0xfffff80011ebd782
> (kgdb) set $td = (struct thread *)0xfffff80011ebd780
> (kgdb) p $td->td_tid
> $15 = 100314
> (kgdb) p $td->td_proc->p_comm
> $16 = "make", '\000' <repeats 15 times>
> (kgdb) p $td->td_proc->p_pid
> $17 = 1855
>
> Oh dear, so that's the deadlock.
>
> 1855 needs to get an exclusive lock on "/usr/src/sys/dev/" to finish its vput()
> _while_ it holds an exclusive lock on "/usr/src/sys/dev/acpica".  And the other
> shared lock on "/usr/src/sys/dev" is held by 1844 who is blocked trying to share
> lock "/usr/src/sys/dev/acpica" during lookup().
>
> So by doing the VOP_UNLOCK() and then reacquiring the lock via vrele(), you've
> introduced a vnode LOR as the vput() of the parent directory can now try to
> re-lock the vnode of the parent directory while we hold a vnode on the
> child vnode.
>

This should be easy to fix though.

While the routine must not vrele, the VPUT should be patched to
trylock (aka LK_NOWAIT) and EAGAIN if that fails, which in turn will
punt the vnode to a workqueue.

However, one has to wonder if failing to lock here will be frequent
enough to constitute a problem. There is a counter for these
(deferred_inactive).

The following was only boot-tested:
diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 58975f7ac932..5cde78162e6a 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -3640,11 +3640,8 @@ vput_final(struct vnode *vp, enum vput_op func)
                break;
        case VPUT:
                want_unlock = true;
-               if (VOP_ISLOCKED(vp) != LK_EXCLUSIVE) {
-                       error = VOP_LOCK(vp, LK_UPGRADE | LK_INTERLOCK |
-                           LK_NOWAIT);
-                       VI_LOCK(vp);
-               }
+               error = VOP_LOCK(vp, LK_EXCLUSIVE | LK_INTERLOCK | LK_NOWAIT);
+               VI_LOCK(vp);
                break;
        case VUNREF:
                if (VOP_ISLOCKED(vp) != LK_EXCLUSIVE) {
@@ -3672,12 +3669,9 @@ vput_final(struct vnode *vp, enum vput_op func)
        }
        if (func == VUNREF)
                vp->v_vflag &= ~VV_UNREF;
-       vdropl(vp);
-       return;
 out:
-       if (func == VPUT)
-               VOP_UNLOCK(vp);
        vdropl(vp);
+       return;
 }

 /*
@@ -3713,8 +3707,8 @@ vput(struct vnode *vp)

        ASSERT_VOP_LOCKED(vp, __func__);
        ASSERT_VI_UNLOCKED(vp, __func__);
+       VOP_UNLOCK(vp);
        if (!refcount_release(&vp->v_usecount)) {
-               VOP_UNLOCK(vp);
                return;
        }
        vput_final(vp, VPUT);