[Bug 253992] ufsdirhash_build() sleeps with directory vnode locked due to M_WAITOK

bugzilla-noreply at freebsd.org bugzilla-noreply at freebsd.org
Wed Mar 3 18:18:55 UTC 2021


https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253992

            Bug ID: 253992
           Summary: ufsdirhash_build() sleeps with directory vnode locked
                    due to M_WAITOK
           Product: Base System
           Version: 12.1-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs at FreeBSD.org
          Reporter: dgmorris at earthlink.net

On a system where a timeout panic has been added for attempts to acquire a UFS
root vnode lock, such a panic was observed. The holder of the lock was
attempting to do:

ched_switch() at sched_switch+0xbcc/frame 0xfffffe885e4e5e10
mi_switch() at mi_switch+0x29b/frame 0xfffffe885e4e5e60
sleepq_wait() at sleepq_wait+0x2c/frame 0xfffffe885e4e5e90
_sleep() at _sleep+0x265/frame 0xfffffe885e4e5f10
vm_wait_doms() at vm_wait_doms+0xfd/frame 0xfffffe885e4e5f40
vm_wait_domain() at vm_wait_domain+0x50/frame 0xfffffe885e4e5f70
vm_domain_alloc_fail() at vm_domain_alloc_fail+0x91/frame 0xfffffe885e4e5fb0
vm_page_alloc_domain_after() at vm_page_alloc_domain_after+0x243/frame 
0xfffffe885e4e6030
uma_small_alloc() at uma_small_alloc+0x9b/frame 0xfffffe885e4e6080
keg_alloc_slab() at keg_alloc_slab+0x103/frame 0xfffffe885e4e60d0
zone_import() at zone_import+0x106/frame 0xfffffe885e4e6170
zone_alloc_item() at zone_alloc_item+0x6e/frame 0xfffffe885e4e61b0
cache_alloc_retry() at cache_alloc_retry+0x13c/frame 0xfffffe885e4e6210
uma_zalloc_arg() at uma_zalloc_arg+0x114/frame 0xfffffe885e4e6270
ufsdirhash_build() at ufsdirhash_build+0x850/frame 0xfffffe885e4e6310
ufs_lookup_ino() at ufs_lookup_ino+0x1a2/frame 0xfffffe885e4e6420
VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0x8a/frame 0xfffffe885e4e6450
vfs_cache_lookup() at vfs_cache_lookup+0xd6/frame 0xfffffe885e4e64b0
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x8a/frame 0xfffffe885e4e64e0
lookup() at lookup+0x756/frame 0xfffffe885e4e6590
namei() at namei+0x46c/frame 0xfffffe885e4e6660
_vn_open_cred() at _vn_open_cred+0x1e1/frame 0xfffffe885e4e67f0
_vn_open() at _vn_open+0x1f/frame 0xfffffe885e4e6820
isi_kern_openat() at isi_kern_openat+0x3b4/frame 0xfffffe885e4e6a10
sys_openat() at sys_openat+0x5a/frame 0xfffffe885e4e6a70
amd64_syscall() at amd64_syscall+0x381/frame 0xfffffe885e4e6bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe885e4e6bf0
— syscall (499, FreeBSD ELF64, sys_openat), rip = 0x8005b0d3a, rsp = 
0x7fffffffe368, rbp = 0x7fffffffe3e0 —

The system is ccNUMA, with 2 domains. One domain is OOM, the other is not:
db> show pageq
pq_free 1887153
dom 0 page_cnt 8102741 free 2 pq_act 57 pq_inact 1 pq_laund 0 pq_unsw 311370
dom 1 page_cnt 8133281 free 1887151 pq_act 17391 pq_inact 4868531 
pq_laund 0 pq_unsw 186545

The DIRHASH uma zone is created with flags argument of 0, so acquires the
default NUMA policy (FIRSTTOUCH). The problem allocation is on line 437:

        dh->dh_hash = malloc(narrays * sizeof(dh->dh_hash[0]),
            M_DIRHASH, M_NOWAIT | M_ZERO);
        if (dh->dh_hash == NULL)
                goto fail;
        dh->dh_blkfree = malloc(nblocks * sizeof(dh->dh_blkfree[0]),
            M_DIRHASH, M_NOWAIT);
        if (dh->dh_blkfree == NULL)
                goto fail;
        for (i = 0; i < narrays; i++) {
                if ((dh->dh_hash[i] = DIRHASH_BLKALLOC_WAITOK()) == NULL)
                        goto fail;
                for (j = 0; j < DH_NBLKOFF; j++)
                        dh->dh_hash[i][j] = DIRHASH_EMPTY;
        }

Where:
#define DIRHASH_BLKALLOC_WAITOK()       uma_zalloc(ufsdirhash_zone, M_WAITOK)
 from line 73.

As can be seen, the code is even assuming the allocation can fail as if it were
NOWAIT, but is instead doing M_WAITOK while holding the directory vnode lock
per sys/ufs/ufs/ufs_lookup.c line 254:
#ifdef DEBUG_VFS_LOCKS
        /*
         * Assert that the directory vnode is locked, and locked
         * exclusively for the last component lookup for modifying
         * operations.
         *
         * The directory-modifying operations need to save
         * intermediate state in the inode between namei() call and
         * actual directory manipulations.  See fields in the struct
         * inode marked as 'used during directory lookup'.  We must
         * ensure that upgrade in namei() does not happen, since
         * upgrade might need to unlock vdp.  If quotas are enabled,
         * getinoquota() also requires exclusive lock to modify inode.
         */
        ASSERT_VOP_LOCKED(vdp, "ufs_lookup1");
        if ((nameiop == CREATE || nameiop == DELETE || nameiop == RENAME) &&
            (flags & (LOCKPARENT | ISLASTCN)) == (LOCKPARENT | ISLASTCN))
                ASSERT_VOP_ELOCKED(vdp, "ufs_lookup2");
#endif

restart:
        bp = NULL;
        slotoffset = -1;

        /*
         * We now have a segment name to search for, and a directory to search.
         *
         * Suppress search for slots unless creating
         * file and at end of pathname, in which case
         * we watch for a place to put the new file in
         * case it doesn't already exist.
         */
        ino = 0;
        i_diroff = dp->i_diroff;
        slotstatus = FOUND;
        slotfreespace = slotsize = slotneeded = 0;
        if ((nameiop == CREATE || nameiop == RENAME) &&
            (flags & ISLASTCN)) {
                slotstatus = NONE;
                slotneeded = DIRECTSIZ(cnp->cn_namelen);
        }

#ifdef UFS_DIRHASH
        /*
         * Use dirhash for fast operations on large directories. The logic
         * to determine whether to hash the directory is contained within
         * ufsdirhash_build(); a zero return means that it decided to hash
         * this directory and it successfully built up the hash table.
         */
        if (ufsdirhash_build(dp) == 0) {

The NUMA system may be more likely to hit this in the case of imbalanced load
(as the system as a whole may not be OOM, just the domain), but the risk is
there on any system.

I propose the following diff as the WAITOK allocation is only used in one
place:

dmorris at wkstn-dmorris2:freebsd_ufs $ git diff -p origin/main
diff --git a/sys/ufs/ufs/ufs_dirhash.c b/sys/ufs/ufs/ufs_dirhash.c
index 13094b5a1c97..72f38be5eea4 100644
--- a/sys/ufs/ufs/ufs_dirhash.c
+++ b/sys/ufs/ufs/ufs_dirhash.c
@@ -108,7 +108,7 @@ static uma_zone_t   ufsdirhash_zone;

 #define DIRHASHLIST_LOCK()             mtx_lock(&ufsdirhash_mtx)
 #define DIRHASHLIST_UNLOCK()           mtx_unlock(&ufsdirhash_mtx)
-#define DIRHASH_BLKALLOC_WAITOK()      uma_zalloc(ufsdirhash_zone, M_WAITOK)
+#define DIRHASH_BLKALLOC_NOWAIT()      uma_zalloc(ufsdirhash_zone, M_NOWAIT)
 #define DIRHASH_BLKFREE(ptr)           uma_zfree(ufsdirhash_zone, (ptr))
 #define        DIRHASH_ASSERT_LOCKED(dh)                                      
\
     sx_assert(&(dh)->dh_lock, SA_LOCKED)
@@ -425,7 +425,7 @@ ufsdirhash_build(struct inode *ip)
        if (dh->dh_blkfree == NULL)
                goto fail;
        for (i = 0; i < narrays; i++) {
-               if ((dh->dh_hash[i] = DIRHASH_BLKALLOC_WAITOK()) == NULL)
+               if ((dh->dh_hash[i] = DIRHASH_BLKALLOC_NOWAIT()) == NULL)
                        goto fail;
                for (j = 0; j < DH_NBLKOFF; j++)
                        dh->dh_hash[i][j] = DIRHASH_EMPTY;

-- 
You are receiving this mail because:
You are the assignee for the bug.


More information about the freebsd-bugs mailing list