Re: vnode_init took over 6 minutes

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Tue, 31 Oct 2023 16:06:08 UTC
On 10/31/23, John F Carr <jfc@mit.edu> wrote:
>
>
>> On Oct 31, 2023, at 11:42, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> On 10/31/23, John F Carr <jfc@mit.edu> wrote:
>>>
>>>
>>>> On Oct 31, 2023, at 10:16, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>>>
>>>> On 10/31/23, John F Carr <jfc@mit.edu> wrote:
>>>>> During a poudriere build control-T showed an awk process was stuck
>>>>> for over 6 minutes trying to acquire a lock in vnode_init.
>>>>>
>>>>> load: 8.58  cmd: awk 51238 [running] 392.88r 0.00u 392.88s 100% 2604k
>>>>> __mtx_lock_sleep+0xf8 __mtx_lock_flags+0xa4 vnode_init+0xc3
>>>>> keg_alloc_slab+0x277 zone_import+0x143 cache_alloc+0x3ed
>>>>> cache_alloc_retry+0x23 getnewvnode_reserve+0x20 zfs_zget+0x1f
>>>>> zfs_dirent_lookup+0x16d zfs_dirlook+0x7f zfs_lookup+0x3e0
>>>>> zfs_freebsd_cachedlookup+0x74 vfs_cache_lookup+0xa7
>>>>> cache_fplookup_noentry+0x241 cache_fplookup+0x575 namei+0x1ea
>>>>> vn_open_cred+0x48d
>>>>>
>>>>> The stack trace was the same for several minutes.  System CPU time was
>>>>> increasing.
>>>>>
>>>>> Address vnode_init+0xc3 corresponds to the mtx_lock call here.
>>>>>
>>>>> vp->v_holdcnt = VHOLD_NO_SMR;
>>>>> vp->v_type = VNON;
>>>>> mtx_lock(&vnode_list_mtx);
>>>>> TAILQ_INSERT_BEFORE(vnode_list_free_marker, vp, v_vnodelist);
>>>>> mtx_unlock(&vnode_list_mtx);
>>>>> return (0);
>>>>>
>>>>> Address __mtx_lock_sleep+0xf8 is the instruction after a call to
>>>>> lock_delay.
>>>>>
>>>>> ps says the command line was
>>>>>
>>>>> /usr/bin/awk -f /usr/bin/awk
>>>>> old-default/2023-10-31_08h21m03s/.poudriere.builders
>>>>> old-default/2023-10-31_08h21m03s/.poudriere.buildname ...
>>>>>
>>>>> with the full list of input files exceeding the ~2KB command line
>>>>> length limit of ps.  "/usr/bin/awk" is probably not the real second
>>>>> argument.  It would cause an immediate syntax error.
>>>>>
>>>>> The hang resolved within a few more minutes and poudriere is
>>>>> continuing happily.
>>>>>
>>>>> I have never seen such behavior before.  Code in vfs_subr.c tries not
>>>>> to hold the vnode_list_mtx lock for too long.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>>
>>>>> FreeBSD 13.2-STABLE up through commit b180f0040f95, 24 core 48 thread
>>>>> Zen 2, zfs pool on SSD connected via 12 Gbps SAS with cache on NVME,
>>>>> 160 GB RAM.
>>>>>
>>>>
>>>> what does "sysctl vfs.vnode" say
>>>>
>>>>
>>>> --
>>>> Mateusz Guzik <mjguzik gmail.com>
>>>
>>> With the builds now complete,
>>>
>>> vfs.vnode.vnlru.uma_reclaim_calls: 0
>>> vfs.vnode.vnlru.kicks: 0
>>> vfs.vnode.vnlru.max_free_per_call: 10000
>>> vfs.vnode.vnlru.failed_runs: 0
>>> vfs.vnode.vnlru.direct_recycles_free: 152871194
>>> vfs.vnode.vnlru.recycles_free: 77190881
>>> vfs.vnode.vnlru.recycles: 0
>>> vfs.vnode.stats.alloc_sleeps: 0
>>> vfs.vnode.stats.free: 1269702
>>> vfs.vnode.stats.skipped_requeues: 9375158
>>> vfs.vnode.stats.created: 235594976
>>> vfs.vnode.stats.count: 1407640
>>> vfs.vnode.param.wantfree: 681059
>>> vfs.vnode.param.limit: 2724237
>>>
>>> Uptime is 11 days, mostly with low I/O rates.
>>>
>>>
>>
>> please add the uma stats: "sysctl vm.uma.VNODE"
>>
>> --
>> Mateusz Guzik <mjguzik gmail.com>
>
> vm.uma.VNODE.stats.xdomain: 0
> vm.uma.VNODE.stats.fails: 0
> vm.uma.VNODE.stats.frees: 245129082
> vm.uma.VNODE.stats.allocs: 246536724
> vm.uma.VNODE.stats.current: 1407642
> vm.uma.VNODE.domain.0.timin: 215
> vm.uma.VNODE.domain.0.limin: 4613
> vm.uma.VNODE.domain.0.wss: 10080
> vm.uma.VNODE.domain.0.bimin: 391986
> vm.uma.VNODE.domain.0.imin: 391986
> vm.uma.VNODE.domain.0.imax: 391986
> vm.uma.VNODE.domain.0.nitems: 391986
> vm.uma.VNODE.limit.bucket_max: 18446744073709551615
> vm.uma.VNODE.limit.sleeps: 0
> vm.uma.VNODE.limit.sleepers: 0
> vm.uma.VNODE.limit.max_items: 0
> vm.uma.VNODE.limit.items: 0
> vm.uma.VNODE.keg.domain.0.free_slabs: 0
> vm.uma.VNODE.keg.domain.0.free_items: 324187
> vm.uma.VNODE.keg.domain.0.pages: 266134
> vm.uma.VNODE.keg.efficiency: 95
> vm.uma.VNODE.keg.reserve: 0
> vm.uma.VNODE.keg.align: 7
> vm.uma.VNODE.keg.ipers: 8
> vm.uma.VNODE.keg.ppera: 1
> vm.uma.VNODE.keg.rsize: 488
> vm.uma.VNODE.keg.name: VNODE
> vm.uma.VNODE.bucket_size_max: 254
> vm.uma.VNODE.bucket_size: 88
> vm.uma.VNODE.flags: 0xd0000<SMR,FIRSTTOUCH>
> vm.uma.VNODE.size: 488
>
>
>

Ok, this is mostly what I was looking for:
> vm.uma.VNODE.bucket_size: 88

That and vnlru being active.

That codepath which appeared stalled may very well have been making
progress. There is an API deficiency in UMA where it calls the
constructor/destructor for each object separately, meaning in this case
you got 88 back-to-back acquisitions of vnode_list_mtx just to fill one
bucket. Throw other CPUs into the mix which try to do the same thing
(or relinquish vnode memory), plus vnlru itself, and the cost of
waiting on the lock gets multiplied big time.

Now, vnlru scans happen more frequently than they need to and may hold
the lock for longer than justified, but the key point is that this
problem is artificially exacerbated by a factor of 88 for each CPU
participating. I'm going to hack something up about it.

-- 
Mateusz Guzik <mjguzik gmail.com>