Re: Did something change with ZFS and vnode caching?

From: Garrett Wollman <wollman_at_bimajority.org>
Date: Sun, 10 Sep 2023 02:28:04 UTC
<<On Fri, 1 Sep 2023 01:04:56 +0200, Mateusz Guzik <mjguzik@gmail.com> said:

> zfs lock arrays are a known problem, bumping them is definitely an option.

This is the thing I tried next.  It took a few attempts (mostly, I
think, due to my own errors), but I'm now running with 512 locks
(instead of 64) and plan to deploy 1024 soon.  The results are
substantial: while we still see heavy load and kmem pressure during
the backup window, backups complete some 5 to 8 hours sooner, and
nfsd remains responsive.

> dtrace is rather funky with stack unwinding sometimes, hence possibly
> misplaced lock_delay.

> What you should do here is recompile the kernel with LOCK_PROFILING.

> Then:
> sysctl debug.lock.prof.contested_only=1
> sysctl debug.lock.prof.enable=1

> And finally sysctl debug.lock.prof.stats > out.lockprof

I do not know if I will get around to doing this, since my users have
a limited tolerance for outages and I'm probably nearing the end of
it, but it does seem the likely next step if we continue to have
problems.
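
For the archives: the missing piece above is the kernel option
itself, so the whole recipe would presumably be something like the
following -- untested here, and the reset/disable steps are just my
additions to what Mateusz gave:

    # kernel config; rebuild and reboot
    options         LOCK_PROFILING

    # at the start of a quiet window
    sysctl debug.lock.prof.reset=1
    sysctl debug.lock.prof.contested_only=1
    sysctl debug.lock.prof.enable=1
    # ... let a backup window run ...
    sysctl debug.lock.prof.enable=0
    sysctl debug.lock.prof.stats > out.lockprof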

Last night I was able to get a new dtrace capture, and I made a flame
graph *excluding* stacks involving sleepq_catch_signals (which
indicate threads that are idle).  I redid the previous flame graph
with a similar filter.  Unfortunately, this cannot be an entirely
apples-to-apples comparison, because the traces ran for different
lengths of time and the NFS clients do different (unpredictable) work
every night.  Last night's trace was taken during a period when four
backup processes were running simultaneously, each with up to 110
threads, but it was a much shorter capture overall than the previous
one.

Last week: <https://people.csail.mit.edu/wollman/dtrace-both-2r.svg>
Yesterday: <https://people.csail.mit.edu/wollman/dtrace-both-11r.svg>
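
For anyone who wants to reproduce the filtering: assuming the raw
capture is the usual dtrace stack() aggregation output, dropping the
idle stacks is just a grep over the collapsed stacks before they go
to flamegraph.pl (file names hypothetical; stackcollapse.pl and
flamegraph.pl are from Brendan Gregg's FlameGraph scripts):

    stackcollapse.pl raw.stacks | grep -v sleepq_catch_signals \
        | flamegraph.pl > filtered.svg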

The one thing that stands out to me is that _sx_xlock_hard *barely*
shows up in yesterday's graph -- it's there, but you have to know
where to look for it.  On the other hand, lock_delay is still there,
still missing its immediate caller's stack frame, and the vnode list
mutex is still obviously quite contended.  The fact that
__mtx_lock_hard is now a relatively larger fraction of zfs_zget
suggests that increasing the number of ZFS object locks has
substantially reduced the amount of self-contention the backup client
creates during its scan.

Tonight, I took a similar trace on a stock 13.2-RELEASE system:
<https://people.csail.mit.edu/wollman/dtrace-14.svg>.

Note that this system has very different activity patterns: it is
much newer and has higher capacity, but with a different
(capacity-optimized) zpool setup; at times there were as many as
eight backup processes running, and on this machine it takes about 21
hours to complete nightly incrementals.  What stands out, aside from
the additional time spent waiting for I/O to complete, is the
appearance of rms_rlock_fallback.  This comes from the ZFS_ENTER
macro, which is invoked at ZFS vnode entry points to interlock with
unmount operations.
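
If that turns out to matter, a quick first look at where the
fallbacks come from could be something like the following (assuming
the symbol is not inlined, so the fbt probe exists; output file name
hypothetical):

    dtrace -n 'fbt::rms_rlock_fallback:entry { @[stack()] = count(); }' \
        -o rms-fallback.stacks

which should confirm whether the ZFS_ENTER path really is the
dominant caller.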

Once I have the new kernel deployed on this server (and 15 others)
I'll be able to collect more data and see if it's worth investigating
those lock_delay() stacks.

-GAWollman