[Bug 275594] High CPU usage by arc_prune; analysis and fix

From: <bugzilla-noreply_at_freebsd.org>
Date: Mon, 11 Dec 2023 08:45:26 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #10 from Seigo Tanimura <seigo.tanimura@gmail.com> ---
(In reply to Mark Johnston from comment #9)

> vnodes live on a global list, chained by v_vnodelist, and this list appears to be used purely for reclamation.

The free vnodes are indeed chained to vnode_list in sys/kern/vfs_subr.c, but
this "free" means "not opened by any user process," ie vp->v_usecount == 0.

Besides the user processes, the kernel may use a "free" vnode for its own
purposes.  In such a case, the kernel "holds" the vnode by vhold(9), making
vp->v_holdcnt > 0.  A vnode held by the kernel in this way cannot be recycled
even if it is not opened by any user process.

vnlru_free_impl() checks whether the vnode in question is held, and skips
recycling if so.  In the tests so far, I have seen that vnlru_free_impl()
tends to skip many vnodes, especially during the late phase of "poudriere
bulk".  The results and findings are shown at the end of this comment.
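
For reference, the checks in question look roughly like this; a simplified
sketch of the loop in vnlru_free_impl() (sys/kern/vfs_subr.c), not a verbatim
copy:

TAILQ_FOREACH_FROM(vp, &vnode_list, v_vnodelist) {
        if (vp->v_type == VMARKER)
                continue;       /* another thread's marker vnode */
        /*
         * "Phase 2" skip: the vnode is held (by a VM object with
         * resident pages, the name cache, an ongoing vgone(), ...),
         * so it cannot be recycled even though nobody has it open.
         */
        if (vp->v_holdcnt > 0)
                continue;
        /*
         * "Phase 3" skip: the caller (ZFS via vnlru_free_vfsops())
         * asked for the vnodes of one filesystem only, and this
         * vnode belongs to another one (eg tmpfs).
         */
        if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
            mp->mnt_op != mnt_op)
                continue;
        /* Otherwise try to recycle the vnode; this may block. */
        ...
}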

-----

> If arc_prune() is spending most of its time reclaiming tmpfs vnodes, then it does nothing to address its targets; it may as well do nothing.

Again, the mixed use of tmpfs and ZFS has actually turned out to be a rather
minor problem.  Please refer to my findings below.

Also, there are some easier workarounds that can be tried first, if this is
really the issue:

- Perform the test of vp->v_mount->mnt_op before the test of vp->v_holdcnt
(see the sketch after this list).  This should work for now, if somewhat
crudely, because ZFS is currently the only filesystem that calls
vnlru_free_vfsops() with a valid mnt_op.
- After a preconfigured number of consecutive skips, move the marker vnode to
the restart point, release vnode_list_mtx and yield the CPU.  This is
effectively what already happens when a vnode is recycled, since recycling
may block.
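
A minimal sketch of the first workaround; only the order of the two checks
shown in the sketch above changes, so that a ZFS-initiated call skips the
foreign (eg tmpfs) vnodes before ever looking at their hold counts:

/* Sketch only: test the filesystem match ("phase 3") first... */
if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
    mp->mnt_op != mnt_op)
        continue;
/* ...and the hold count ("phase 2") second. */
if (vp->v_holdcnt > 0)
        continue;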

> Suppose that arc_prune is disabled outright.  How does your test fare?

Difficult to tell.  I am sure the ARC size would keep increasing at first,
but I cannot tell whether it would eventually reach an equilibrium thanks to
the builder cleanup, or keep rising.

-----

In order to investigate the details of the held vnodes found in
vnlru_free_impl(), I have conducted another test with some additional
counters.

Source on GitHub:
- Repo:
https://github.com/altimeter-130ft/freebsd-freebsd-src/tree/release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters
- Branch:
release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters

Test setup:
The same as "Ongoing test" in bug #275594, comment #6.

- vfs.vnode.vnlru.max_free_per_call: 4000000 (== vfs.vnode.vnlru.max_free_per_call)
- vfs.zfs.arc.prune_interval: 1000 (my fix enabled)

Build time:
06:32:57 (325 pkgs / hr)

Counters after completing the build, with some remarks:
# The iteration attempts in vnlru_free_impl().
# This includes the retry from the head of vnode_list.
vfs.vnode.free.free_attempt: 29695926809

# The number of vnodes recycled successfully, including vtryrecycle().
vfs.vnode.free.free_success: 30841748

# The number of iteration skips due to a held vnode. ("phase 2" hereafter)
vfs.vnode.free.free_phase2_retry: 11909948307

# The number of phase 2 skips upon VREG (regular file) vnodes.
vfs.vnode.free.free_phase2_retry_reg: 7877197761

# The number of phase 2 skips upon VBAD (being recycled) vnodes.
vfs.vnode.free.free_phase2_retry_bad: 3101137010

# The number of phase 2 skips upon VDIR (directory) vnodes.
vfs.vnode.free.free_phase2_retry_dir: 899106296

# The number of phase 2 skips upon VNON (being created) vnodes.
vfs.vnode.free.free_phase2_retry_non: 2046379

# The number of phase 2 skips upon doomed (being destroyed) vnodes.
vfs.vnode.free.free_phase2_retry_doomed: 3101137196

# The number of iteration skips due to a filesystem mismatch.
# ("phase 3" hereafter)
vfs.vnode.free.free_phase3_retry: 17755077891
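
As an aside on the instrumentation: the counters above follow the usual
counter(9) plus sysctl(9) pattern.  A minimal sketch of one of them is shown
below; the real definitions are in the branch linked above, and the
vfs.vnode.free sysctl node itself comes from the patch.

/* Per-CPU counter and its sysctl; a sketch, not the exact patch code. */
static counter_u64_t free_phase2_retry;

SYSCTL_NODE(_vfs_vnode, OID_AUTO, free, CTLFLAG_RW | CTLFLAG_MPSAFE, NULL,
    "vnlru_free_impl() statistics");
SYSCTL_COUNTER_U64(_vfs_vnode_free, OID_AUTO, free_phase2_retry, CTLFLAG_RD,
    &free_phase2_retry,
    "Iteration skips due to a held vnode (phase 2)");

/* In vntblinit() or a similar boot-time init function: */
free_phase2_retry = counter_u64_alloc(M_WAITOK);

/* In the vnlru_free_impl() loop, at the phase 2 skip: */
if (vp->v_holdcnt > 0) {
        counter_u64_add(free_phase2_retry, 1);
        continue;
}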

Analysis and Findings:
Out of ~30G iteration attempts in vnlru_free_impl(), ~12G failed in phase 2.
(The phase 3 failures amount to ~18G, but there are some workaround ideas for
them, shown above.)

Among the phase 2 failures, the most dominant vnode type is VREG.  For this
type, I suspect the resident VM pages alive in the kernel; a VM object holds
its backing vnode as long as the object has at least one resident page.
Please refer to vm_page_insert_after() and vm_page_insert_radixdone() for the
implementation.
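
A rough paraphrase of that behaviour, assuming the usual handling of
vnode-backed objects in vm/vm_page.c (simplified, not the exact code):

/* On inserting a page, after resident_page_count has been bumped: */
if (object->type == OBJT_VNODE && object->resident_page_count == 1)
        vhold(object->handle);          /* object->handle is the vnode */

/* On removing a page, after resident_page_count has been decremented: */
if (object->type == OBJT_VNODE && object->resident_page_count == 0)
        vdrop(object->handle);

In other words, a VREG vnode keeps vp->v_holdcnt > 0 for as long as any of
its pages stay resident.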

Technically, such vnodes can be recycled as long as the prerequisites checked
in vtryrecycle() are met under sufficient locks, and those prerequisites do
not include the resident VM pages.  vnode_destroy_vobject(), called from
vgonel(), takes care of those pages.  I suppose we would have to do this work
if more recycling is required of vnlru_free_impl(), maybe during the retry
after reaching the end of vnode_list.

The further fix above assumes that ZFS performs the appropriate work to
reduce the ARC size upon reclaiming a ZFS vnode.
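
Purely as an illustration of where such a change could hook in, with a
hypothetical retry_pass flag (an idea only, not tested code):

/* Hypothetical relaxation of the phase 2 skip on the retry pass. */
if (vp->v_holdcnt > 0) {
        /*
         * Let a held VREG vnode through only on the retry pass and
         * only when it still has a VM object; vtryrecycle() ->
         * vgonel() -> vnode_destroy_vobject() would then flush and
         * free the resident pages holding the vnode.
         */
        if (!retry_pass || vp->v_type != VREG || vp->v_object == NULL)
                continue;
}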

The remaining cases are either difficult or impossible to improve further.

A VDIR vnode is held by the name cache to improve the path resolution
performance, both forward and backward.  While the vnodes of this kind could
be reclaimed somehow, a significant performance penalty on path resolution
is expected.

VBAD and VNON are actually states rather than types of vnodes.  Neither
state is eligible for recycling by design.
