[Bug 275594] High CPU usage by arc_prune; analysis and fix

From: <bugzilla-noreply_at_freebsd.org>
Date: Thu, 14 Dec 2023 06:58:31 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #12 from Seigo Tanimura <seigo.tanimura@gmail.com> ---
(In reply to Seigo Tanimura from comment #10)

I have added the fix to enable the extra vnode recycling and tested it with the
same setup.

Source on GitHub:
- Repo: https://github.com/altimeter-130ft/freebsd-freebsd-src
- Branches
  - Fix: release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-fix
  - Counters atop Fix:
    release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters

Test setup:
The same as "Ongoing test" in bug #275594, comment #6.

- vfs.vnode.vnlru.max_free_per_call: 4000000 (==
vfs.vnode.vnlru.max_free_per_call)
- vfs.zfs.arc.prune_interval: 1000 (my fix for arc_prune interval enabled)
- vfs.vnode.vnlru.extra_recycle: 1 (extra vnode recycle fix enabled)
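
For reference, a minimal userland sketch (not part of either branch) that dumps
the three tunables above via sysctlbyname(3).  Note that
vfs.zfs.arc.prune_interval and vfs.vnode.vnlru.extra_recycle exist only on the
patched kernel.

/*
 * Print the tunables used in this test.  Handles both 32-bit and 64-bit
 * sysctl OIDs; the values are only read here.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    const char *oids[] = {
        "vfs.vnode.vnlru.max_free_per_call",
        "vfs.zfs.arc.prune_interval",
        "vfs.vnode.vnlru.extra_recycle",
    };

    for (size_t i = 0; i < sizeof(oids) / sizeof(oids[0]); i++) {
        union {
            int32_t v32;
            int64_t v64;
        } val = { .v64 = 0 };
        size_t len = sizeof(val);

        if (sysctlbyname(oids[i], &val, &len, NULL, 0) == -1) {
            perror(oids[i]);
            continue;
        }
        if (len == sizeof(int32_t))
            printf("%s: %d\n", oids[i], (int)val.v32);
        else
            printf("%s: %jd\n", oids[i], (intmax_t)val.v64);
    }
    return (0);
}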

Build time:
06:50:05 (312 pkgs / hr)

Counters after completing the build, with some remarks:

# The number of iteration attempts in vnlru_free_impl().
# This includes the retries from the head of vnode_list.
# (A sketch of how such a counter can be exported follows the counter list.)
vfs.vnode.free.free_attempt: 33934506866

# The number of vnodes recycled successfully, including by vtryrecycle().
vfs.vnode.free.free_success: 42945537

# The number of successful phase 2 recycles on VREG (regular file) vnodes.
# - cleanbuf_vmpage_only: vnodes held only by clean bufs and resident VM
#   pages.
# - cleanbuf_only: vnodes held only by clean bufs.
vfs.vnode.free.free_phase2_retry_reg_cleanbuf_vmpage_only: 845659
vfs.vnode.free.free_phase2_retry_reg_cleanbuf_only: 3

# The number of iteration skips due to a held vnode ("phase 2" hereafter).
# NB: the successful phase 2 recycles are not included.
vfs.vnode.free.free_phase2_retry: 8923850577

# The number of phase 2 skips on VREG vnodes.
vfs.vnode.free.free_phase2_retry_reg: 8085735334

# The number of phase 2 skips on VREG vnodes that are in use.
# Almost all of the phase 2 skips on VREG vnodes fell into this category.
vfs.vnode.free.free_phase2_retry_reg_inuse: 8085733060

# The number of successful phase 2 recycles on VDIR (directory) vnodes.
# - free_phase2_retry_dir_nc_src_only: vnodes held only by namecache entries.
vfs.vnode.free.free_phase2_retry_dir_nc_src_only: 2234194

# The number of phase 2 skips on VDIR vnodes.
vfs.vnode.free.free_phase2_retry_dir: 834902819

# The number of phase 2 skips on VDIR vnodes that are in use.
# Almost all of the phase 2 skips on VDIR vnodes fell into this category.
vfs.vnode.free.free_phase2_retry_dir_inuse: 834902780
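
For context on how these counters are gathered: a minimal sketch, assuming
counter(9) and the existing vfs.vnode sysctl node in kern/vfs_subr.c, of how a
counter such as vfs.vnode.free.free_attempt could be exported.  The actual
counters branch linked above may implement this differently.

/*
 * Sketch only; intended to sit in kern/vfs_subr.c, which already includes
 * <sys/counter.h> and <sys/sysctl.h>.  The node and counter names mirror
 * the sysctl OIDs above.
 */
static SYSCTL_NODE(_vfs_vnode, OID_AUTO, free, CTLFLAG_RD | CTLFLAG_MPSAFE,
    NULL, "vnode recycling statistics");

static counter_u64_t free_attempt;
SYSCTL_COUNTER_U64(_vfs_vnode_free, OID_AUTO, free_attempt, CTLFLAG_RD,
    &free_attempt, "Iteration attempts in vnlru_free_impl()");

/* Allocated once at boot, e.g. in vntblinit(): */
/*     free_attempt = counter_u64_alloc(M_WAITOK); */

/* Bumped on each iteration attempt in vnlru_free_impl(): */
/*     counter_u64_add(free_attempt, 1); */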

Other findings:

- The behaviour of the arc_prune thread CPU usage was mostly the same.
  - The peak dropped by only a few percent, so this is not likely to be the
    essential fix.

- The namecache hit ratio degraded by about 10 - 20%.
  - Possibly the recycled vnodes, especially the directories, are being looked
    up again.

-----

The issue still essentially exists even with the extra vnode recycling.  Maybe
the root cause is in ZFS rather than the OS.

There are some suspicious findings about the in-memory dnode behaviour during
the tests so far:

- vfs.zfs.arc_max does not enforce the max size of
  kstat.zfs.misc.arcstats.dnode_size.
  - vfs.zfs.arc_max: 4GB
  - vfs.zfs.arc.dnode_limit_percent: 10 (default)
  - sizeof(dnode_t): 808 bytes
    - Found by "vmstat -z | grep dnode_t".
  - kstat.zfs.misc.arcstats.arc_dnode_limit: 400MB (the default,
    vfs.zfs.arc.dnode_limit_percent (10%) of vfs.zfs.arc_max)
    - ~495K dnodes.
  - kstat.zfs.misc.arcstats.dnode_size, max: ~1.8GB
    - ~2.2M dnodes (see the arithmetic sketch after this list).
    - Almost equal to the maximum observed number of vnodes.

- The dnode_t uma(9) zone has no limit configured.
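
To spell out the arithmetic behind the dnode figures above, a small userland
sketch follows.  It uses the rounded values from this comment (808 bytes per
dnode_t as reported by vmstat -z, a 400MB limit and a ~1.8GB peak), so the
results are approximations only.

/* Dnode counts implied by the ARC figures above. */
#include <stdio.h>

int
main(void)
{
    const double dnode_sz = 808.0;         /* bytes per dnode_t (vmstat -z) */
    const double dnode_limit = 400e6;      /* arcstats.arc_dnode_limit */
    const double dnode_size_peak = 1.8e9;  /* arcstats.dnode_size peak */

    printf("dnodes allowed by arc_dnode_limit: ~%.0fK\n",
        dnode_limit / dnode_sz / 1e3);          /* ~495K */
    printf("dnodes actually cached at the peak: ~%.1fM\n",
        dnode_size_peak / dnode_sz / 1e6);      /* ~2.2M */
    return (0);
}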

From the above, the number of in-memory dnodes looks like the bottleneck.
Maybe the essential solution is to configure vfs.zfs.arc.dnode_limit explicitly
so that ZFS can hold all the dnodes required by the application in memory.
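
As a sketch of that configuration, assuming vfs.zfs.arc.dnode_limit accepts a
byte count and is writable at run time, something like the following could
size and set it.  The target of 2.5M dnodes is only an assumed headroom above
the ~2.2M peak observed here; the same setting can also be applied with
sysctl(8) or /etc/sysctl.conf.

/* Size vfs.zfs.arc.dnode_limit from an assumed dnode working set.  Needs root. */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    const uint64_t dnode_sz = 808;             /* bytes per dnode_t (vmstat -z) */
    const uint64_t target_dnodes = 2500000;    /* assumed working set + headroom */
    uint64_t limit = dnode_sz * target_dnodes; /* ~2.0GB */

    if (sysctlbyname("vfs.zfs.arc.dnode_limit", NULL, NULL,
        &limit, sizeof(limit)) == -1) {
        perror("vfs.zfs.arc.dnode_limit");
        return (1);
    }
    printf("vfs.zfs.arc.dnode_limit set to %ju bytes\n", (uintmax_t)limit);
    return (0);
}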
