[Bug 275594] High CPU usage by arc_prune; analysis and fix

From: <bugzilla-noreply_at_freebsd.org>
Date: Fri, 05 Jan 2024 17:27:20 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #22 from Seigo Tanimura <seigo.tanimura@gmail.com> ---
(In reply to Seigo Tanimura from comment #18)

> It may be good to account the number of the ZFS vnodes not in use.  Before starting an ARC pruning, we can check that count and defer pruning if that is too low.
> (snip)
> Figure out the requirement and design of the accounting above.

Done.


* Sources on GitHub:

- Repo
  - https://github.com/altimeter-130ft/freebsd-freebsd-src
- Branches
  - Fix only
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-fix
  - Fix and counters
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters


* New ZFS vnode / znode counters for ARC pruning

- Total ZFS vnodes
  - sysctl(3): vfs.zfs.znode.count
    - Counter variable: zfs_znode_count

- ZFS vnodes in use (v_usecount > 0)
  - sysctl(3): vfs.zfs.znode.inuse
    - Counter variable: zfs_znode_inuse_count

- ARC-prunable ZFS vnodes
  - sysctl(3): vfs.zfs.znode.prunable
    - Formula: zfs_znode_count - zfs_znode_inuse_count

- ARC pruning requests
  - sysctl(3): vfs.zfs.znode.pruning_requested
    - Counter variable: zfs_znode_pruning_requested

- Skipped ARC pruning requests
  - sysctl(3): vfs.zfs.znode.pruning_skipped
    - Counter variable: zfs_znode_pruning_skipped
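
For illustration only, a minimal sketch of how counters of this shape could be
exported under vfs.zfs.znode via sysctl(9).  This is not the code on the
branches above; the handler name and the plain uint64_t variables are
assumptions made for this example (the actual patch may use counter(9) or the
OpenZFS parameter macros instead).

    /* Sketch only: exporting the znode counters under vfs.zfs.znode. */
    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    static uint64_t zfs_znode_count;        /* total ZFS vnodes */
    static uint64_t zfs_znode_inuse_count;  /* ZFS vnodes with v_usecount > 0 */

    SYSCTL_DECL(_vfs_zfs);
    SYSCTL_NODE(_vfs_zfs, OID_AUTO, znode, CTLFLAG_RW, 0,
        "ZFS znode counters");
    SYSCTL_U64(_vfs_zfs_znode, OID_AUTO, count, CTLFLAG_RD,
        &zfs_znode_count, 0, "Total ZFS vnodes");
    SYSCTL_U64(_vfs_zfs_znode, OID_AUTO, inuse, CTLFLAG_RD,
        &zfs_znode_inuse_count, 0, "ZFS vnodes in use");

    /* vfs.zfs.znode.prunable is derived, not stored. */
    static int
    zfs_znode_prunable_sysctl(SYSCTL_HANDLER_ARGS)
    {
        uint64_t prunable;

        prunable = zfs_znode_count - zfs_znode_inuse_count;
        return (sysctl_handle_64(oidp, &prunable, 0, req));
    }
    SYSCTL_PROC(_vfs_zfs_znode, OID_AUTO, prunable,
        CTLTYPE_U64 | CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, 0,
        zfs_znode_prunable_sysctl, "QU", "ARC-prunable ZFS vnodes");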


* Design of counter operations

- Total ZFS vnodes (zfs_znode_count)
  - Increment upon creating a new ZFS vnode.
    - zfs_znode_alloc()
  - Decrement upon reclaiming a ZFS vnode.
    - zfs_freebsd_reclaim()

- ZFS vnodes in use ("v_usecount > 0")
  - Export to the VFS via mnt_fsvninusep of struct mount.
    - Both the VFS and ZFS have to operate on the counter.
    - struct vnode cannot be expanded anymore.
  - Increment upon inserting a new ZFS vnode into a ZFS mountpoint.
    - zfs_mknode()
    - zfs_zget()
  - Increment upon vget() and the like.
    - vget_finish_ref()
  - Decrement upon vput() and the like.
    - vput_final()
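
For illustration, a rough sketch of the VFS-side hook, assuming mnt_fsvninusep
is a uint64_t pointer added to struct mount by the patch (the field only exists
on the branches above) and that the helper is called from vget_finish_ref() /
vput_final() on the 0 <-> 1 v_usecount transitions:

    /* Sketch only: adjust the per-mount in-use counter when v_usecount
     * crosses zero.  Filesystems that leave mnt_fsvninusep NULL (everything
     * but ZFS here) are unaffected. */
    #include <sys/param.h>
    #include <sys/mount.h>
    #include <sys/vnode.h>
    #include <machine/atomic.h>

    static inline void
    vfs_fsvninuse_adjust(struct vnode *vp, int ref)
    {
        struct mount *mp;

        mp = vp->v_mount;
        if (mp == NULL || mp->mnt_fsvninusep == NULL)
            return;
        if (ref)
            atomic_add_64(mp->mnt_fsvninusep, 1);       /* 0 -> 1 */
        else
            atomic_subtract_64(mp->mnt_fsvninusep, 1);  /* 1 -> 0 */
    }

On the ZFS side, per the list above, zfs_znode_count would be bumped in
zfs_znode_alloc() and dropped in zfs_freebsd_reclaim(), and the in-use counter
seeded from zfs_mknode() / zfs_zget().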


* Design of ARC pruning regulation

- Required condition
  - zfs_prune_task(uint64_t nr_to_scan, void *arg __unused)
    - Condition: (zfs_znode_count - zfs_znode_inuse_count) * dn / zfs_znode_inuse_count >= nr_to_scan
      - dn: the number of the dnodes.
      - Scale the prunable znodes to the dnodes linearly because a znode may span multiple dnodes.
    - Call vnlru_free_vfsops() only when the condition above is satisfied. (A sketch of this check follows this list.)

- Other changes on ARC pruning
  - Refactor the extra vnode recycling into two toggleable features.
    - sysctl(3): vfs.vnode.vnlru.recycle_bufs_pages
      - Recycle the vnodes with clean buffers and clean/dirty VM pages.
    - sysctl(3): vfs.vnode.vnlru.recycle_nc_src
      - Recycle the vnodes acting as namecache sources.
    - Both are enabled by default.
  - Retire the interval between the ARC pruning passes introduced by the initial fix.
    - The ARC pruning regulation above is more precise.
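
For illustration, a sketch of how the gate could look in zfs_prune_task(),
using the counter names above.  zfs_active_dnode_count() is a hypothetical
placeholder for however the patch obtains the dnode count (dn); the skip
accounting details are likewise assumptions, not the actual diff.

    /* Sketch only: gate the ARC prune request on the scaled prunable
     * znode estimate. */
    static void
    zfs_prune_task(uint64_t nr_to_scan, void *arg __unused)
    {
        uint64_t total, inuse, prunable, dn;

        atomic_add_64(&zfs_znode_pruning_requested, 1);

        total = zfs_znode_count;
        inuse = zfs_znode_inuse_count;
        dn = zfs_active_dnode_count();  /* hypothetical helper */
        prunable = (total > inuse) ? total - inuse : 0;

        /*
         * A znode may span multiple dnodes, so scale the prunable znodes
         * to dnodes linearly.  Skip the prune if the scaled estimate
         * cannot cover nr_to_scan.  (With no znodes in use, everything
         * is prunable, so do not skip.)
         */
        if (inuse != 0 && (prunable * dn) / inuse < nr_to_scan) {
            atomic_add_64(&zfs_znode_pruning_skipped, 1);
            return;
        }

        if (nr_to_scan > INT_MAX)
            nr_to_scan = INT_MAX;
        vnlru_free_vfsops(nr_to_scan, &zfs_vfsops);
    }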


* Test results

Test Summary:

- Date: 03 Jan 2024 01:30Z - 03 Jan 2024 08:13Z
- Build time: 06:43:25 (317 pkgs / hr)
- Failed port(s): 1
- Setup
  - sysctl(3)
    - vfs.vnode.vnlru.max_free_per_call: 4000000
      - == vfs.vnode.param.limit.
    - vfs.zfs.arc_max: 4294967296
      - 4GB.
    - vfs.zfs.arc.dnode_limit=8080000000
      - 2.5 * (vfs.vnode.param.limit) * sizeof(dnode_t)
        - 2.5: experimental average dnodes per znode (2.0) + margin (0.5)
  - poudriere-bulk(8)
    - USE_TMPFS="wrkdir data localbase"
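
For reference, the vfs.zfs.arc.dnode_limit value above follows from that
formula, assuming sizeof(dnode_t) == 808 bytes (the size the chosen figure
implies):

    2.5 * 4000000 (vfs.vnode.param.limit) * 808 (sizeof(dnode_t))
      = 10000000 * 808
      = 8080000000 bytes (~7.5 GiB)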


Result Chart Archive (1 / 2): (poudriere-bulk-2024-01-03_10h30m00s.7z)

- zfs-znodes-and-dnodes.png
  - The counts of the ZFS znodes and dnodes.
- zfs-dnodes-and-freeing-activity.png
  - The freeing activity of the ZFS znodes and dnodes.
- vnode-free-calls.png
  - The calls to the ZFS vnode freeing functions.


Result Chart Archive (2 / 2): (poudriere-bulk-2024-01-03_10h30m00s-zfs-arc.7z)

- zfs-arc/zfs-arc-meta.png
  - The balancing of the ZFS ARC metadata and data.
- zfs-arc/zfs-arc-(A)-(B)-(C).png
  - The ZFS ARC stats.
    (A): MRU (mru) or MFU (mfu).
    (B): Metadata (metadata) or data (data); the "ghost-" prefix denotes the evicted cache.
    (C): Size (size) or hits (hits); the hits report the hit sizes, not the hit counts.


Finding Summary:

- The ZFS ARC meta was driven down strongly, despite the high metadata demand
in the ZFS ARC.
  - Both are designed behaviours.
- The low ZFS ARC meta value triggered the aggressive ARC pruning.
  - Again, this is as designed.
- The ARC pruning regulation throttled the load as expected.
  - Virtually no load happened when only one or two builders were running.
  - The fruitless pruning was eliminated.


Analysis in Detail:

- ZFS znode and dnode counts (zfs-znodes-and-dnodes.png)

The green and blue traces show the counts of the total and in-use ZFS znodes,
respectively.  The gap between these lines denotes the prunable ZFS znodes, also
shown as the red trace.  These traces show that there are almost no prunable
znodes, so performing the ARC pruning frequently is useless.

- ZFS znode and dnode freeing activity (zfs-dnodes-and-freeing-activity.png)

The red trace is the count of the znodes freed by the ARC pruning.  It worked
during the first hour, when the build processed many small ports and the builder
cleanup released many znodes.  After that, the build moved on to the big,
long-running ports (lang/rust, lang/gcc12, ...) and the znode release ceased.  A
few occasional bumps happened upon the builder cleanups after such ports
finished building.

- Vnode free calls (vnode-free-calls.png)

The non-zero traces are vfs.zfs.znode.pruning_requested and
vfs.zfs.znode.pruning_skipped, which overlap almost completely.  After 02:45Z,
the medians of vfs.vnode.free.* showed no counts at all.  This means the ARC
pruning was either not performed at all or only exceptionally.

The magnitude of vfs.zfs.znode.pruning_requested shows the high pressure of the
ARC pruning from ZFS.  The top peak at 02:20Z is ~1.8M / 5 mins == 6K / sec.
The ARC pruning requests definitely need solid throttling because a typical
ARC pruning pass takes up to ~0.2 seconds when there are actually no prunable
vnodes. [1]  Even under a steady light load in 06:25Z - 08:05Z (working on
emulators/mame, where ccache somehow does not work),
vfs.zfs.znode.pruning_requested recorded ~50K / 5 mins =~ 167 / sec.

[1] Observed under my first fix, where an interval of 1 second was enforced
between the ARC pruning passes.  The max ARC pruning rate was ~0.8 / sec.

- The ZFS ARC stats (zfs-arc/zfs-arc-*.png)

The ZFS ARC stats show how the high pressure upon the ARC pruning happened.

The ZFS ARC stats of the sizes (zfs-arc/zfs-arc-*-size.png) show the following
properties:

  - Except for the first hour, there were almost no evictable sizes.
  - The metadata stayed firmly in place while the data was driven out.
  - The trace of the ZFS ARC MRU metadata size (zfs-arc-mru-metadata-size.png)
is similar to that of the znode and dnode counts.

From these properties, I suspect that the in-use znodes and dnodes dominated
the ARC.  Although not confirmed by a code walk, it makes sense to keep such
metadata in memory because it is likely to be updated often.

Another parameter affecting the ZFS ARC is the balancing of the metadata and
data.  The ZFS ARC meta (zfs-arc-meta.png) is the auto-tuned target ratio of
the metadata size, stored as a 32-bit fixed-point value.  Since vfs.zfs.arc_max
is 4GB (2^32 bytes) in my setup, this value can be read directly as the
metadata size target in bytes.

The ZFS ARC meta is tuned by the ghost-hit sizes (zfs-arc-*-ghost-*-hits.png).
It is designed to favour whichever of the metadata or data has the larger
ghost-hit sizes, so that further caching reduces those ghost hits.  As the data
dominated the ghost-hit sizes so heavily, the ZFS ARC meta was pushed very low;
the minimum was ~197M at 05:20Z, and it mostly stayed below 1G, the default
(1/4 of vfs.zfs.arc_max), except for the first hour.  The low metadata size
target, in conjunction with the high demand of the unevictable metadata, then
caused the aggressive ARC pruning as implemented in arc_evict().
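
As a worked example of reading that value from the numbers quoted above: with
vfs.zfs.arc_max = 4 GiB = 2^32 bytes, the fixed-point ARC meta value divided by
2^32 gives the metadata fraction, which is numerically the same as the byte
target.

    minimum observed:  ~197M  ->  ~197M / 4G  ~  5 % of the ARC for metadata
    default:           1/4 of arc_max = 1G   ->  25 % of the ARC for metadata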
