[Bug 275594] High CPU usage by arc_prune; analysis and fix

From: <bugzilla-noreply_at_freebsd.org>
Date: Mon, 25 Dec 2023 10:21:27 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #14 from Seigo Tanimura <seigo.tanimura@gmail.com> ---
(In reply to Seigo Tanimura from comment #12)

Thanks for your patience.  I have spent quite a few days setting up a more
precise and convenient measurement pipeline with Fluent Bit, Elasticsearch and
Kibana.  It is now working well enough to share the test results and my
analysis of them.

-----

Sources on GitHub:

- FreeBSD
  - Repo
    - https://github.com/altimeter-130ft/freebsd-freebsd-src
  - Branches
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-fix
          - No changes since the last test.
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters
          - Add vnode tag counting (disabled by default).
          - Refactor the ZFS ARC stats for a precise investigation.
          - These changes may be squashed or removed in the future.

- Fluent Bit
  - Repo
    - https://github.com/altimeter-130ft/fluent-fluent-bit
  - Branch
    - topic-plugin-in-sysctl
          - The input plugin for sysctl(3).
          - Other fixes are also included.
          - No documentation yet.
  - Note
    - "-DFLB_CORO_STACK_SIZE=1048576" recommended on cmake(1).
          - The default coroutine thread stack size is only a few kilobytes on
FreeBSD.

-----

Test Summary:

- Date: 21 Dec 2023 17:50Z - 22 Dec 2023 00:50Z
- Build time: 07:00:59 (304 pkgs / hr)
- Failed ports: 3
- Setup
  - vfs.vnode.vnlru.max_free_per_call: 4000000 (==
vfs.vnode.vnlru.max_free_per_call)
  - vfs.zfs.arc.prune_interval: 1000 (my fix for arc_prune interval enabled)
  - vfs.vnode.vnlru.extra_recycle: 1 (extra vnode recycle fix enabled)
  - vfs.zfs.arc.dnode_limit: 2684354560 (2.5G, larger than the maximum actual
value observed so far)

-----

Findings Summary:

- Vnode Behavior
  - The vnodes circulate between creation and pruning.
  - Almost all of the unrecyclable vnodes have one use count, which also
accounts for one extra hold count.

- ZFS Behavior
  - The direct trigger of the ARC pruning is a low ARC meta value.
  - There is a massive amount of ghost data hits.

-----

Finding Detail:

The discussion below refers to the charts in the archive attached as
poudriere-bulk-2023-12-22_02h55m25s.

The times on the chart horizontal axes are in UTC.

- Vnode Behavior
  - The vnodes circulate between creation and pruning.

* Charts: number-of-vnodes.png, zfs-vnode-free-calls.png, zfs-freed-vnodes.png.

Although zfs-freed-vnodes.png shows steady vnode recycling in some time
sections, the number of vnodes stays almost flat. (Typical case: 23:20Z -
00:40Z)  This means that the vnodes are created at almost the same rate as
they are recycled.

  - Almost all of the unrecyclable vnodes have one use count, which also
accounts for one extra hold count.

* Charts: zfs-vnode-recycle-phase2-reg-retries.png,
zfs-vnode-recycle-phase2-dir-retries.png.

The non-zero traces on the charts are the overlaps of:
  - free_phase2_retry_*: the skipped vnodes,
  - free_phase2_retry_*_inuse: the vnodes with the non-zero v_usecount, and
  - free_phase2_retry_*_unknown1: the vnodes with one unknown extra v_holdcnt.

A non-zero v_usecount is sufficient to explain the one "unknown" extra
v_holdcnt.  The current vnode recycling code does not work on vnodes with a
non-zero v_usecount, and holding a non-zero v_usecount implies holding one
v_holdcnt on FreeBSD.
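
For illustration, here is a minimal standalone sketch of that relationship.
The vnode_sketch_* names are mine, not FreeBSD's; the real counting lives in
sys/kern/vfs_subr.c with atomics and more states, but the invariant shown here
is the one relied on above: a non-zero use count accounts for exactly one hold.

#include <stdint.h>

/* Illustrative stand-ins for the real struct vnode counters. */
struct vnode_sketch {
	uint32_t v_usecount;	/* active users, e.g. open(2) file descriptors */
	uint32_t v_holdcnt;	/* holds keeping the vnode from being recycled */
};

/* Acquiring the first use reference also acquires one hold reference. */
static void
vnode_sketch_ref(struct vnode_sketch *vp)
{
	if (vp->v_usecount++ == 0)
		vp->v_holdcnt++;
}

/* Dropping the last use reference drops that hold reference as well. */
static void
vnode_sketch_rele(struct vnode_sketch *vp)
{
	if (--vp->v_usecount == 0)
		vp->v_holdcnt--;
}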

- ZFS Behavior
  - The direct trigger of the ARC pruning is a low ARC meta value.

* Charts: zfs-arc-meta.png, zfs-arc-dnode-size.png.

arc_evict() triggers the async ARC pruning when either of the following
conditions holds:

A) More than 3/4 of the wanted metadata size target in the ARC is not
evictable, or
B) The size allocated for the dnodes exceeds vfs.zfs.arc.dnode_limit.

B) is not the case in this test; the maximum ZFS ARC dnode size was ~1.6GB,
which was below vfs.zfs.arc.dnode_limit (2.5GB).
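
A minimal sketch of those two trigger conditions, with an illustrative function
name and parameters of my own (not the literal arc_evict() code in
module/zfs/arc.c):

#include <stdbool.h>
#include <stdint.h>

/*
 * A) More than 3/4 of the wanted metadata target is not evictable, or
 * B) the dnode allocation exceeds vfs.zfs.arc.dnode_limit.
 */
static bool
arc_should_prune(uint64_t meta_target, uint64_t meta_unevictable,
    uint64_t dnode_size, uint64_t dnode_limit)
{
	if (meta_unevictable > meta_target / 4 * 3)
		return (true);
	if (dnode_size > dnode_limit)
		return (true);
	return (false);
}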

The ZFS ARC meta value is the ratio of the wanted metadata size to the whole
ARC size, held as a 32-bit fixed-point fraction and initialized to 25%, which
scales up to 1G.  A higher (lower) value means more (less) ARC space for the
metadata.  The reason for the sharp drop of this value is discussed next.
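
As a worked example of that scaling, assuming 100% corresponds to 1ULL << 32
so that the initial 25% comes out as 1G (the scale factor is my reading of the
description above, not a figure taken from the OpenZFS sources):

#include <stdint.h>
#include <stdio.h>

#define	ARC_META_SCALE	(1ULL << 32)	/* assumed: 100% of the ARC */

int
main(void)
{
	uint64_t arc_meta = ARC_META_SCALE / 4;	/* initial 25% == 1G */

	printf("arc_meta = %llu (%.1f%% of the ARC wanted for metadata)\n",
	    (unsigned long long)arc_meta,
	    100.0 * (double)arc_meta / (double)ARC_META_SCALE);
	return (0);
}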

  - There is a massive amount of ghost data hits.

* Charts: zfs-arc-mru-ghost-data-hits.png, zfs-arc-mfu-ghost-data-hits.png,
zfs-arc-mru-ghost-metadata-hits.png, zfs-arc-mfu-ghost-metadata-hits.png.

Please pay attention to the vertical axes of the charts:
- In ZFS, the "hits" here are sizes, not counts.
- The scales of the data hits are about 10 times larger than those of the
metadata hits.

The ZFS ARC meta value is automatically tuned by the sizes of the ghost hits
passed to arc_evict_adj(), which is called from arc_evict().  The idea is to
favour whichever of the data or metadata is experiencing more ghost hits.
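
A minimal sketch of that feedback idea follows; the function name, the 1/32
step and the scale are my own assumptions for illustration, not the actual
arc_evict_adj() logic in module/zfs/arc.c:

#include <stdint.h>

#define	ARC_META_SCALE	(1ULL << 32)	/* assumed: 100% of the ARC */

/*
 * Ghost data hits push the metadata fraction down (favour data), ghost
 * metadata hits push it up (favour metadata), proportionally to how
 * lopsided the ghost hits are.
 */
static uint64_t
arc_meta_adjust(uint64_t meta_frac, uint64_t ghost_data_hits,
    uint64_t ghost_meta_hits)
{
	uint64_t total = ghost_data_hits + ghost_meta_hits;
	double imbalance;

	if (total == 0)
		return (meta_frac);

	if (ghost_data_hits > ghost_meta_hits) {
		/* Data suffers more ghost hits: shrink the metadata share. */
		imbalance = (double)(ghost_data_hits - ghost_meta_hits) / total;
		meta_frac -= (uint64_t)((meta_frac / 32) * imbalance);
	} else {
		/* Metadata suffers more ghost hits: grow the metadata share. */
		imbalance = (double)(ghost_meta_hits - ghost_data_hits) / total;
		meta_frac += (uint64_t)(((ARC_META_SCALE - meta_frac) / 32) *
		    imbalance);
	}
	return (meta_frac);
}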

The ghost hits on the MFU and MRU data are about 10 times larger than their
counterparts on the metadata.  This explains the sharp drop of the ZFS ARC meta
value starting at about 18:20Z.  Also, large ghost data hits kept happening
occasionally, while the ghost metadata hits seldom exceeded the data hits.
This explains why the ZFS ARC meta value stayed low until the end of the build
test.

-----

Next Work:

- Check whether the vnodes with a non-zero v_usecount are actually open(2)ed.
  - Compare the count to kern.openfiles.
  - Estimation from the current results (worked out in the sketch below):
    - Max free_phase2_retry_reg_inuse per 10 minutes: ~270K @ 00:20Z.
    - The ZFS vnode free call period at that time: ~1.57 [s]. (<= 1 / (382.5 /
(10 * 60)))
    - The hits on in-use vnodes per ZFS vnode free call: ~707. (<= 270K / (10 *
60) * 1.57)
    - If this figure is reasonably close to kern.openfiles, the phase2 retries
are indeed caused by open(2)ed vnodes, and such vnodes cannot be evicted.
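
The estimation above, spelled out as a tiny standalone program (the input
figures, 270K retries and 382.5 free calls per 10 minutes, are read off the
charts; everything else is derived):

#include <stdio.h>

int
main(void)
{
	const double window_sec = 10.0 * 60.0;	/* 10-minute chart bucket */
	const double retries_inuse = 270e3;	/* free_phase2_retry_reg_inuse */
	const double free_calls = 382.5;	/* ZFS vnode free calls per bucket */

	/* Period between ZFS vnode free calls: ~1.57 s. */
	double call_period = 1.0 / (free_calls / window_sec);
	/* In-use vnode hits per free call: ~707. */
	double inuse_per_call = retries_inuse / window_sec * call_period;

	printf("free call period: %.2f s\n", call_period);
	printf("in-use hits per free call: %.0f\n", inuse_per_call);
	return (0);
}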
