[Bug 275594] High CPU usage by arc_prune; analysis and fix

From: <bugzilla-noreply_at_freebsd.org>
Date: Thu, 28 Mar 2024 04:22:30 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #97 from Seigo Tanimura <seigo.tanimura@gmail.com> ---
Again, my apologies for the delayed comment.

Now that the nullfs fix (https://reviews.freebsd.org/D44217) has been merged
into stable/13 and stable/14, the next diff is
https://reviews.freebsd.org/D44170, the backport of the
FreeBSD-EN-23:18.openzfs fix to stable/13.  This is essentially a functional
reimplementation of 799e09f75a on main
(https://github.com/openzfs/zfs/commit/799e09f75a31e80a1702a850838c79879af8b917)
and 3ec4ea68d4 on zfs-2.2-release
(https://github.com/openzfs/zfs/commit/3ec4ea68d491a82c8de3360d50032bdecd53608f)
of OpenZFS, focusing on avoiding the pileup on the arc_prune kernel thread.

Among the FreeBSD committers, mav and markj appear in the logs of the commits
above.  I believe they would be suitable reviewers for this diff.

The rest of the diffs:

- https://reviews.freebsd.org/D44171 (kern/vfs: Add the per-filesystem vnode
counter to struct mount.)
- https://reviews.freebsd.org/D44173 (kern/openzfs: Regulate the ZFS ARC
pruning process precisely.)

are challenging because they address the interaction between OpenZFS and the
OS (FreeBSD) kernel.  I believe reviewers with insight into both OpenZFS and
FreeBSD are needed.

If reviewing these diffs proves too difficult, an alternative is to add a
sysctl(3) toggle that controls the fix in D44173, so that the fix can be
merged without being enabled by default.  Thanks to the many testers of this
issue, I now believe the fix is ready for more extensive public testing.

-----

Besides the review, my analysis turned up quite a few findings about the
healthy operation of the OpenZFS ARC and its pruning and eviction.  It would be
great to document them somehow.  They should also be kept in mind when
reviewing D44171 and D44173.

* OpenZFS ARC buffers and their evictability

- ARC buffers are separate for reading and writing.
  - A read ARC buffer must be copied into a write ARC buffer in order to
"update" it in the copy-on-write manner.
- A read ARC buffer is not evictable until its content is read from the pool.
- A write ARC buffer is not evictable until its content is written into the
pool.
  - A write ARC buffer depending on the write of another write ARC buffer may
remain unevictable for a long time.
- Under healthy operation, almost all ARC read and write buffers for data are
evictable.
  - Some of the ARC read and write buffers for metadata are not evictable
because of the internal dependencies required by the OpenZFS design.
- The write ARC buffers of the vnodes in use (v_usecount > 0) have been found
to remain unevictable until the vnodes are no longer in use.
  - This is the direct cause of the excess ARC pruning during
poudriere-bulk(8); the nullfs filesystems cached the OpenZFS vnodes by
incrementing v_usecount.
  - A similar issue may occur from a different cause, e.g. too many open
OpenZFS files.
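
The evictability rules above can be sketched as a toy model.  This is not
OpenZFS code: the names ArcBuf, Vnode, io_done, depends_on and is_evictable
are hypothetical and exist only for illustration, and the model ignores
metadata dependencies entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vnode:
    usecount: int = 0  # stands in for v_usecount (illustrative only)


@dataclass
class ArcBuf:
    kind: str                               # "read" or "write"
    io_done: bool = False                   # read from / written to the pool
    vnode: Optional[Vnode] = None           # owning vnode, if any
    depends_on: Optional["ArcBuf"] = None   # write that must complete first


def is_evictable(buf: ArcBuf) -> bool:
    """Apply the rules listed above to a single buffer (toy model)."""
    if not buf.io_done:
        # Not yet read from, or written into, the pool.
        return False
    if buf.kind == "write":
        # A write waiting on another write stays unevictable.
        if buf.depends_on is not None and not buf.depends_on.io_done:
            return False
        # Observed behavior: write buffers of in-use vnodes stay unevictable.
        if buf.vnode is not None and buf.vnode.usecount > 0:
            return False
    return True


# A vnode cached by nullfs keeps its use count above zero.
vn = Vnode(usecount=1)
wbuf = ArcBuf("write", io_done=True, vnode=vn)
print(is_evictable(wbuf))   # unevictable while the vnode is in use
vn.usecount = 0
print(is_evictable(wbuf))   # evictable once the vnode is released
```

In this model, a write buffer that has already reached the pool still cannot
be evicted while its vnode is referenced, which is the situation described
for poudriere-bulk(8) over nullfs.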

* Limitations of OpenZFS ARC pruning and eviction on FreeBSD

- ARC pruning cannot count the OpenZFS znodes (i.e. FreeBSD vnodes) that are
unprunable because of requirements on the OS side.
  - Vnodes with a non-zero v_usecount or v_holdcnt (or both) fall into this
case.
  - Attempts to recycle such vnodes cause the global vnode list to be locked
for a long time.
- The pagedaemon kernel threads may block excessively waiting for ARC eviction
progress.
  - OpenZFS lets kernel threads wait for a desired amount of ARC eviction
progress.
    - The waiting kernel threads are resumed when either the desired ARC
eviction progress has been made or there are no evictable ARC buffers at all.
  - Under heavy load, OpenZFS often manages to evict an amount of ARC buffers
that is much smaller than the desired size, but non-zero.
    - The waiting kernel threads can then neither reach the desired ARC
eviction progress nor give up quickly.
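
The waiting behavior in the last item can be illustrated with a small
simulation.  It is a deliberately simplified sketch, not the OpenZFS
implementation: the function name, its parameters, and the assumption that
every eviction pass frees the same number of bytes are all made up for
illustration.

```python
def passes_until_resumed(desired, evicted_per_pass):
    """Count eviction passes before a waiter is resumed (toy model).

    desired          -- bytes of eviction progress the waiter asked for
    evicted_per_pass -- bytes actually evicted in each pass (assumed constant)
    """
    evicted = 0
    passes = 0
    while True:
        if evicted >= desired:
            # Resume condition 1: the desired progress has been made.
            return passes, "progress met"
        if evicted_per_pass == 0:
            # Resume condition 2: nothing is evictable at all.
            return passes, "nothing evictable"
        evicted += evicted_per_pass
        passes += 1


MiB = 1024 * 1024
KiB = 1024

# Plenty to evict: the waiter resumes after a single pass.
print(passes_until_resumed(64 * MiB, 64 * MiB))
# Nothing to evict: the waiter gives up immediately.
print(passes_until_resumed(64 * MiB, 0))
# Small but non-zero progress: neither condition fires for a long time.
print(passes_until_resumed(64 * MiB, 64 * KiB))
```

With these made-up numbers, the third case keeps the waiter blocked for 1024
passes, while the two extremes resume it after at most one pass.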
