[Bug 275594] High CPU usage by arc_prune; analysis and fix

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 27 Dec 2023 10:31:56 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #18 from Seigo Tanimura <seigo.tanimura@gmail.com> ---
(In reply to Seigo Tanimura from comment #16)

* The results of the comparison between the estimated ZFS open files and
kern.openfiles

Test Summary:

- Date: 26 Dec 2023 00:50Z - 26 Dec 2023 06:42Z
- Build time: 06:41:18 (319 pkgs / hr)
- Failed ports: 4
- Setup
  - vfs.vnode.vnlru.max_free_per_call: 4000000 (==
    vfs.vnode.vnlru.max_free_per_call)
  - vfs.zfs.arc.prune_interval: 1000 (my fix for arc_prune interval enabled)
  - vfs.vnode.vnlru.extra_recycle: 1 (extra vnode recycle fix enabled)
  - vfs.zfs.arc.dnode_limit: 2684354560 (2.5G, larger than the max actual
    value observed so far)

Results:

* Estimated ZFS open files

         | (A)                        | (B)                          | (C)
         |                            | Phase 2 regular file retries |
         |                            | (Estimated ZFS open files    | ZFS open files
UTC Time | Vnode free call period [s] | seen by vnlru_free_impl())   | estimated by kern.openfiles
=========+============================+==============================+=============================
  02:00Z |                       1.27 |                          354 |                          491
  03:00Z |                       1.32 |                          411 |                          439
  04:00Z |                       1.35 |                          477 |                          425
  05:00Z |                       1.69 |                          193 |                          242
  06:00Z |                       1.88 |                          702 |                          232
  07:00Z |                       1.54 |                          299 |                          237

where

(A): 1 / ((vnode free calls) / (5 * 60))

(5 * 60) is the time granularity on the chart in seconds.  This applies to (B)
as well.

(B): (number of retries) / (5 * 60) * (A)

(C): 0.7 * (kern.openfiles value)

0.7 is the observed general ratio of the ZFS vnodes in the kernel (bug
#275594, comment #16).
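
As a worked example of the formulas above, here is a small stand-alone C
sketch.  The counter values in it are hypothetical; only the arithmetic
follows the definitions of (A), (B) and (C).

#include <stdio.h>

int
main(void)
{
	double bucket = 5.0 * 60.0;	/* chart granularity [s] */
	double free_calls = 236.0;	/* vnode free calls in one bucket (hypothetical) */
	double retries = 83600.0;	/* phase 2 regular file retries (hypothetical) */
	double openfiles = 700.0;	/* kern.openfiles sample (hypothetical) */
	double a, b, c;

	a = 1.0 / (free_calls / bucket);	/* (A) vnode free call period [s] */
	b = retries / bucket * a;		/* (B) retries per vnode free call */
	c = 0.7 * openfiles;			/* (C) estimated ZFS open files */

	printf("(A) %.2f s, (B) %.0f, (C) %.0f\n", a, b, c);
	return (0);
}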

* Chart archive: poudriere-bulk-2023-12-26_09h50m17s.7z
* Charts: zfs-vnode-free-calls.png, zfs-vnode-recycle-phase2-reg-retries.png,
kernel-open-files.png.

(B) and (C) sometimes agree on the most significant figure, and at other
times they do not.  From these results, I understand that the unrecyclable
ZFS vnodes are being held open in an indirect way.  The detail of this
"indirect" way is discussed next.

-----

* The ZFS vnodes in use by nullfs(5)

The nullfs(5) mounts involved in my poudriere jail setup are now suspected of
holding the unrecyclable ZFS vnodes.

My poudriere setup uses "-m null" on the poudriere jail.

> root@pkgfactory2:/home/pkgfactory2/tanimura/work/freebsd-git/ports-head # poudriere jail -l
> release-13_2_0 13.2-RELEASE amd64 null   2023-04-13 03:14:26 /home/poudriere.jailroot/release-13.2.0
> release-14_0_0 14.0-RELEASE amd64 null   2023-11-23 15:14:17 /home/poudriere.jailroot/release-14.0.0
> root@pkgfactory2:/home/pkgfactory2/tanimura/work/freebsd-git/ports-head # 

Under this setup, poudriere-bulk(8) mounts the jail filesystems onto each
builder by nullfs(5).  A nullfs(5) vnode adds one v_usecount to the lower
vnode (asserted in null_nodeget()) so that the pointer to the lower vnode
does not dangle.  This reference is held even after the nullfs(5) vnode is
inactivated and put onto the free list, and is released only when the
nullfs(5) vnode gets reclaimed.
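
The following is a simplified sketch of that lifetime rule, not the exact
sys/fs/nullfs source; the struct and function names here are made up for
illustration, while null_nodeget(), null_reclaim(), vrele() and M_TEMP are
the real kernel facilities referred to.

#include <sys/param.h>
#include <sys/malloc.h>
#include <sys/vnode.h>

struct null_node_sketch {
	struct vnode	*lowervp;	/* lower (e.g. ZFS) vnode, referenced once */
	struct vnode	*uppervp;	/* the nullfs vnode stacked on top */
};

/*
 * Reclaim-time counterpart (cf. null_reclaim()): only here is the
 * reference on the lower vnode, taken at null_nodeget() time, finally
 * dropped.  Until then the lower ZFS vnode keeps v_usecount > 0 and
 * cannot be recycled, even while the nullfs vnode itself sits inactive
 * on the free list.
 */
static void
null_node_sketch_reclaim(struct null_node_sketch *nn)
{
	struct vnode *lowervp = nn->lowervp;

	free(nn, M_TEMP);
	vrele(lowervp);		/* drop the reference taken at creation */
}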

The nullfs(5) design above explains the estimation results for the
unrecyclable ZFS vnodes: the more files the builders open in ZFS via
nullfs(5), the more unrecyclable ZFS vnodes are created.  In detail,
however, the estimation has some error because multiple builders can open
the same ZFS file.

The massive freeing of the vnodes after the build is also explained by the
nullfs(5) design.  The cleanup of the builder filesystems discards a lot of
nullfs(5) vnodes, which, in turn, drops the v_usecount of the lower ZFS
vnodes so that they can be evicted.

-----

The finding above introduces a new question: should the ZFS vnodes used by
nullfs(5) be recycled?

My answer is no.  The major hurdle is searching the vnode stacking links.
They essentially form a tree with the ZFS (or any non-nullfs(5)) vnode as
the root, spanning multiple nullfs(5) vnode leaves and depth levels.  The
search is likely to be even more complex than the linear scan of the vnode
list.

In addition, all vnodes in the tree must be recyclable for the ZFS vnode at
the tree root to be recyclable as well.  This is likely to impose a complex
dependency on the ZFS vnode recycling.

-----

My investigation so far, including this one, has shown that it costs too
much to scan all vnodes without any positive estimate in advance.  We need a
way to check whether ARC pruning will yield a fruitful result, in a way much
cheaper than the vnode scan.

It may be good to account for the number of ZFS vnodes not in use.  Before
starting an ARC prune pass, we can check that count and defer pruning if it
is too low.  A similar check is already implemented in arc_evict_impl() for
the eviction of the ARC data and metadata by checking the evictable size:
the ARC data and metadata eviction is skipped if there are zero evictable
bytes.
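
A rough sketch of the kind of check I have in mind follows.  All of the
names in it (zfs_vnodes_inactive, arc_prune_min_inactive,
arc_prune_worthwhile()) are hypothetical and do not exist in the current
code; only arc_evict_impl() is real.

#include <sys/param.h>

/* Hypothetical counter: ZFS vnodes with v_usecount == 0, to be maintained
 * wherever ZFS vnodes gain and lose their last use reference. */
static u_long zfs_vnodes_inactive;

/* Hypothetical threshold below which an ARC prune pass is deferred. */
static u_long arc_prune_min_inactive = 1024;

/*
 * O(1) test to run before queueing an ARC prune, mirroring how
 * arc_evict_impl() skips eviction when the evictable size is zero: if
 * too few ZFS vnodes are unused, a prune pass cannot free enough of
 * them to justify a scan over the vnode list.
 */
static int
arc_prune_worthwhile(void)
{
	/* A racy read is fine here; this is only a heuristic. */
	return (zfs_vnodes_inactive >= arc_prune_min_inactive);
}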

* My next work

Figure out the requirement and design of the accounting above.

-- 
You are receiving this mail because:
You are the assignee for the bug.