Re: RFC: How ZFS handles arc memory use

From: Doug Ambrisko <ambrisko_at_ambrisko.com>
Date: Wed, 29 Oct 2025 21:06:48 UTC
It seems that, around the switch to OpenZFS, I started seeing the arc clean
task running at 100% on a core.  I use nullfs on my laptop to map my shared
ZFS /data partition into a few vnet instances.  After a night or so I would
run into this issue.  I found that I had a bunch of vnodes being held by the
other layers.  My solution was to reduce kern.maxvnodes and vfs.zfs.arc.max
so that the ARC stayed at a reasonable size without killing other applications.
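
For anyone who wants to try the same clamp, this is the general shape of it;
the numbers below are only illustrative and should be sized to the machine's
RAM and workload:

 # cap the vnode count and the ARC (example values only)
 sysctl kern.maxvnodes=400000
 sysctl vfs.zfs.arc.max=8589934592     # 8 GB
 # or make it persistent across boots in /etc/sysctl.conf:
 #   kern.maxvnodes=400000
 #   vfs.zfs.arc.max=8589934592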

That is why a while back I added the vnode count to mount -v so that
I could see the usage of vnodes for each mount point.  I made a script
to report on things:

 #!/bin/sh
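 # Print the ARC- and vnode-related counters, then list the mount points
 # holding the most vnodes (using the vnode count that mount -v reports).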

 (
 sysctl kstat.zfs.misc.arcstats.arc_prune
 sysctl kstat.zfs.misc.arcstats.arc_raw_size
 sysctl kstat.zfs.misc.arcstats.c_max
 sysctl vfs.zfs.arc_max
 sysctl vfs.wantfreevnodes
 sysctl vfs.freevnodes
 sysctl vfs.vnodes_created
 sysctl vfs.numvnodes
 sysctl vfs.vnode.vnlru.kicks
 ) | awk '{ printf "%40s %10d\n", $1, $2 }'

 mount -v | while read fs_type junk mount params
 do
        count=`echo "$params" | sed 's/^.*count //' | sed 's/ [\)]//'`
        #echo "$fs_type, $mount, $params"
        echo "$count, $fs_type, $mount"
 done | awk '{ printf "%10d %5s %s\n", $1, $2, $3 }' | sort -nr | head
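
The sort -nr | head at the end just lists the ten busiest mount points by
vnode count, which is usually enough to spot which nullfs layer or dataset
is holding things.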

With a reduced vnode max and ARC max, my system stays reasonably responsive
for weeks or longer, assuming I can suspend and resume without a crash.

Thanks,

Doug A.

On Wed, Oct 22, 2025 at 10:46:50PM +0200, Peter Eriksson wrote:
| I too am seeing this issue on some of our FreeBSD 14.n “backup” servers. However, they are not NFS servers - they are rsync targets backing up the user-facing NFS (& SMB) servers.
| 
| Every now and then (once a month) I see the load numbers spiking and disk I/O going through the roof - and then performance grinds to a virtual halt (responding extremely slowly to shells/console I/O) for many hours, before it resolves itself (partly). A reboot fixes the issue more thoroughly though, and then it runs fine for a couple of months before it happens again.
| 
| Our user-facing (NFS- & SMB-serving) servers are all still running FreeBSD 13.x in order to avoid running into this issue - but eventually we will be forced to upgrade to 14 when 13 goes EOL next year (so it would be nice if this issue could be fixed before then :-)...
| 
| The 14.3 servers involved each have 512-768 GB of RAM, a great many ZFS filesystems (22k on one, 111k on the other) and many files and directories (somewhere around 1G on one, 2.5G on the other).
| 
| I’ve seen some posts suggesting this issue (or something similar) has been seen in the Linux world too, and there (I think) it was related to inode/vnode directory caching causing pinned memory to be allocated that couldn’t easily be released - and since my rsync backup servers scan all files/directories when doing backups, this sounds plausible…
| 
| Here’s a graph showing some measurements when this happens - the graphs aren’t perfectly aligned, but things start at 22:00 when the backups begin: the load number starts to spike, and at the same time something causes the vfs.numvnodes counter to drop drastically. The ZFS memory usage doesn’t seem to change much at that time though… We measure a lot of different values, so let me know if there’s some other number someone would like to see :-)
| 
| [attached graph omitted]
| 
| 
| - Peter
| 
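
In case it helps correlate graphs like those, a trivial logger along these
lines will line the counters up in one place; the 60-second interval and the
exact set of sysctls are just a starting point:

 #!/bin/sh
 # print a timestamped snapshot of the interesting counters every minute
 # (redirect the output to a file to keep a history)
 while :; do
        date +%s
        sysctl vfs.numvnodes vfs.freevnodes \
               kstat.zfs.misc.arcstats.size \
               vm.stats.vm.v_free_count
        sleep 60
 done
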
| > On 22 Oct 2025, at 16:34, Rick Macklem <rick.macklem@gmail.com> wrote:
| > 
| > Hi,
| > 
| > A couple of people have reported problems with NFS servers,
| > where essentially all of the system's memory gets exhausted.
| > They see the problem on 14.n FreeBSD servers (which use the
| > newer ZFS code) but not on 13.n servers.
| > 
| > I am trying to learn how ZFS handles arc memory use to try
| > and figure out what can be done about this problem.
| > 
| > I know nothing about ZFS internals or UMA(9) internals,
| > so I could be way off, but here is what I think is happening.
| > (Please correct me on this.)
| > 
| > The L1ARC uses uma_zalloc_arg()/uma_zfree_arg() to allocate
| > the arc memory. The zones are created using uma_zcreate(),
| > so they are regular zones. This means the pages come from
| > slabs in a keg, and those pages are wired.
| > 
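
One userland way to watch those zones is vmstat -z; the exact zone names
vary between versions, so the grep pattern below is just a guess at the
interesting ones:

 vmstat -z | head -1
 vmstat -z | grep -E 'abd_chunk|arc_buf|dnode_t|dmu_buf|zio_(data_)?buf|zfs_znode'

The FREE column there should be items still held (wired) in the zone's
caches rather than returned to the page allocator.
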
| > The only time the size of the slab/keg will be reduced by ZFS
| > is when it calls uma_zone_reclaim(.., UMA_RECLAIM_DRAIN),
| > which is called by arc_reap_cb(), triggered by arc_reap_cb_check().
| > 
| > arc_reap_cb_check() uses arc_available_memory() and triggers
| > arc_reap_cb() when arc_available_memory() returns a negative
| > value.
| > 
| > arc_available_memory() returns a negative value when
| > zfs_arc_free_target (vfs.zfs.arc.free_target) is greater than freemem.
| > (By default, zfs_arc_free_target is set to vm_cnt.v_free_target.)
| > 
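A quick way to see how close a machine is to that trigger from userland
(all three should be in pages):

 sysctl vfs.zfs.arc.free_target    # ZFS's idea of "enough free pages"
 sysctl vm.stats.vm.v_free_count   # current freemem
 sysctl vm.v_free_target           # the page daemon's own target
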
| > Does all of the above sound about right?
| > 
| > This leads me to...
| > - zfs_arc_free_target (vfs.zfs.arc.free_target) needs to be larger
| > or
| > - Most of the wired pages in the slab are per-cpu,
| >  so the uma_zone_reclaim() needs to UMA_RECLAIM_DRAIN_CPU
| >  on some systems. (Not the small test systems I have, where I
| >  cannot reproduce the problem.)
| > or
| > - uma_zone_reclaim() needs to be called under other
| >  circumstances.
| > or
| > - ???
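
The first option at least is easy to experiment with at run time, since the
sysctl looks to be writable; the factor of two below is arbitrary, just to
make the reap fire while there is still headroom:

 # raise ZFS's free-page threshold above the page daemon's target
 target=$(sysctl -n vm.v_free_target)
 sysctl vfs.zfs.arc.free_target=$((target * 2))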
| > 
| > How can you tell if a keg/slab is per-cpu?
| > (For my simple test system, I only see "UMA Slabs 0:" and
| > "UMA Slabs 1:". It looks like UMA Slabs 0: is being used for
| > ZFS arc allocation for this simple test system.)
| > 
| > Hopefully folk who understand ZFS arc allocation or UMA
| > can jump in and help out, rick
|