Re: RFC: How ZFS handles arc memory use
- In reply to: Rick Macklem : "RFC: How ZFS handles arc memory use"
Date: Wed, 22 Oct 2025 15:42:11 UTC
On Wed, Oct 22, 2025 at 07:34:39AM -0700, Rick Macklem wrote:
> Hi,
>
> A couple of people have reported problems with NFS servers,
> where essentially all of the system's memory gets exhausted.
> They see the problem on 14.n FreeBSD servers (which use the
> newer ZFS code) but not on 13.n servers.
>
> I am trying to learn how ZFS handles arc memory use to try
> and figure out what can be done about this problem.
>
> I know nothing about ZFS internals or UMA(9) internals,
> so I could be way off, but here is what I think is happening.
> (Please correct me on this.)
>
> The L1ARC uses uma_zalloc_arg()/uma_zfree_arg() to allocate
> the arc memory. The zones are created using uma_zcreate(),
> so they are regular zones. This means the pages are coming
> from a slab in a keg, which are wired pages.
>
> The only time the size of the slab/keg will be reduced by ZFS
> is when it calls uma_zone_reclaim(.., UMA_RECLAIM_DRAIN),
> which is called by arc_reap_cb(), triggered by arc_reap_cb_check().
>
> arc_reap_cb_check() uses arc_available_memory() and triggers
> arc_reap_cb() when arc_available_memory() returns a negative
> value.
>
> arc_available_memory() returns a negative value when
> zfs_arc_free_target (vfs.zfs.arc.free_target) is greater than freemem.
> (By default, zfs_arc_free_target is set to vm_cnt.v_free_target.)
>
> Does all of the above sound about right?

It's been a while since I've looked, but that sounds roughly correct.
Note that the vm_lowmem eventhandler is invoked when fewer than
v_free_target pages are available, and this should pressure ZFS into
shrinking the ARC.

> This leads me to...
> - zfs_arc_free_target (vfs.zfs.arc.free_target) needs to be larger
> or
> - Most of the wired pages in the slab are per-cpu,
> so uma_zone_reclaim() needs to use UMA_RECLAIM_DRAIN_CPU
> on some systems. (Not the small test systems I have, where I
> cannot reproduce the problem.)

The number of wired pages belonging to per-CPU caches should be fairly
small, since the total size of the per-CPU caches is bounded by roughly
2 * bucket_size * ncpu items. For instance, the ZFS ABD chunk zone on
my build system has $(sysctl -n vm.uma.abd_chunk.bucket_size) == 220
items per bucket. Each item is a page, so that gives an upper bound of
220*2*32 pages in the ABD zone per-CPU caches. That's about 56MB, which
is not a huge amount on this system with 128GB of RAM.

> or
> - uma_zone_reclaim() needs to be called under other
> circumstances.
> or
> - ???
>
> How can you tell if a keg/slab is per-cpu?
> (For my simple test system, I only see "UMA Slabs 0:" and
> "UMA Slabs 1:". It looks like UMA Slabs 0: is being used for
> ZFS arc allocation for this simple test system.)

A slab is the backend allocation unit for (most) UMA zones. A keg is a
structure which manages slabs. When the frontend needs to allocate a
new item, it asks the keg for one; the keg then either returns an item
from an existing slab, or allocates a new slab from the VM system.

The frontend is a "zone"; it employs per-CPU caching to try to make the
allocation and free paths cheap and scalable, i.e., in the common case
there is no need to acquire any locks. The zone maintains several
"buckets" of free items per CPU. When an allocation misses in the
per-CPU cache, a per-zone linked list of full buckets is used. If that
list is empty, we go to the keg and ask it to give us more items.
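If it helps to see the pieces together, here is a rough kernel-side
sketch of the pattern described above. It is not the actual ZFS or ABD
code: the zone name, item size, and function names are made up for
illustration, and error handling is omitted.

/*
 * Rough sketch of the UMA usage pattern described above.  Not the
 * actual ZFS code: "example chunks", example_zone, and the function
 * names are made up.
 */
#include <sys/param.h>
#include <sys/malloc.h>
#include <vm/uma.h>

static uma_zone_t example_zone;

static void
example_zone_init(void)
{
	/*
	 * A "regular" zone: items are carved out of slabs managed by
	 * the zone's keg, and the slab pages are wired.
	 */
	example_zone = uma_zcreate("example chunks", PAGE_SIZE,
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
}

static void *
example_alloc(void)
{
	/*
	 * Fast path: an item from this CPU's bucket.  Slow path: the
	 * zone's list of full buckets, and finally the keg's slabs.
	 */
	return (uma_zalloc(example_zone, M_WAITOK));
}

static void
example_free(void *item)
{
	/* Freed items normally go back into a per-CPU bucket. */
	uma_zfree(example_zone, item);
}

static void
example_reclaim(void)
{
	/*
	 * UMA_RECLAIM_DRAIN frees the zone's cached buckets and
	 * returns unused slabs to the VM system;
	 * UMA_RECLAIM_DRAIN_CPU additionally flushes the per-CPU
	 * buckets.
	 */
	uma_zone_reclaim(example_zone, UMA_RECLAIM_DRAIN);
}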
When a keg allocates a slab, it must also allocate a structure which
tracks the state of each item within the subdivided slab. These are
the "UMA Slabs" zones you referred to. For some types of items, the
slab header can be stored within the slab itself, so no explicit
allocation is required. For other cases (including ZFS ABD buffers
which are used to populate the ARC), a separate allocation from these
zones is required.

> Hopefully folk who understand ZFS arc allocation or UMA
> can jump in and help out, rick
>
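To tie this back to the reclaim trigger in your original mail, here is
a small userland model of the check that drives arc_reap_cb(). It is a
simplified sketch, not the actual OpenZFS code; the *_sketch names, the
4K page size, and the example numbers are assumptions for illustration.

/*
 * Simplified model of the reclaim trigger discussed above.  The real
 * logic lives in arc_available_memory()/arc_reap_cb_check() in the
 * OpenZFS FreeBSD port and considers more conditions than this.
 */
#include <stdint.h>
#include <stdio.h>

#define SKETCH_PAGE_SIZE 4096		/* assumed 4K pages */

static uint64_t freemem;		/* stand-in for the free page count */
static uint64_t zfs_arc_free_target;	/* vfs.zfs.arc.free_target */

/* Returns a byte count; a negative value means a memory deficit. */
static int64_t
arc_available_memory_sketch(void)
{
	int64_t n;

	n = (int64_t)freemem - (int64_t)zfs_arc_free_target;
	return (n * SKETCH_PAGE_SIZE);
}

/*
 * A negative value above is what makes the reap callback fire, and the
 * callback is what ends up calling uma_zone_reclaim(...,
 * UMA_RECLAIM_DRAIN) on the ARC-related zones.
 */
static int
arc_reap_cb_check_sketch(void)
{
	return (arc_available_memory_sketch() < 0);
}

int
main(void)
{
	/* Made-up numbers: target of 87000 free pages, 50000 free. */
	zfs_arc_free_target = 87000;
	freemem = 50000;
	printf("available: %jd bytes, reap: %d\n",
	    (intmax_t)arc_available_memory_sketch(),
	    arc_reap_cb_check_sketch());
	return (0);
}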