Re: The pagedaemon evicts ARC before scanning the inactive page list

From: Mark Johnston <markj_at_freebsd.org>
Date: Tue, 18 May 2021 21:45:18 UTC

On Tue, May 18, 2021 at 03:07:44PM -0600, Alan Somers wrote:
> I'm using ZFS on servers with tons of RAM and running FreeBSD
> 12.2-RELEASE.  Sometimes they get into a pathological situation where most
> of that RAM sits unused.  For example, right now one of them has:
> 
> 2 GB   Active
> 529 GB Inactive
> 16 GB  Free
> 99 GB  ARC total
> 469 GB ARC max
> 86 GB  ARC target
> 
> When a server gets into this situation, it stays there for days, with the
> ARC target barely budging.  All that inactive memory never gets reclaimed
> and put to good use.  Frequently the server never recovers until a reboot.
> 
> I have a theory for what's going on.  Ever since r334508^ the pagedaemon
> sends the vm_lowmem event _before_ it scans the inactive page list.  If the
> ARC frees enough memory, then vm_pageout_scan_inactive won't need to free
> any.  Is that order really correct?  For reference, here's the relevant
> code, from vm_pageout_worker:

That was the case even before r334508.  Note that prior to that revision
vm_pageout_scan_inactive() would trigger vm_lowmem if pass > 0, before
scanning the inactive queue.  During a memory shortage we have pass > 0.
pass == 0 only when the page daemon is scanning the active queue.

> shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
> if (shortage > 0) {
>         ofree = vmd->vmd_free_count;
>         if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
>                 shortage -= min(vmd->vmd_free_count - ofree,
>                     (u_int)shortage);
>         target_met = vm_pageout_scan_inactive(vmd, shortage,
>             &addl_shortage);
> } else
>         addl_shortage = 0;
> 
> Raising vfs.zfs.arc_min seems to work around the problem.  But ideally that
> wouldn't be necessary.

vm_lowmem is too primitive: it doesn't tell subscribing subsystems
anything about the magnitude of the shortage.  At the same time, the VM
doesn't know much about how much memory they are consuming.  A better
strategy, at least for the ARC, would be to reclaim memory based on the
relative memory consumption of each subsystem.  In your case, when the
page daemon goes to reclaim memory, it should use the inactive queue to
make up ~85% of the shortfall and reclaim the rest from the ARC.  Even
better would be if the ARC could use the page cache as a second-level
cache, like the buffer cache does.
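
To make the ~85% figure concrete: with 529 GB inactive and 99 GB in the
ARC, the inactive queue holds 529 / (529 + 99), or roughly 84%, of the
memory shared between the two.  The following is only a rough sketch of
that proportional split.  split_shortage() and struct reclaim_split are
made-up names for illustration, not existing kernel code, and the sketch
assumes the products fit in 64 bits.

```c
#include <stdint.h>

/*
 * Hypothetical illustration, not FreeBSD code: divide a memory shortage
 * between the inactive queue and the ARC in proportion to their sizes.
 * All three arguments must use the same unit (pages, bytes, GB, ...).
 */
struct reclaim_split {
	uint64_t from_inactive;	/* portion to reclaim from the inactive queue */
	uint64_t from_arc;	/* portion to reclaim from the ARC */
};

static struct reclaim_split
split_shortage(uint64_t shortage, uint64_t inactive, uint64_t arc)
{
	struct reclaim_split s = { 0, 0 };
	uint64_t total = inactive + arc;

	/* Nothing to reclaim from either source. */
	if (total == 0)
		return (s);

	/*
	 * Integer division rounds down, so the ARC absorbs the remainder.
	 * Assumption: shortage * inactive does not overflow 64 bits.
	 */
	s.from_inactive = shortage * inactive / total;
	s.from_arc = shortage - s.from_inactive;
	return (s);
}
```

With the numbers from your report, split_shortage(1000, 529, 99) assigns
842 units to the inactive queue and 158 to the ARC.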

Today I believe the ARC treats vm_lowmem as a signal to shed some
arbitrary fraction of evictable data.  If the ARC is able to quickly
answer the question, "how much memory can I release if asked?", then
the page daemon could use that to determine how much of its reclamation
target should come from the ARC vs. the page cache.
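
As a sketch of what such a query might look like, assuming a callback
that reports the amount a cache could release: none of these names
(struct cache_source, reclaimable_fn, arc_share(), fixed_reclaimable())
exist in the tree, they are invented here to illustrate the idea.  The
ARC's share is proportional to what it reports and is never larger than
that amount.

```c
#include <stdint.h>

/* Hypothetical interface: a cache reports how much it could release. */
typedef uint64_t (*reclaimable_fn)(void *arg);

struct cache_source {
	reclaimable_fn	reclaimable;	/* answers "how much can I release?" */
	void		*arg;		/* opaque state passed to the callback */
};

/* Stand-in callback for illustration: returns the value arg points to. */
static uint64_t
fixed_reclaimable(void *arg)
{
	return (*(uint64_t *)arg);
}

/*
 * Decide how much of the shortage to request from the ARC, given the
 * size of the inactive queue and the ARC's own answer.  The result is
 * proportional to the ARC's reclaimable amount and clamped to it.
 * Assumption: shortage * avail does not overflow 64 bits.
 */
static uint64_t
arc_share(uint64_t shortage, uint64_t inactive, const struct cache_source *arc)
{
	uint64_t avail = arc->reclaimable(arc->arg);
	uint64_t total = inactive + avail;
	uint64_t share;

	if (total == 0)
		return (0);
	share = shortage * avail / total;
	return (share < avail ? share : avail);
}
```

The remainder of the shortage would then be passed to the inactive queue
scan, much as vm_pageout_worker() already reduces the shortage by
whatever vm_lowmem managed to free.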