Re: The pagedaemon evicts ARC before scanning the inactive page list

From: Alan Somers <asomers_at_freebsd.org>
Date: Wed, 19 May 2021 03:55:25 UTC
On Tue, May 18, 2021 at 9:25 PM Konstantin Belousov <kostikbel@gmail.com>
wrote:

> On Tue, May 18, 2021 at 05:55:36PM -0600, Alan Somers wrote:
> > On Tue, May 18, 2021 at 4:10 PM Mark Johnston <markj@freebsd.org> wrote:
> >
> > > On Tue, May 18, 2021 at 04:00:14PM -0600, Alan Somers wrote:
> > > > On Tue, May 18, 2021 at 3:45 PM Mark Johnston <markj@freebsd.org>
> > > > wrote:
> > > >
> > > > > On Tue, May 18, 2021 at 03:07:44PM -0600, Alan Somers wrote:
> > > > > > I'm using ZFS on servers with tons of RAM and running FreeBSD
> > > > > > 12.2-RELEASE.  Sometimes they get into a pathological situation
> > > > > > where most of that RAM sits unused.  For example, right now one
> > > > > > of them has:
> > > > > >
> > > > > > 2 GB   Active
> > > > > > 529 GB Inactive
> > > > > > 16 GB  Free
> > > > > > 99 GB  ARC total
> > > > > > 469 GB ARC max
> > > > > > 86 GB  ARC target
> > > > > >
> > > > > > When a server gets into this situation, it stays there for days,
> > > > > > with the ARC target barely budging.  All that inactive memory
> > > > > > never gets reclaimed and put to good use.  Frequently the server
> > > > > > never recovers until a reboot.
> > > > > >
> > > > > > I have a theory for what's going on.  Ever since r334508^ the
> > > > > > pagedaemon sends the vm_lowmem event _before_ it scans the
> > > > > > inactive page list.  If the ARC frees enough memory, then
> > > > > > vm_pageout_scan_inactive won't need to free any.  Is that order
> > > > > > really correct?  For reference, here's the relevant code, from
> > > > > > vm_pageout_worker:
> > > > >
> > > > > That was the case even before r334508.  Note that prior to that
> > > > > revision vm_pageout_scan_inactive() would trigger vm_lowmem if
> > > > > pass > 0, before scanning the inactive queue.  During a memory
> > > > > shortage we have pass > 0.  pass == 0 only when the page daemon
> > > > > is scanning the active queue.
> > > > >
> > > > > > shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
> > > > > > if (shortage > 0) {
> > > > > >         ofree = vmd->vmd_free_count;
> > > > > >         if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
> > > > > >                 shortage -= min(vmd->vmd_free_count - ofree,
> > > > > >                     (u_int)shortage);
> > > > > >         target_met = vm_pageout_scan_inactive(vmd, shortage,
> > > > > >             &addl_shortage);
> > > > > > } else
> > > > > >         addl_shortage = 0;
> > > > > >
> > > > > > Raising vfs.zfs.arc_min seems to work around the problem.  But
> > > > > > ideally that wouldn't be necessary.
> > > > >
> > > > > vm_lowmem is too primitive: it doesn't tell subscribing subsystems
> > > > > anything about the magnitude of the shortage.  At the same time,
> > > > > the VM doesn't know much about how much memory they are consuming.
> > > > > A better strategy, at least for the ARC, would be to reclaim
> > > > > memory based on the relative memory consumption of each subsystem.
> > > > > In your case, when the page daemon goes to reclaim memory, it
> > > > > should use the inactive queue to make up ~85% of the shortfall and
> > > > > reclaim the rest from the ARC.  Even better would be if the ARC
> > > > > could use the page cache as a second-level cache, like the buffer
> > > > > cache does.
> > > > >
> > > > > Today I believe the ARC treats vm_lowmem as a signal to shed some
> > > > > arbitrary fraction of evictable data.  If the ARC is able to
> > > > > quickly answer the question, "how much memory can I release if
> > > > > asked?", then the page daemon could use that to determine how much
> > > > > of its reclamation target should come from the ARC vs. the page
> > > > > cache.
> > > > >
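Just to make sure I follow the proportional idea, here is a rough,
untested sketch of the split I think is being described.
split_shortage(), inactive_cnt, and arc_evictable are names I made up
for illustration; nothing with these names exists in the tree today.

/*
 * Hypothetical helper: divide the page daemon's shortage (in pages)
 * between the inactive queue and the ARC in proportion to their current
 * sizes.  With ~529 GB inactive and ~99 GB of ARC, the inactive queue
 * would cover 529 / (529 + 99) ~= 84% of the shortfall, which is where
 * the ~85% figure above comes from.
 */
static u_int
split_shortage(u_int shortage, u_long inactive_cnt, u_long arc_evictable,
    u_int *arc_share)
{
        u_long total;
        u_int inact_share;

        total = inactive_cnt + arc_evictable;
        if (total == 0) {
                *arc_share = 0;
                return (shortage);
        }
        inact_share = (u_int)((u_long)shortage * inactive_cnt / total);
        *arc_share = shortage - inact_share;
        return (inact_share);   /* pages to take from the inactive queue */
}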
> > > >
> > > > I guess I don't understand why you would ever free from the ARC
> > > > rather than from the inactive list.  When is inactive memory ever
> > > > useful?
> > >
> > > Pages in the inactive queue are either unmapped or haven't had their
> > > mappings referenced recently.  But they may still be frequently
> > > accessed by file I/O operations like sendfile(2).  That's not to say
> > > that reclaiming from other subsystems first is always the right
> > > strategy, but note also that the page daemon may scan the inactive
> > > queue many times in between vm_lowmem calls.
> > >
> >
> > So by default ZFS tries to free (arc_target / 128) bytes of memory in
> > arc_lowmem.  That's huge!  On this server, pidctrl_daemon typically
> > requests 0-10MB, and arc_lowmem tries to free 600 MB.  It looks like
> > it would be easy to modify vm_lowmem to include the total amount of
> > memory that it wants freed.  I could make such a patch.  My next
> > question is: what's the fastest way to generate a lot of inactive
> > memory?  My first attempt was "find . | xargs md5", but that isn't
> > terribly effective.  The production machines are doing a lot of
> > "zfs recv" and running some busy Go programs, among other things, but
> > I can't easily replicate that workload on
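To be concrete about the patch I mentioned there: the rough shape I have
in mind is below.  This is an untested sketch.  The byte-count argument
and the "vm_lowmem2" event are hypothetical; today's vm_lowmem handlers
only receive the flags word, and I'm assuming the arc_c,
arc_shrink_shift, and arc_reduce_target_size() names used by the
existing arc_lowmem() path.

/*
 * Sketch: let the low-memory event carry the page daemon's shortage so
 * that consumers like the ARC can free roughly that much instead of a
 * fixed fraction of the ARC target (arc_c >> arc_shrink_shift, i.e.
 * about 1/128 of it by default).
 */
typedef void (*vm_lowmem2_handler_t)(void *arg, int flags,
    uint64_t req_bytes);
EVENTHANDLER_DECLARE(vm_lowmem2, vm_lowmem2_handler_t);

/* What vm_pageout_lowmem() would invoke, given its page shortage: */
static void
vm_pageout_lowmem_sketch(u_int shortage)
{
        EVENTHANDLER_INVOKE(vm_lowmem2, VM_LOW_PAGES,
            (uint64_t)shortage * PAGE_SIZE);
}

/* And the ARC's handler could cap its eviction at the request: */
static void
arc_lowmem_sketch(void *arg __unused, int flags __unused,
    uint64_t req_bytes)
{
        uint64_t to_free;

        /* Free no more than what the page daemon actually asked for. */
        to_free = MIN(req_bytes, arc_c >> arc_shrink_shift);
        arc_reduce_target_size(to_free);
}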
>
> Is your machine ZFS-only?  If yes, then the typical sources of inactive
> memory can be of two kinds:
>

No, there is also FUSE.  But there is typically < 1GB of Buf memory, so I
didn't mention it.


> - anonymous memory that apps allocate with facilities like malloc(3).
>   If inactive is shrinkable then it is probably not this kind, because
>   dirty pages from anon objects must go through the laundry->swap route
>   to get evicted, and you did not mention swapping
>

No, there's no appreciable amount of swapping going on.  Nor is the laundry
list typically more than a few hundred MB.


> - double-copy pages cached in v_objects of ZFS vnodes, clean or dirty.
>   If unmapped, these are mostly a waste.  Even if mapped, the source
>   of truth for the data is the ARC, AFAIU, so they can be dropped as
>   well, since the inactive state means that their content is not hot.
>

So if a process mmap()'s a file on ZFS and reads from it but never writes
to it, will those pages show up as inactive?


>
> You can try to inspect the most outstanding objects adding to the
> inactive queue with 'vmstat -o' to see where most of the inactive pages
> come from.
>

Wow, that did it!  About 99% of the inactive pages come from just a few
vnodes which are used by the FUSE servers.  But I also see a few large
entries like:
1105308 333933 771375   1   0 WB  df
What does that signify?


>
> If indeed they are double-copy, then perhaps ZFS could react somewhat
> differently even to the current primitive vm_lowmem signal.  First, it
> could do a pass over its vnodes and
> - free clean unmapped pages
> - if some targets are not met after that, launder dirty pages,
>   then return to freeing clean unmapped pages
> all that before ever touching its cache (ARC).
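If I understand the pass over the vnodes correctly, the first step
(dropping clean pages) could look very roughly like the sketch below.
It's untested: zfs_drop_clean_pages() is a name I made up, it glosses
over the real locking and the laundering step, it assumes the 12.x
vget() signature, and OBJPR_CLEANONLY alone doesn't restrict the pass
to unmapped pages, so a real version would need an extra check for
that.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mount.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/rwlock.h>
#include <sys/vnode.h>
#include <vm/vm.h>
#include <vm/vm_object.h>

/*
 * Hypothetical first pass over a ZFS mount: drop clean pages from each
 * vnode's VM object before touching the ARC.  Dirty pages (the second
 * step in the suggestion above) are left alone here.
 */
static void
zfs_drop_clean_pages(struct mount *mp)
{
        struct vnode *vp, *mvp;
        vm_object_t obj;

        MNT_VNODE_FOREACH_ALL(vp, mp, mvp) {
                /* The iterator holds the vnode interlock; vget() drops it. */
                if (vget(vp, LK_EXCLUSIVE | LK_INTERLOCK | LK_NOWAIT,
                    curthread) != 0)
                        continue;
                obj = vp->v_object;
                if (obj != NULL) {
                        VM_OBJECT_WLOCK(obj);
                        /* Free clean, unwired pages; keep dirty ones. */
                        vm_object_page_remove(obj, 0, 0, OBJPR_CLEANONLY);
                        VM_OBJECT_WUNLOCK(obj);
                }
                vput(vp);
        }
}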
>