From: Alan Somers <asomers@gmail.com>
Date: Tue, 18 May 2021 16:00:14 -0600
To: Mark Johnston
Cc: FreeBSD Hackers
Subject: Re: The pagedaemon evicts ARC before scanning the inactive page list

On Tue, May 18, 2021 at 3:45 PM Mark Johnston <markj@freebsd.org> wrote:

> On Tue, May 18, 2021 at 03:07:44PM -0600, Alan Somers wrote:
> > I'm using ZFS on servers with tons of RAM and running FreeBSD
> > 12.2-RELEASE.  Sometimes they get into a pathological situation where
> > most of that RAM sits unused.  For example, right now one of them has:
> >
> > 2 GB   Active
> > 529 GB Inactive
> > 16 GB  Free
> > 99 GB  ARC total
> > 469 GB ARC max
> > 86 GB  ARC target
> >
> > When a server gets into this situation, it stays there for days, with
> > the ARC target barely budging.  All that inactive memory never gets
> > reclaimed and put to good use.  Frequently the server never recovers
> > until a reboot.
> >
> > I have a theory for what's going on.  Ever since r334508^ the
> > pagedaemon sends the vm_lowmem event _before_ it scans the inactive
> > page list.  If the ARC frees enough memory, then
> > vm_pageout_scan_inactive won't need to free any.  Is that order
> > really correct?  For reference, here's the relevant code, from
> > vm_pageout_worker:
>
> That was the case even before r334508.  Note that prior to that
> revision vm_pageout_scan_inactive() would trigger vm_lowmem if
> pass > 0, before scanning the inactive queue.  During a memory
> shortage we have pass > 0.  pass == 0 only when the page daemon is
> scanning the active queue.
>
> > shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
> > if (shortage > 0) {
> >         ofree = vmd->vmd_free_count;
> >         if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
> >                 shortage -= min(vmd->vmd_free_count - ofree,
> >                     (u_int)shortage);
> >         target_met = vm_pageout_scan_inactive(vmd, shortage,
> >             &addl_shortage);
> > } else
> >         addl_shortage = 0;
> >
> > Raising vfs.zfs.arc_min seems to work around the problem.  But
> > ideally that wouldn't be necessary.
>
> vm_lowmem is too primitive: it doesn't tell subscribing subsystems
> anything about the magnitude of the shortage.  At the same time, the
> VM doesn't know much about how much memory they are consuming.  A
> better strategy, at least for the ARC, would be to reclaim memory
> based on the relative memory consumption of each subsystem.  In your
> case, when the page daemon goes to reclaim memory, it should use the
> inactive queue to make up ~85% of the shortfall and reclaim the rest
> from the ARC.  Even better would be if the ARC could use the page
> cache as a second-level cache, like the buffer cache does.
>
> Today I believe the ARC treats vm_lowmem as a signal to shed some
> arbitrary fraction of evictable data.  If the ARC is able to quickly
> answer the question, "how much memory can I release if asked?", then
> the page daemon could use that to determine how much of its
> reclamation target should come from the ARC vs. the page cache.
>
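For concreteness, here is a small, self-contained C sketch of the
proportional split described in the quoted text above, using the figures
from the report.  The arc_share_of_shortage() helper and its "how much
could the ARC release if asked?" input are purely illustrative, not
existing kernel interfaces:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE	4096UL

/*
 * Split a page daemon shortage (in pages) between the inactive queue
 * and the ARC in proportion to their current sizes.  "arc" stands in
 * for the hypothetical "how much could the ARC release if asked?" query.
 */
static uint64_t
arc_share_of_shortage(uint64_t shortage, uint64_t inactive, uint64_t arc)
{
	uint64_t reclaimable = inactive + arc;

	if (reclaimable == 0)
		return (0);
	return (shortage * arc / reclaimable);
}

int
main(void)
{
	/* Figures from the report: 529 GB Inactive, 99 GB ARC total. */
	uint64_t gb_pages = (1024UL * 1024 * 1024) / PAGE_SIZE;
	uint64_t inactive = 529 * gb_pages;
	uint64_t arc = 99 * gb_pages;
	uint64_t shortage = 1 * gb_pages;	/* example 1 GB target */
	uint64_t from_arc = arc_share_of_shortage(shortage, inactive, arc);

	/* Prints roughly an 84%/16% split, in line with the ~85% above. */
	printf("from ARC: %ju MB, from inactive queue: %ju MB\n",
	    (uintmax_t)((from_arc * PAGE_SIZE) >> 20),
	    (uintmax_t)(((shortage - from_arc) * PAGE_SIZE) >> 20));
	return (0);
}

With the sizes above this works out to a bit over 160 MB taken from the
ARC and roughly 860 MB from the inactive queue for each 1 GB of shortfall.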
I guess I don't understand why you would ever free from the ARC rather
than from the inactive list.  When is inactive memory ever useful?