From: Alan Somers <asomers@gmail.com>
Date: Tue, 18 May 2021 16:00:14 -0600
To: Mark Johnston
Cc: FreeBSD Hackers
Subject: Re: The pagedaemon evicts ARC before scanning the inactive page list

On Tue, May 18, 2021 at 3:45 PM Mark Johnston <markj@freebsd.org> wrote:

> On Tue, May 18, 2021 at 03:07:44PM -0600, Alan Somers wrote:
> > I'm using ZFS on servers with tons of RAM and running FreeBSD
> > 12.2-RELEASE.  Sometimes they get into a pathological situation where
> > most of that RAM sits unused.  For example, right now one of them has:
> >
> > 2 GB   Active
> > 529 GB Inactive
> > 16 GB  Free
> > 99 GB  ARC total
> > 469 GB ARC max
> > 86 GB  ARC target
> >
> > When a server gets into this situation, it stays there for days, with
> > the ARC target barely budging.  All that inactive memory never gets
> > reclaimed and put to good use.  Frequently the server never recovers
> > until a reboot.
> >
> > I have a theory for what's going on.  Ever since r334508^ the
> > pagedaemon sends the vm_lowmem event _before_ it scans the inactive
> > page list.  If the ARC frees enough memory, then
> > vm_pageout_scan_inactive won't need to free any.  Is that order
> > really correct?  For reference, here's the relevant code, from
> > vm_pageout_worker:
>
> That was the case even before r334508.  Note that prior to that
> revision vm_pageout_scan_inactive() would trigger vm_lowmem if
> pass > 0, before scanning the inactive queue.  During a memory
> shortage we have pass > 0.  pass == 0 only when the page daemon is
> scanning the active queue.
>
> > shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
> > if (shortage > 0) {
> >         ofree = vmd->vmd_free_count;
> >         if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
> >                 shortage -= min(vmd->vmd_free_count - ofree,
> >                     (u_int)shortage);
> >         target_met = vm_pageout_scan_inactive(vmd, shortage,
> >             &addl_shortage);
> > } else
> >         addl_shortage = 0;
> >
> > Raising vfs.zfs.arc_min seems to work around the problem.  But
> > ideally that wouldn't be necessary.
>
> vm_lowmem is too primitive: it doesn't tell subscribing subsystems
> anything about the magnitude of the shortage.  At the same time, the
> VM doesn't know much about how much memory they are consuming.  A
> better strategy, at least for the ARC, would be to reclaim memory
> based on the relative memory consumption of each subsystem.  In your
> case, when the page daemon goes to reclaim memory, it should use the
> inactive queue to make up ~85% of the shortfall and reclaim the rest
> from the ARC.  Even better would be if the ARC could use the page
> cache as a second-level cache, like the buffer cache does.
>
> Today I believe the ARC treats vm_lowmem as a signal to shed some
> arbitrary fraction of evictable data.  If the ARC is able to quickly
> answer the question, "how much memory can I release if asked?", then
> the page daemon could use that to determine how much of its
> reclamation target should come from the ARC vs. the page cache.
>
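For concreteness, here is a small, self-contained C sketch of the
proportional split described in the quoted text above, using the figures
from the report.  The arc_share_of_shortage() helper and its "how much
could the ARC release if asked?" input are purely illustrative, not
existing kernel interfaces:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE	4096UL

/*
 * Split a page daemon shortage (in pages) between the inactive queue
 * and the ARC in proportion to their current sizes.  "arc" stands in
 * for the hypothetical "how much could the ARC release if asked?" query.
 */
static uint64_t
arc_share_of_shortage(uint64_t shortage, uint64_t inactive, uint64_t arc)
{
	uint64_t reclaimable = inactive + arc;

	if (reclaimable == 0)
		return (0);
	return (shortage * arc / reclaimable);
}

int
main(void)
{
	/* Figures from the report: 529 GB Inactive, 99 GB ARC total. */
	uint64_t gb_pages = (1024UL * 1024 * 1024) / PAGE_SIZE;
	uint64_t inactive = 529 * gb_pages;
	uint64_t arc = 99 * gb_pages;
	uint64_t shortage = 1 * gb_pages;	/* example 1 GB target */
	uint64_t from_arc = arc_share_of_shortage(shortage, inactive, arc);

	/* Prints roughly an 84%/16% split, in line with the ~85% above. */
	printf("from ARC: %ju MB, from inactive queue: %ju MB\n",
	    (uintmax_t)((from_arc * PAGE_SIZE) >> 20),
	    (uintmax_t)(((shortage - from_arc) * PAGE_SIZE) >> 20));
	return (0);
}

With the sizes above this works out to a bit over 160 MB taken from the
ARC and roughly 860 MB from the inactive queue for each 1 GB of shortfall.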
I guess I don't understand why you would ever free from the ARC rather
than from the inactive list.  When is inactive memory ever useful?