RE: "failed to reclaim memory" with much free physmem

From: Garrett Wollman <wollman_at_bimajority.org>
Date: Thu, 11 Sep 2025 17:58:22 UTC
<<On Tue, 9 Sep 2025 12:19:21 -0700, Mark Millard <marklmi@yahoo.com> said:

> Garrett Wollman <wollman_at_bimajority.org> wrote on
> Date: Tue, 09 Sep 2025 16:19:42 UTC :

>> On some of our newer large-memory NFS servers, we are seeing services
>> killed with "failed to reclaim memory". According to our monitoring,
>> the server has >100G of physmem free at the time,

> Was that 100G+ somewhat before any reclaiming of memory started,
> the lead-up to the notice?

That was within five minutes of munin-node getting shot by the OOM
killer.  There was much less memory free ca. 24 hours before the
event.

> Any likelihood of sudden, rapid, huge drops in free RAM based on
> workload behavior?

I don't have access to client workloads, but it would have to be a bug
in ZFS if so; these are file servers, and all they run is NFS.

> Is NUMA involved?

Damn if I know.
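
One way to find out would be to look at free pages per memory domain
rather than the system-wide total. What follows is only a minimal
sketch in Python, assuming the kernel exposes per-domain counters under
names of the form vm.domain.<N>.stats.free_count and
vm.domain.<N>.stats.free_min; those names are an assumption to check
against `sysctl vm.domain` on the release in use. The function names
are hypothetical. Whether a domain below its own threshold would explain
the kills is a guess, not something the data so far shows.

```python
import re

# Assumed sysctl naming: vm.domain.<N>.stats.<field>: <integer>
# Verify against the output of `sysctl vm.domain` on the target system.
_LINE = re.compile(r"^vm\.domain\.(\d+)\.stats\.(\w+):\s*(-?\d+)$")


def parse_domain_stats(sysctl_text):
    """Return {domain: {field: value}} parsed from sysctl-style text."""
    domains = {}
    for line in sysctl_text.splitlines():
        m = _LINE.match(line.strip())
        if m:
            dom, field, value = int(m.group(1)), m.group(2), int(m.group(3))
            domains.setdefault(dom, {})[field] = value
    return domains


def starved_domains(domains):
    """List domains whose free_count is below their own free_min."""
    return sorted(
        dom
        for dom, stats in domains.items()
        if "free_count" in stats
        and "free_min" in stats
        and stats["free_count"] < stats["free_min"]
    )
```

Feeding it the text captured from sysctl on the server would show
whether more than one domain exists and whether any single domain is
short of pages while the total still looks healthy.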

>> and the only
>> solution seems to be rebooting. (There is a small amount of swap
>> configured and even less of it in use.)

> That swap is in use at all could be of interest. I wonder
> what it was doing when the swap was put to use or laundry
> was growing that led to swap being put to use.

It's pretty normal on these servers, which stay up for six months
between OS upgrades, for some userland daemons to get swapped out,
although I agree that it seems like it shouldn't happen given that the
size of memory (1 TiB) is much greater than the size of running
processes (< 1 GiB).

My suspicion here is that there's some sort of accounting error, but I
don't know where to look, and I only have data retrospectively, and
only the data that munin is collecting.  (Someone else was on call
when this happened most recently and they reported that their login
shell kept on getting shot -- as was the getty on the serial console.)
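
A rough cross-check for the accounting suspicion could be to compare the
total page count against the sum of the per-queue counters. The sketch
below is Python and assumes the usual vm.stats.vm.v_*_count sysctls
(v_page_count, v_free_count, v_active_count, v_inactive_count,
v_laundry_count, v_wire_count) are present; the function names are
hypothetical, and the counters are not guaranteed to sum exactly, so
only a large or steadily growing gap would be worth chasing.

```python
# Queue counters (under vm.stats.vm) that this sketch assumes exist.
QUEUES = (
    "v_free_count",
    "v_active_count",
    "v_inactive_count",
    "v_laundry_count",
    "v_wire_count",
)

PREFIX = "vm.stats.vm."


def parse_vm_stats(sysctl_text):
    """Return {short_name: int} for integer-valued vm.stats.vm.* lines."""
    stats = {}
    for line in sysctl_text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name.startswith(PREFIX):
            try:
                stats[name[len(PREFIX):]] = int(value.strip())
            except ValueError:
                pass  # skip non-integer values
    return stats


def unaccounted_pages(stats):
    """Pages in v_page_count not covered by any of the listed queues."""
    return stats["v_page_count"] - sum(stats.get(q, 0) for q in QUEUES)
```

Logging that difference alongside the existing munin graphs would at
least give a time series to look at after the next event.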

-GAWollman