Re: git: 4a864f624a70 - main - vm_pageout: Print a more accurate message to the console before an OOM kill

From: Mark Millard <marklmi_at_yahoo.com>
Date: Sat, 15 Jan 2022 20:27:47 UTC

On 2022-Jan-15, at 07:55, Mark Johnston <markj@FreeBSD.org> wrote:

> On Fri, Jan 14, 2022 at 09:38:56PM -0800, Mark Millard wrote:
>> Thanks. This will allow me to remove part of my personal additions
>> in this area --and my having to explain the misnomer when trying
>> to help someone analyze why they end up with OOM activity so they
>> can figure out what to do about it.
>> 
>> There seem to be two separate sources of VM_OOM_SWAPZ. Showing
>> my personal additions for them (just making them explicit in the
>> sequence of messages generated):
>> 
>> diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c
>> index 01cf9233329f..280621ca51be 100644
>> --- a/sys/vm/swap_pager.c
>> +++ b/sys/vm/swap_pager.c
>> @@ -2091,6 +2091,7 @@ swp_pager_meta_build(vm_object_t object, vm_pindex_t pindex, daddr_t swapblk)
>>                                    0, 1))
>>                                        printf("swap blk zone exhausted, "
>>                                            "increase kern.maxswzone\n");
>> +                               printf("swp_pager_meta_build: swap blk uma zone exhausted\n");
>>                                vm_pageout_oom(VM_OOM_SWAPZ);
>>                                pause("swzonxb", 10);
>>                        } else
>> @@ -2121,6 +2122,7 @@ swp_pager_meta_build(vm_object_t object, vm_pindex_t pindex, daddr_t swapblk)
>>                                    0, 1))
>>                                        printf("swap pctrie zone exhausted, "
>>                                            "increase kern.maxswzone\n");
>> +                               printf("swp_pager_meta_build: swap pctrie uma zone exhausted\n");
>>                                vm_pageout_oom(VM_OOM_SWAPZ);
>>                                pause("swzonxp", 10);
>>                        } else
>> 
>> Care to comment on the distinctions and why there are two
>> contexts classified as "out of swap space"? Would either
>> one show the swap space as (nearly?) all used in, say, top?
>> Or might one of them still end up looking like a misnomer
>> from just a top (or whatever) display?
> 
> Hmm, those cases should likely be changed from "out of swap space" to
> "failed to allocate swap metadata" or something like that.

Based on your description (later below), I agree.

> Running out
> of swap space is not itself a reason to trigger an OOM kill; if the page
> daemon can continue to reclaim clean pages while swap is full, then
> it'll do so without killing anything.  If the swap devices are full and
> the only way to reclaim memory is by laundering dirty pages, then
> "failed to reclaim memory" is the message you'd likely see after this
> commit.
> 
> The two cases which call vm_pageout_oom(VM_OOM_SWAPZ) arise when the
> swap pager fails to allocate structures used to map physical pages to
> their location on a swap device.  swap_pager_swap_init() pre-allocates
> these structures during boot, and the size of the reserves is based on
> the amount of physical memory.  In particular, each VM object maintains
> a trie of "swap blocks", each of which maps a run of SWAP_META_PAGES
> pages contiguous within an object to individual blocks on a swap device.
> One zone provides internal nodes for the trie, while the other provides
> these swap blocks.  Assuming perfect efficiency, the reserves provide
> enough memory to allow all of physical memory to be swapped out, I
> believe.

The swap space can be bigger than the RAM space, and something
approaching RAM+SWAP of "memory space" can be in use overall.
The system complains of mistuning when (approximately) the
ratio of SWAPSPACE to RAMSIZE gets too large.

Relative to the mistuning notices, as I remember, armv7 (so:
32-bit), for example, has a noticeably smaller multiplier of the
RAM size for how big a SWAP can be compared to, say, arm64/aarch64
(so: 64-bit). But, using aarch64 as an example, the complaints
start at somewhat under 4*RAMSIZE, where the
as-if-no-page-index-space-fragmentation figure would be somewhat
under 8*RAMSIZE. (8*RAMSIZE matches some documentation but
that documentation is apparently incorrect for the likes of
armv7.)

In other words, I think the wording above is inaccurate in its
details, but accurate wording would probably be much more
complicated to write and to read.
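
For what it is worth, the 4*RAMSIZE and 8*RAMSIZE figures can be
reproduced with a back-of-the-envelope model. This is only a
sketch, not kernel code: the reserve of one swap block per two
pages of RAM, the pages-per-swap-block values (16 for a 64-bit
example, 8 for a 32-bit one), and the notice starting at half of
the ideal maximum are all my assumptions for illustration, picked
to match the ratios described above. The function names are made
up. The real figures are "somewhat under" what this computes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: assume one swap block is reserved for every
 * two pages of RAM, and that each swap block can map
 * pages_per_swblk pages when the page index space is not
 * fragmented. Returns the ideal upper bound, in pages of swap.
 */
uint64_t
model_max_swap_pages(uint64_t ram_pages, unsigned pages_per_swblk)
{
	assert(pages_per_swblk > 0);
	return ((ram_pages / 2) * pages_per_swblk);
}

/*
 * Assumed for illustration: the mistuning notice starts at half
 * of the ideal maximum computed above.
 */
uint64_t
model_warn_swap_pages(uint64_t ram_pages, unsigned pages_per_swblk)
{
	return (model_max_swap_pages(ram_pages, pages_per_swblk) / 2);
}
```

With 16 pages per swap block the model gives 8*RAM as the ideal
maximum and 4*RAM for the notice; with 8 pages per swap block it
gives 4*RAM and 2*RAM, which is the kind of smaller multiplier I
remember for armv7.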

> In practice there can be external fragmentation of the page
> index space which leads to less than perfect utilization of these
> metadata structures, in which case it's possible to exhaust the
> reserves.  This seems to be a fairly rare scenario though.
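
To make the fragmentation point concrete for myself, here is a
toy model of how sparse page indices use up swap blocks faster
than dense ones. It is a sketch under assumptions: the run length
of 16 and the aligned-run layout are stand-ins for
SWAP_META_PAGES and the real trie, and the names are
hypothetical, not from swap_pager.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_META_PAGES 16	/* assumed run length per swap block */

/*
 * Count how many swap blocks the model needs to cover the given
 * page indices, which must be sorted in ascending order. Each
 * block covers one aligned run of MODEL_META_PAGES indices, so
 * indices in the same run share a block.
 */
size_t
model_swblks_needed(const uint64_t *pindex, size_t n)
{
	size_t blocks = 0;
	uint64_t last_base = 0;

	for (size_t i = 0; i < n; i++) {
		/* Round down to the start of the aligned run. */
		uint64_t base = pindex[i] - pindex[i] % MODEL_META_PAGES;

		assert(i == 0 || pindex[i] >= pindex[i - 1]);
		if (i == 0 || base != last_base) {
			blocks++;
			last_base = base;
		}
	}
	return (blocks);
}
```

In this model 16 consecutive page indices need 1 block, while 16
indices spaced 16 apart need 16 blocks, so the same number of
swapped-out pages can consume 16 times as much of the reserve.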


Side note on an example context that is possibly related to
the "fairly rare scenario":

At least one person has had the system consistently hang up
during a build activity while using a swap size small enough to
avoid the swap mistuning messages that I reference above. This
was when a monitoring top did not show swap as nearly fully used
--but definitely in significant use.

But all they had to do to avoid the hangup was have an even
larger swap space (so the mistuning message ended up being
generated). As I remember, the used swap space did not get to
the original swap space size in the monitoring via top. Maybe
the fragmentation of the swap space itself contributes, so
the bigger swap space was easier to handle? (I've no actual
clue.)

Anyway, it was not a result I had expected. I still avoid
causing the swap space mistuning notice by keeping my swap
spaces somewhat under the sizes where the messages are
generated. (While fairly rare, I sometimes do experiments
where such a RAM+SWAP total size is important.)

===
Mark Millard
marklmi at yahoo.com