Re: Chasing OOM Issues - good sysctl metrics to use?

From: Mark Millard <marklmi_at_yahoo.com>
Date: Fri, 22 Apr 2022 04:18:37 UTC
Pete Wright <pete_at_nomadlogic.org> wrote on
Date: Thu, 21 Apr 2022 19:16:42 -0700 :

> on my workstation running CURRENT (amd64/32g of ram) i've been running 
> into a scenario where after 4 or 5 days of daily use I get an OOM event 
> and both chromium and firefox are killed.  then in the next day or so 
> the system will become very unresponsive when i unlock my 
> screensaver in the morning, forcing a manual power cycle.
> 
> one thing i've noticed is growing swap usage but plenty of free and 
> inactive memory as well as a GB or so of memory in the Laundry state 
> according to top.  my understanding is that seeing swap usage grow over 
> time is expected and doesn't necessarily indicate a problem.  but what 
> concerns me is the system locking up while seeing quite a bit of disk 
> i/o (maybe from paging back in?).
> 
> in order to help chase this down i've setup the 
> prometheus_sysctl_exporter(8) to send data to a local prometheus 
> instance.  the goal is to examine memory utilization over time to help 
> detect any issues. so my question is this:
> 
> what OIDs would be useful to watch to help diagnose weird memory 
> issues like this?
> 
> i'm currently looking at:
> sysctl_vm_domain_0_stats_laundry
> sysctl_vm_domain_0_stats_active
> sysctl_vm_domain_0_stats_free_count
> sysctl_vm_domain_0_stats_inactive_pps
> 
> 
> thanks in advance - and i'd be happy to share my data if anyone is 
> interested :)

Messages in the console output would be appropriate
to report. The messages might also be available via
the following at appropriate times:

# dmesg -a
. . .

or:

# more /var/log/messages
. . .

Generally, messages from after boot has completed
are the more relevant ones.


Messages like the following are some examples
that would be of interest:

pid . . .(c++), jid . . ., uid . . ., was killed: failed to reclaim memory
pid . . .(c++), jid . . ., uid . . ., was killed: a thread waited too long to allocate a page
pid . . .(c++), jid . . ., uid . . ., was killed: out of swap space

(That last one is something of a misnomer for the
internal issue that leads to it.)

I'm hoping you got message(s) of one or more of the above
kinds. But others are also relevant:

. . . kernel: swap_pager: out of swap space
. . . kernel: swp_pager_getswapspace(7): failed

. . . kernel: swap_pager: indefinite wait buffer: bufobj: . . ., blkno: . . ., size: . . .

(Those messages do not announce a process kill but
give some evidence about context.)

Some of the messages share part of their text but
actually identify somewhat different contexts, so
each message type is relevant.

There may be other types of messages that are relevant.

The sequencing of the messages could be relevant.
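
As a rough sketch of pulling out just the kinds of messages
quoted above while keeping their order, something like the
following sh function might do. The name oom_messages and the
exact pattern are my own assumptions for illustration, not
something the system provides. The pattern only covers the
message kinds quoted above, so it would miss any other
relevant types.

```shell
# Hypothetical helper (name and pattern are illustrative assumptions).
# Prints, in their original order, the lines that match the message
# kinds quoted above. Reads the files given as arguments, or stdin
# if no files are given.
oom_messages() {
    grep -E 'was killed: |swap_pager: |swp_pager_getswapspace\(' "$@"
}
```

It could be used with either source mentioned earlier:

# dmesg -a | oom_messages
# oom_messages /var/log/messages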

Do you have any swap partitions set up and in use? The
details could be relevant. Do you have swap set up
some other way than via swap partition use? No swap?

If one or more swap partitions are in use, anything
that indicates the speed and latency characteristics
of the I/O to the drive could be relevant.

Which file systems are in use: ZFS (so with ARC)? UFS? Both?

The first block of lines from a top display could be
relevant, particularly from when the system is clearly
progressing towards having the problem. (After the
problem has happened is too late.) (I just picked top
as a way to get a bunch of the information all together
automatically.)
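
One way to collect that first block over time might look
like the sketch below. It assumes that top's batch mode (-b)
prints the summary lines first, followed by a blank line
before the process list. I have not checked that against
every version. The function name top_header and the log path
in the usage line are hypothetical.

```shell
# Hypothetical helper: print only the first block of lines from top
# output read on stdin. Assumes the summary block is terminated by a
# blank line; any leading blank lines are skipped.
top_header() {
    awk 'NF { seen = 1; print; next } seen { exit }'
}
```

For example, to append a timestamped snapshot once a minute:

# while :; do date; top -b | top_header; sleep 60; done >> /var/tmp/top-header.log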

These sorts of things might help folks help you.

===
Mark Millard
marklmi at yahoo.com