Re: Chasing OOM Issues - good sysctl metrics to use?

From: Pete Wright <pete_at_nomadlogic.org>
Date: Fri, 22 Apr 2022 23:42:42 UTC

On 4/21/22 21:18, Mark Millard wrote:
>
> Messages in the console out would be appropriate
> to report. Messages might also be available via
> the following at appropriate times:

that is what is frustrating.  i will get notification that the processes 
are killed:
Apr 22 09:55:15 topanga kernel: pid 76242 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
Apr 22 09:55:19 topanga kernel: pid 76288 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
Apr 22 09:55:20 topanga kernel: pid 76259 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
Apr 22 09:55:22 topanga kernel: pid 76252 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
Apr 22 09:55:23 topanga kernel: pid 76267 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
Apr 22 09:55:24 topanga kernel: pid 76234 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
Apr 22 09:55:26 topanga kernel: pid 76275 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
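
for tallying kill messages like the ones above out of a longer log, here is a minimal python sketch. it assumes the message format is exactly what is shown in this excerpt and only matches the "failed to reclaim memory" reason. the function names (oom_kills, kills_per_command) are made up for illustration, and whitespace is collapsed first so mail-wrapped lines still match.

```python
import re
from collections import Counter

# Pattern for the kernel kill message as it appears in the excerpt above.
# Assumption: only the "failed to reclaim memory" reason is of interest.
KILL_RE = re.compile(
    r"pid (?P<pid>\d+) \((?P<comm>[^)]+)\), jid \d+, uid \d+, "
    r"was killed: failed to reclaim memory"
)


def oom_kills(text):
    """Return a list of (pid, command) pairs, one per kill message in text."""
    # Collapse all runs of whitespace so wrapped log lines become one line.
    flat = " ".join(text.split())
    return [(int(m["pid"]), m["comm"]) for m in KILL_RE.finditer(flat)]


def kills_per_command(text):
    """Count kill messages per command name (hypothetical helper)."""
    return Counter(comm for _, comm in oom_kills(text))


# Two messages copied from the excerpt, wrapped the way the mail client did.
SAMPLE = """\
Apr 22 09:55:15 topanga kernel: pid 76242 (chrome), jid 0, uid 1001, was
killed: failed to reclaim memory
Apr 22 09:55:20 topanga kernel: pid 76259 (firefox), jid 0, uid 1001,
was killed: failed to reclaim memory
"""

if __name__ == "__main__":
    for command, count in kills_per_command(SAMPLE).items():
        print(command, count)
```
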

the system in this case had killed both firefox and chrome while i was
afk.  i logged back in and started them up to do more work, then the
next logline is from this morning when i had to force power off/on the
system as the keyboard and network were both unresponsive:

Apr 22 09:58:20 topanga syslogd: kernel boot file is /boot/kernel/kernel

> Do you have any swap partitions set up and in use? The
> details could be relevant. Do you have swap set up
> some other way than via swap partition use? No swap?

yes, i have 2GB of swap that resides on an nvme device.

> ZFS (so with ARC)? UFS? Both?

i am using ZFS and am setting my vfs.zfs.arc.max to 10G.  i have also
experienced this crash with that set to the default unlimited value.

>
> The first block of lines from a top display could be
> relevant, particularly when it is clearly progressing
> towards having the problem. (After the problem is too
> late.) (I just picked top as a way to get a bunch of
> the information all together automatically.)

since the initial OOM events happen when i am AFK it is difficult to get 
relevant stats out of top.

this is why i've started collecting more detailed metrics in 
prometheus.  my hope is i'll be able to do a better job observing how my 
system is behaving over time, in the run up to the OOM event as well as 
right before and after.  there are heaps of metrics collected though so 
hoping someone can point me in the right direction :)
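
separately from prometheus, a small local sampler that appends to a file and syncs after every write might keep the last few readings around even when the box has to be power cycled. below is a minimal python sketch of that idea. the sysctl names in SYSCTLS are only an assumed starting set (page queue counters and the ARC size), not a list anyone in this thread has recommended, and the helper names and the log path are hypothetical.

```python
import os
import subprocess
import time

# Assumed starting set of FreeBSD sysctl names; adjust to whatever turns
# out to be relevant. This list is an illustration, not a recommendation.
SYSCTLS = [
    "vm.stats.vm.v_free_count",
    "vm.stats.vm.v_inactive_count",
    "vm.stats.vm.v_laundry_count",
    "vm.stats.vm.v_wire_count",
    "kstat.zfs.misc.arcstats.size",
]


def read_sysctl(name):
    """Read one integer-valued sysctl using the sysctl(8) command."""
    result = subprocess.run(
        ["sysctl", "-n", name], capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip())


def sample(names=SYSCTLS, reader=read_sysctl):
    """Return {name: value} for a single point in time."""
    return {name: reader(name) for name in names}


def format_line(timestamp, values):
    """Render one sample as 'epoch name=value name=value ...'."""
    fields = " ".join(f"{name}={value}" for name, value in values.items())
    return f"{int(timestamp)} {fields}"


def log_forever(path, interval=10.0):
    """Append one sample every `interval` seconds, syncing each write."""
    with open(path, "a") as fh:
        while True:
            fh.write(format_line(time.time(), sample()) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # push the line to disk before sleeping
            time.sleep(interval)

# Example call (hypothetical path): log_forever("/var/tmp/memlog.txt")
```

the reader argument to sample() is only there so the function can be exercised without a real sysctl binary.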

-pete


-- 
Pete Wright
pete@nomadlogic.org
@nomadlogicLA