Re: Chasing OOM Issues - good sysctl metrics to use?

From: Pete Wright <pete_at_nomadlogic.org>
Date: Fri, 29 Apr 2022 20:41:15 UTC

On 4/29/22 11:38, Mark Millard wrote:
> On 2022-Apr-29, at 11:08, Pete Wright <pete@nomadlogic.org> wrote:
>
>> On 4/23/22 19:20, Pete Wright wrote:
>>>> The Developers' Handbook has a section on debugging deadlocks that he
>>>> referenced in a response to another report (on freebsd-hackers).
>>>>
>>>> https://docs.freebsd.org/en/books/developers-handbook/kerneldebug/#kerneldebug-deadlocks
>>> d'oh - thanks for the correction!
>>>
>>> -pete
>>>
>>>
>> hello, i just wanted to provide an update on this issue.  so the good news is that by removing the file-backed swap the deadlocks have indeed gone away!  thanks for sorting me out on that front Mark!
> Glad it helped.

d'oh - went out for lunch and workstation locked up.  i *knew* i 
shouldn't have said anything lol.

>> i still am seeing a memory leak with either firefox or chrome (maybe both where they create a voltron of memory leaks?).  this morning firefox and chrome had been killed when i first logged in. fortunately the system has remained responsive for several hours which was not the case previously.
>>
>> when looking at my metrics i see vm.domain.0.stats.inactive take a nose dive from around 9GB to 0 over the course of 1min.  the timing seems to align with around the time when firefox crashed, and is preceded by a large spike in vm.domain.0.stats.active from ~1GB to 7GB 40mins before the apps crashed.  after the binaries were killed memory metrics seem to have recovered (laundry size grew, and inactive size grew by several gigs for example).
> Since the form of kill here is tied to sustained low free memory
> ("failed to reclaim memory"), you might want to report the
> vm.domain.0.stats.free_count figures from various time frames as
> well:
>
> vm.domain.0.stats.free_count: Free pages
>
> (It seems you are converting pages to byte counts in your report.
> I'm not really worried about the units so long as they are
> obvious.)
>
> There are also figures possibly tied to the handling of the kill
> activity but some being more like thresholds than usage figures,
> such as:
>
> vm.domain.0.stats.free_severe: Severe free pages
> vm.domain.0.stats.free_min: Minimum free pages
> vm.domain.0.stats.free_reserved: Reserved free pages
> vm.domain.0.stats.free_target: Target free pages
> vm.domain.0.stats.inactive_target: Target inactive pages
ok thanks Mark, based on this input and the fact i did manage to lock up 
my system, i'm going to get some metrics up on my website and share them 
publicly when i have time.  i'll definitely take your input into account 
when sharing this info.
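
for reference, a minimal sketch of how the counters named in this thread could be sampled with the units made explicit.  it assumes a FreeBSD host where hw.pagesize and the vm.domain.0.stats.* sysctls are readable; the function names pages_to_mib and sample_vm_stats are made up for illustration and are not part of any existing tool.

```shell
#!/bin/sh
# Convert a page count to MiB.
# Arguments: $1 = page count, $2 = page size in bytes.
pages_to_mib() {
    echo $(( $1 * $2 / 1048576 ))
}

# Print one timestamped line with the counters mentioned in this thread,
# converted from pages to MiB. Assumes FreeBSD's sysctl(8) is available.
sample_vm_stats() {
    pagesize=$(sysctl -n hw.pagesize)
    line=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    for name in free_count inactive active; do
        pages=$(sysctl -n "vm.domain.0.stats.$name")
        line="$line $name=$(pages_to_mib "$pages" "$pagesize")MiB"
    done
    echo "$line"
}
```

calling sample_vm_stats from a loop with a sleep between samples (for example every 60 seconds) would give a log that can be lined up against the time of an OOM kill.  the threshold-style values (free_min, free_target and so on) could be read once the same way, since they are not expected to change from sample to sample.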

>
> Also, what value were you using for:
>
> vm.pageout_oom_seq
$ sysctl vm.pageout_oom_seq
vm.pageout_oom_seq: 120
$

cheers,
-pete

-- 
Pete Wright
pete@nomadlogic.org
@nomadlogicLA