RE: Chasing OOM Issues - good sysctl metrics to use?

From: Michael Jung <mikej_at_paymentallianceintl.com>
Date: Sun, 24 Apr 2022 04:06:06 UTC
Hi:

I guess I'm kind of hijacking this thread, but I think what I'm
going to ask is close enough to the topic at hand to ask here
instead of starting a new thread and referencing this one.

I use sysutils/monit to watch running processes and then restart them
as needed.

I use protect(1) to make sure that monit does not get killed.

In /etc/rc.local "protect -i monit"

In the end, protect seems to simply call

     procctl(2) with PROC_SPROTECT and the inherit flag, as documented in procctl(2).

So I followed the code a bit; I guess that's cool if I got it right, but I know
about .0001% of the system internals.
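For reference, here is roughly the call I believe "protect -i -p <pid>" boils down to, written out as a self-contained sketch. This is FreeBSD-only code (procctl(2), PROC_SPROTECT, and the PPROT_* flags are FreeBSD interfaces), so treat it as an illustration rather than a tested program:

```c
/*
 * Sketch of the procctl(2) call behind protect(1) on FreeBSD.
 * PPROT_SET marks the process as protected from OOM kills;
 * PPROT_INHERIT extends that protection to future children.
 *
 * Usage: ./sprotect <pid>
 */
#include <sys/types.h>
#include <sys/procctl.h>

#include <err.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
	pid_t pid;
	int flags;

	if (argc != 2)
		errx(1, "usage: %s pid", argv[0]);
	pid = (pid_t)strtol(argv[1], NULL, 10);

	/* The moral equivalent of "protect -i -p <pid>". */
	flags = PPROT_SET | PPROT_INHERIT;
	if (procctl(P_PID, pid, PROC_SPROTECT, &flags) == -1)
		err(1, "procctl(PROC_SPROTECT)");
	return (0);
}
```

Without -i, protect omits PPROT_INHERIT, so only the named process (not its descendants) is shielded from the OOM killer.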

Can anyone speak to how protect(1) works, and whether it is itself prone to what has been discussed?

For my use case, is protect "good enough", or do I really need the kind of tuning that has been talked about?

If protect is the right answer, I would love it if someone could explain how
it does its thing at a slightly higher technical level: why I'm doing it
wrong, why what I'm doing is OK, or why I should really be doing something
completely different, and the reasoning behind it.

I suspect there are many who would like to know this but would never ask,
at least not on list.

Always the seeker of new knowledge.

Thanks in advance.

--mikej








From: owner-freebsd-current@freebsd.org On Behalf Of Mark Millard
Sent: Saturday, April 23, 2022 3:32 PM
To: Pete Wright <pete@nomadlogic.org>
Cc: freebsd-current <freebsd-current@freebsd.org>
Subject: Re: Chasing OOM Issues - good sysctl metrics to use?

On 2022-Apr-23, at 10:26, Pete Wright <pete@nomadlogic.org> wrote:

> On 4/22/22 18:46, Mark Millard wrote:
>> On 2022-Apr-22, at 16:42, Pete Wright <pete@nomadlogic.org> wrote:
>>
>>> On 4/21/22 21:18, Mark Millard wrote:
>>>> Messages in the console out would be appropriate
>>>> to report. Messages might also be available via
>>>> the following at appropriate times:
>>> that is what is frustrating. i will get notification that the processes are killed:
>>> Apr 22 09:55:15 topanga kernel: pid 76242 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
>>> Apr 22 09:55:19 topanga kernel: pid 76288 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
>>> Apr 22 09:55:20 topanga kernel: pid 76259 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
>>> Apr 22 09:55:22 topanga kernel: pid 76252 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
>>> Apr 22 09:55:23 topanga kernel: pid 76267 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
>>> Apr 22 09:55:24 topanga kernel: pid 76234 (chrome), jid 0, uid 1001, was killed: failed to reclaim memory
>>> Apr 22 09:55:26 topanga kernel: pid 76275 (firefox), jid 0, uid 1001, was killed: failed to reclaim memory
>> Those messages are not reporting being out of swap
>> as such. They are reporting sustained low free RAM
>> despite a number of less drastic attempts to gain
>> back free RAM (to above some threshold).
>>
>> FreeBSD does not swap out the kernel stacks for
>> processes that stay in a runnable state: it just
>> continues to page. Thus just one large process
>> that has a huge working set of active pages can
>> lead to OOM kills in a context where no other set
>> of processes would be enough to gain the free
>> RAM required. Such contexts are not really a
>> swap issue.
>
> Thank you for this clarification/explanation - that totally makes sense!
>
>>
>> Based on there being only 1 "killed:" reason,
>> I have a suggestion that should allow delaying
>> such kills for a long time. That in turn may
>> help with investigating without actually
>> suffering the kills during the activity: more
>> time with low free RAM to observe.
>
> Great idea thank-you! and thanks for the example settings and descriptions as well.
>> But those are large but finite activities. If
>> you want to leave something running for days,
>> weeks, months, or whatever that produces the
>> sustained low free RAM conditions, the problem
>> will eventually happen. Ultimately one may have
>> to exit and restart such processes once in a
>> while, exiting enough of them to give a little
>> time with sufficient free RAM.
> perfect - since this is a workstation my run-time for these processes is probably a week as i update my system and pkgs over the weekend, then dog food current during the work week.
>
>>> yes i have a 2GB of swap that resides on a nvme device.
>> I assume a partition style. Otherwise there are other
>> issues involved --that likely should be avoided by
>> switching to partition style.
>
> so i kinda lied - initially i had just a 2G swap, but i added a second 20G swap a while ago to have enough space to capture some cores while testing drm-kmod work. based on this comment i am going to only use the 20G file backed swap and see how that goes.
>
> this is my fstab entry currently for the file backed swap:
> md99 none swap sw,file=/root/swap1,late 0 0

I think you may have taken my suggestion backwards . . .

Unfortunately, vnode (file) based swap space should be *avoided*
and partitions are what should be used in order to avoid deadlocks:

On 2017-Feb-13, at 7:20 PM, Konstantin Belousov <kostikbel at gmail.com> wrote
on the freebsd-arm list:

QUOTE
swapfile write requires the write request to come through the filesystem
write path, which might require the filesystem to allocate more memory
and read some data. E.g. it is known that any ZFS write request
allocates memory, and that write request on large UFS file might require
allocating and reading an indirect block buffer to find the block number
of the written block, if the indirect block was not yet read.

As result, swapfile swapping is more prone to the trivial and unavoidable
deadlocks where the pagedaemon thread, which produces free memory, needs
more free memory to make a progress. Swap write on the raw partition over
simple partitioning scheme directly over HBA are usually safe, while e.g.
zfs over geli over umass is the worst construction.
END QUOTE

The developer's handbook has a section on debugging deadlocks that he
referenced in a response to another report (on freebsd-hackers):

https://docs.freebsd.org/en/books/developers-handbook/kerneldebug/#kerneldebug-deadlocks
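To make the suggestion concrete, a partition-backed swap entry in /etc/fstab looks like the sketch below. The device name is hypothetical; substitute your actual swap partition (gpart show will list them):

```
# /etc/fstab -- partition-backed swap (device name is hypothetical)
/dev/nvd0p3    none    swap    sw    0    0
```

With a partition, swap writes go straight to the device, bypassing the filesystem write path that the quoted deadlock description is about.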

>>>> ZFS (so with ARC)? UFS? Both?
>>> i am using ZFS and am setting my vfs.zfs.arc.max to 10G. i have also experienced this crash with that set to the default unlimited value as well.
>> I use ZFS on systems with at least 8 GiBytes of RAM,
>> but I've never tuned ZFS. So I'm not much help for
>> that side of things.
>
> since we started this thread I've gone ahead and removed the zfs.arc.max setting since it's cruft at this point. i initially added it to test a configuration i deployed to a server hosting a bunch of VMs.
>
>> I'm hoping that vm.pageout_oom_seq=120 (or more) makes it
>> so you do not have to have identified everything up front
>> and can explore easier.
>>
>>
>> Note that vm.pageout_oom_seq is both a loader tunable
>> and a writeable runtime tunable:
>>
>> # sysctl -T vm.pageout_oom_seq
>> vm.pageout_oom_seq: 120
>> amd64_ZFS amd64 1400053 1400053 # sysctl -W vm.pageout_oom_seq
>> vm.pageout_oom_seq: 120
>>
>> So you can use it to extend the time when the
>> machine is already running.
>
> fantastic. thanks again for taking your time and sharing your knowledge and experience with me Mark!
>
> these types of journeys are why i run current on my daily driver, it really helps me better understand the OS so that i can be a better admin on the "real" servers i run for work. its also just fun to learn stuff too heh.
>
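For anyone wanting to make the vm.pageout_oom_seq setting persistent, a sketch of the two places it can go (120 is just the example value used in this thread):

```
# /boot/loader.conf (loader tunable, applied at boot):
vm.pageout_oom_seq="120"

# /etc/sysctl.conf (applied by rc at boot), or set live with:
#   sysctl vm.pageout_oom_seq=120
vm.pageout_oom_seq=120
```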


===
Mark Millard
marklmi at yahoo.com
