RPI3 swap experiments ["was killed: out of swap space" with: "v_free_count: 5439, v_inactive_count: 1"]

Mark Millard marklmi at yahoo.com
Sun Aug 5 05:56:56 UTC 2018



On 2018-Aug-4, at 6:45 PM, John-Mark Gurney <jmg at funkthat.com> wrote:

> Mark Millard wrote this message on Sat, Aug 04, 2018 at 09:08 -0700:
>> On 2018-Aug-4, at 7:08 AM, John-Mark Gurney <jmg at funkthat.com> wrote:
>> 
>>> Mark Millard via freebsd-arm wrote this message on Sat, Aug 04, 2018 at 00:14 -0700:
>>>> On 2018-Aug-3, at 8:55 PM, Jamie Landeg-Jones <jamie at catflap.org> wrote:
>>>> 
>>>>> Mark Millard <marklmi at yahoo.com> wrote:
>>>>> 
>>>>>> If Inact+Laundry+Buf(?)+Free was not enough to provide sufficient
>>>>>> additional RAM, I would have guessed that some Active Real Memory
>>>>>> should then have been paged/swapped out and so RAM would be made
>>>>>> available. (This requires the system to have left itself sufficient
>>>>>> room in RAM for that guessed activity.)
>>>>>> 
>>>>>> But I'm no expert at the intent or actual operation.
>>>>>> 
>>>>>> Bob P.'s reports (for having sufficient swap space)
>>>>>> also indicate the likes of:
>>>>>> 
>>>>>> v_free_count: 5439, v_inactive_count: 1
>>>>>> 
>>>>>> 
>>>>>> So all the examples have: "v_inactive_count: 1".
>>>>>> (So: vmd->vmd_pagequeues[PQ_INACTIVE].pq_cnt==1 )
>>>>> 
>>>>> Thanks for the feedback. I'll do a few more runs and other stress tests
>>>>> to see if that result is consistent. I'm open to any other idea too!
>>>>> 
>>>> 
>>>> The book "The Design and Implementation of the FreeBSD Operating System"
>>>> (2nd edition, 2014) states (page labeled 296):
>>>> 
>>>> QUOTE:
>>>> The FreeBSD swap-out daemon will not select a runnable process to swap
>>>> out. So, if the set of runnable processes does not fit in memory, the
>>>> machine will effectively deadlock. Current machines have enough memory
>>>> that this condition usually does not arise. If it does, FreeBSD avoids
>>>> deadlock by killing the largest process. If the condition begins to arise
>>>> in normal operation, the 4.4BSD algorithm will need to be restored.
>>>> END QUOTE.
>>>> 
>>>> As near as I can tell, for the likes of rpi3's and rpi2's, the condition
>>>> is occurring during buildworld "normal operation" that tries to use the
>>>> available cores to advantage. (Your context does not have the I/O
>>>> problems that Bob P.'s have had in at least some of your OOM process
>>>> kill examples, if I understand right.)
>>>> 
>>>> (4.4BSD used to swap out the runnable process that had been resident
>>>> the longest, followed by the processes taking turns being swapped out.
>>>> I'll not quote the exact text about such.)
>>>> 
>>>> So I guess the question becomes, is there a reasonable way to enable
>>>> the 4.4BSD style of "Swapping" for "small" memory machines in order to
>>>> avoid having to figure out how to not end up with OOM process kills
>>>> while also not just wasting cores by using -j1 for buildworld?
>>>> 
>>>> In other words: enable swapping out active RAM when it eats nearly
>>>> all the non-wired RAM.
>>>> 
>>>> But it might be discovered that the performance is not better than
>>>> using fewer cores during buildworld. (Experiments needed and
>>>> possibly environment specific for the tradeoffs.) Avoiding having
>>>> to figure out the maximum -j? that avoids OOM process kills but
>>>> avoids just sticking to -j1 seems an advantage for some rpi3 and
>>>> rpi2 folks.
>>> 
>>> Interesting observation, maybe playing w/:
>>> vm.swap_idle_threshold2: Time before a process will be swapped out
>>> vm.swap_idle_threshold1: Guaranteed swapped in time for a process
>>> 
>>> will help things...  lowering 2 will likely make the processes available
>>> for swap sooner...
>> 
>> Looking up related information:
>> 
>> https://www.freebsd.org/doc/handbook/configtuning-disk.html
>> 
>> says vm.swap_idle_enabled is also involved with those two. In fact
>> it indicates the two are not even used until vm.swap_idle_enabled=1 .
>> 
>> QUOTE
>> 11.10.1.4. vm.swap_idle_enabled
>> The vm.swap_idle_enabled sysctl(8) variable is useful in large multi-user systems with many active login users and lots of idle processes. Such systems tend to generate continuous pressure on free memory reserves. Turning this feature on and tweaking the swapout hysteresis (in idle seconds) via vm.swap_idle_threshold1 and vm.swap_idle_threshold2 depresses the priority of memory pages associated with idle processes more quickly than the normal pageout algorithm. This gives a helping hand to the pageout daemon. Only turn this option on if needed, because the tradeoff is essentially pre-paging memory sooner rather than later which eats more swap and disk bandwidth. In a small system this option will have a determinable effect, but in a large system that is already doing moderate paging, this option allows the VM system to stage whole processes into and out of memory easily.
>> END QUOTE
>> 
>> The defaults seem to be:
>> 
>> # sysctl vm.swap_idle_enabled vm.swap_idle_threshold1 vm.swap_idle_threshold2
>> vm.swap_idle_enabled: 0
>> vm.swap_idle_threshold1: 2
>> vm.swap_idle_threshold2: 10
>> 
>> Quoting the book again:
>> 
>> QUOTE
>> If the swapping of idle processes is enabled and the pageout daemon can find any
>> processes that have been sleeping for more than 10 seconds (swap_idle_threshold2,
>> the cutoff for considering the time sleeping to be "a long time"), it will swap
>> them all out. [. . .] if none of these processes are available, the pageout
>> daemon will swap out all processes that have been sleeping for as briefly as 2
>> seconds (swap_idle_threshold1).
>> END QUOTE.
>> 
>> I'd not normally expect a compile or link to sleep for such long periods
>> (unless I/O has long delays). Having, say, 4 such processes active at the
>> same time may be unlikely to have any of them swap out on the default scale.
>> (Clang is less I/O bound and more memory bound than GCC as I remember what
>> I've observed. That statement ignores paging/swapping by the system.)
>> 
>> Such would likely remain true for any positive-integer-second
>> threshold settings?
> 
> The point is to more aggressively swap out OTHER processes so that
> there is more memory available.
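For reference, turning that on would look something like the below. This is
just a sketch of the knobs already quoted above; the lowered threshold2 value
is illustrative, not a tested recommendation:

```shell
# Enable idle-process swapping for a one-off experiment:
sysctl vm.swap_idle_enabled=1
# Optionally lower threshold2 so sleeping processes become swap
# candidates sooner than the default 10 seconds (value illustrative):
sysctl vm.swap_idle_threshold2=5

# Or persist the settings across reboots via /etc/sysctl.conf:
#   vm.swap_idle_enabled=1
#   vm.swap_idle_threshold2=5
```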

I guess I'm relying on what I've seen in top to indicate that almost all
of the space from other processes has been paged out: not much of Active
is for non-compiles/non-links during the problem times.

For example, in http://www.catflap.org/jamie/rpi3/rpi3-mmc-swap-failure-stats.txt
it lists (last before the kill):

last pid: 30806;  load averages:  4.05,  4.04,  4.00  up 0+02:03:06    10:39:59
42 processes:  5 running, 37 sleeping
CPU: 88.5% user,  0.0% nice,  6.1% system,  0.4% interrupt,  5.0% idle
Mem: 564M Active, 2M Inact, 68M Laundry, 162M Wired, 97M Buf, 104M Free
Swap: 4G Total, 76M Used, 4G Free, 1% Inuse

  PID USERNAME    THR PRI NICE  SIZE   RES STATE    C   TIME    WCPU COMMAND
30762 root          1 101    0  175M  119M CPU2     2   0:39  99.07% c++
30613 root          1 101    0  342M  191M CPU0     0   2:02  95.17% c++
30689 root          1 101    0  302M  226M CPU3     3   1:28  94.48% c++
22226 root          1  20    0   19M    2M select   0   0:31   0.00% make
 1021 root          1  20    0   12M  340K wait     2   0:07   0.00% sh

Rule of thumb figures:
564M Active
vs. RES for the 3 c++'s:
119M+191M+226M = 536M for the 3 c++'s.

So: 564M - 536M = 28M (approx. active for other processes)

It appears to me that some c++ would likely need to swap out given that
this context led to OOM kills.

(It might be that this rule of thumb is not good enough
for such judgments.)

[Personally I normally limit myself to -jN figures that have N*512 MiBytes
or more on the board. -j4 on a rpi3 or rpi2 has only 4*256 MiBytes.]



===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)


