RPI3 swap experiments ["was killed: out of swap space" with: "v_free_count: 5439, v_inactive_count: 1"]

Mark Millard marklmi at yahoo.com
Wed Aug 8 18:03:49 UTC 2018



On 2018-Aug-8, at 8:38 AM, bob prohaska <fbsd at www.zefox.net> wrote:

> On Mon, Aug 06, 2018 at 11:58:37AM -0400, Mark Johnston wrote:
>> On Wed, Aug 01, 2018 at 09:27:31PM -0700, Mark Millard wrote:
>>> [I have a top-posted introduction here in reply
>>> to a message listed at the bottom.]
>>> 
>>> Bob P., meet Mark J. Mark J., meet Bob P. I'm
>>> hoping you can help Bob P. use a patch that
>>> you once published on the lists. This was from:
>>> 
>>> https://lists.freebsd.org/pipermail/freebsd-current/2018-June/069835.html
>>> 
>>> Bob P. has been having problems with an rpi3-
>>> based buildworld ending up with "was killed:
>>> out of swap space" even though the swap
>>> partitions do not seem to be heavily used (as
>>> seen via swapinfo or by watching top).
>>> 
>>>> The patch to report OOMA information did its job, very tersely. The console reported
>>>> v_free_count: 5439, v_inactive_count: 1
>>>> Aug  1 18:08:25 www kernel: pid 93301 (c++), uid 0, was killed: out of swap space
>>>> 
>>>> The entire buildworld.log and gstat output are at
>>>> http://www.zefox.net/~fbsd/rpi3/swaptests/r336877M/
>>>> 
>>>> It appears that at 18:08:21 a write to the USB swap device took 530.5 ms;
>>>> next, top was killed, and ten seconds later c++ was killed, _after_ da0b
>>>> was no longer busy.
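
Aside: the two counters in that console report can also be sampled while
a build is running, via the standard vm.stats sysctl names:

# sysctl vm.stats.vm.v_free_count vm.stats.vm.v_inactive_count
vm.stats.vm.v_free_count: 5439
vm.stats.vm.v_inactive_count: 1

A v_inactive_count near zero while v_free_count is still in the thousands
looks like the page daemon running out of inactive pages it can reclaim,
not like the swap space literally filling up.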
>> 
>> My suspicion, based on the high latency, is that this is a consequence
>> of r329882, which lowered the period of time that the page daemon will
>> sleep while waiting for dirty pages to be cleaned.  If a certain number
>> of consecutive wakeups and queue scans occur without making progress,
>> the OOM killer is triggered.  That number is vm.pageout_oom_seq - could
>> you try increasing it by a factor of 10 and retry your test?
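
For reference: vm.pageout_oom_seq is a read/write sysctl, so the
factor-of-10 bump from the default of 12 (my understanding of the
default in head at the time) needs no rebuild:

# sysctl vm.pageout_oom_seq=120

To keep the setting across reboots, the same line can go in
/etc/sysctl.conf.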
>> 
>>>> This buildworld stopped quite a bit earlier than usual; most of the time
>>>> the buildworld.log file is close to 20 MB by the time OOMA acts. In this case
>>>> it was around 13 MB. It's not clear whether that's significant.
>>>> 
>>>> If somebody would indicate whether this result is informative, and suggest
>>>> any possible improvements to the test, I'd be most grateful. 
>> 
>> If the above suggestion doesn't help, the next thing to try would be to
>> revert the oom_seq value to the default, apply this patch, and see if
>> the problem continues to occur.  If this doesn't help, please try
>> applying both measures, i.e., set oom_seq to 120 _and_ apply the patch.
>> 
>> diff --git a/sys/vm/vm_pagequeue.h b/sys/vm/vm_pagequeue.h
>> index fb56bdf2fdfc..29a16060253f 100644
>> --- a/sys/vm/vm_pagequeue.h
>> +++ b/sys/vm/vm_pagequeue.h
>> @@ -74,7 +74,7 @@ struct vm_pagequeue {
>> } __aligned(CACHE_LINE_SIZE);
>> 
>> #ifndef VM_BATCHQUEUE_SIZE
>> -#define	VM_BATCHQUEUE_SIZE	7
>> +#define	VM_BATCHQUEUE_SIZE	1
>> #endif
>> 
>> struct vm_batchqueue {
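
As I read the patch: VM_BATCHQUEUE_SIZE controls how many page queue
operations are batched per-CPU before being committed under the queue
lock, so shrinking it from 7 to 1 effectively disables the batching and
lets pages become visible to the page daemon's scans immediately.
Applying it is the usual source-tree routine, assuming sources in
/usr/src and the diff saved as batchqueue.diff (a file name made up
here for illustration):

# cd /usr/src
# patch < batchqueue.diff
# make buildkernel
# make installkernel
# shutdown -r now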
> 
> The patched kernel ran longer than the default but OOMA still halted buildworld around
> 13 MB. That's considerably farther than a default buildworld would have run, but less
> than observed when setting vm.pageout_oom_seq=120 alone. Log files are at
> http://www.zefox.net/~fbsd/rpi3/swaptests/r337226M/1gbsdflash_1gbusbflash/batchqueue/
> 
> Both changes are now in place and -j4 buildworld has been restarted. 
> 
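
For anyone following along: the "around 13 MB" refers to the size of
buildworld.log when OOMA fires. One way to capture such a log (a sketch,
not necessarily how Bob captures his):

# cd /usr/src
# make -j4 buildworld > buildworld.log 2>&1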

I'll note that top has a "sort by RES"(ident) memory use order that can
be interesting in these contexts, showing the largest figures
first. Specifying -ores on the command line is one way to select that order.
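
For example, to watch the largest resident-memory processes refresh
every second during a build (both flags are standard top(1) options):

# top -ores -s 1

In the same spirit, swapinfo -h gives a quick human-readable view of how
much of each swap device is actually in use.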


===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)


