RPI3 swap experiments (r338342 with vm.pageout_oom_seq="1024" and 6 GB swap)

Mark Millard marklmi at yahoo.com
Thu Sep 6 08:04:49 UTC 2018


On 2018-Sep-6, at 12:08 AM, bob prohaska <fbsd at www.zefox.net> wrote:

> On Wed, Sep 05, 2018 at 11:20:14PM -0700, Mark Millard wrote:
>> 
>> 
>> On 2018-Sep-5, at 9:23 PM, bob prohaska <fbsd at www.zefox.net> wrote:
>> 
>>> On Wed, Sep 05, 2018 at 07:43:52PM -0700, Rodney W. Grimes wrote:
>>>> 
>>>> What makes you believe that the VM system has any concept about
>>>> the speed of swap devices?  IIRC it simply uses them in a
>>>> round-robin fashion with no knowledge of them being fast or slow,
>>>> or shared with file systems or other stuff.
>>>> 
>>> 
>>> Mostly the assertion that OOMA kills which happened while the
>>> system had plenty of free swap were caused by the swap being "too
>>> slow". If the machine knows some swap is slow, it seems capable of
>>> discerning that other swap is faster.
>> 
>> If an RPI3 magically had a full-speed/low-latency optane context
>> as its swap space, it would still get process kills for buildworld
>> buildkernel for vm.pageout_oom_seq=12 for -j4 as I understand
>> things at this point. (Presumes still having 1 GiByte of RAM.)
>> 
>> In other words: the long latency issues you have in your rpi3
>> configuration may contribute to the detailed "just when did it
>> fail" but low-latency/high-speed I/O would be unlikely to prevent
>> kills from eventually happening during the llvm parts of buildworld .
>> Free RAM would still be low for "long periods". Increasing
>> vm.pageout_oom_seq is essential from what I can tell.
>> 
> Understood and accepted. I'm using  vm.pageout_oom_seq=1024 at present.
> The system struggles mightily, but it keeps going and finishes.
> 
>> vm.pageout_oom_seq is about controlling "how long". -j1 builds are
>> about keeping less RAM active. (That is also the intent for use of
>> LDFLAGS.lld+=-Wl,--no-threads .) Of course, for the workload involved,
>> using a context with more RAM can avoid having "low RAM" for
>> as long. An aarch64 board with 4 GiByte of RAM and 4 cores possibly
>> has no problem for -j4 buildworld buildkernel for head at this
>> point: Free RAM might well never be low during such a build in such
>> a context.
>> 
>> (The quotes like "how long" are because I refer to the time
>> consequences, the units are not time but I'm avoiding the detail.)
>> 
>> The killing criteria do not directly measure and test swapping I/O
>> latencies or other such as far as I know. Such things are only
>> involved indirectly via other consequences of the delays involved
>> (when they are involved at all). That is my understanding.
>> 
> Perhaps I'm being naive here, but when one sees two devices holding
> swap, one at ~25% busy and one at ~150% busy, it seems to beg for
> a little selective pressure for diverting traffic to the less busy
> device from the more busy one. Maybe it's impossible, maybe it's more
> trouble than the VM folks want to invest. Just maybe, it's doable
> and worthwhile, to take advantage of a cheap, power efficient platform.

Continuously reorganize where things page out to in the swap partitions,
on the fly? (The reads have to be from wherever a page was paged out
to.) Which virtual memory pages will be written out and read back in,
and how often, is not known up front, is not predictable, and varies
over time.

That some partitions end up with more active paging than others is
not all that surprising.

Avoiding that would add overhead that is always present, and it would
use even more RAM for the extra tracking. It would also involve trying
to react as fast as the demand changes: at the full speed of the
programs that are keeping "cores" and RAM in active use.
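
To make that concrete, below is a minimal sketch (not the actual
swap_pager code; names such as pick_swap_device and swdev are made up)
of the kind of device selection Rodney described: each page-out simply
takes the next configured device in rotation that still has free
blocks, and any later page-in is tied to wherever that write landed.
Nothing in such a scheme looks at how busy each device happens to be
at the moment.

#include <stddef.h>

struct swdev {
    const char *name;       /* e.g. "da0s2b" or "mmcsd0s3b" */
    size_t      free_blks;  /* unused swap blocks remaining */
};

/*
 * Round-robin over the configured swap devices, skipping full ones.
 * Conceptual illustration only, not FreeBSD's swap_pager.
 */
static struct swdev *
pick_swap_device(struct swdev *devs, size_t ndevs)
{
    static size_t next;     /* rotates across calls */
    size_t tries;

    for (tries = 0; tries < ndevs; tries++) {
        struct swdev *sw = &devs[next];

        next = (next + 1) % ndevs;
        if (sw->free_blks > 0)
            return (sw);    /* a later page-in must read from here */
    }
    return (NULL);          /* all swap is full */
}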

> I too am unsure of the metric for "too slow". From earlier discussion
> I got the impression it was something like a count of how many cycles
> of request and rejection (more likely, deferral) for swap space were
> made; after a certain count is reached, OOMA is invoked. That picture
> is sure to be simplistic, and may well be flat-out wrong.

Note that writing out a now-dirty RAM page that was already written
out before need not allocate any new swap space: it can reuse the place
it used before. But if the RAM page is in active enough use, writing it
out and freeing the page could just lead to it being read back in
(allocating a free RAM page nearly immediately). Such can be viewed as
a waste of time.

Another way of wording this: the "system working set" can be so large
that paging becomes ineffective (performs too poorly).

Having virtual memory "thrashing" can slow things by orders of
magnitude.
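
As a rough back-of-envelope illustration (the latencies below are
generic assumptions, roughly 100 ns for DRAM and about 1 ms for
SD-card-class random I/O, not measurements of any particular RPI3
setup), even a small page-fault rate dominates the average cost of
touching memory:

#include <stdio.h>

int
main(void)
{
    const double t_ram_ns  = 100.0;  /* assumed DRAM access time */
    const double t_swap_ns = 1.0e6;  /* assumed ~1 ms swap I/O time */
    const double rates[] = { 0.0001, 0.001, 0.01 };
    int i;

    /* Average cost per memory touch for a few page-fault rates. */
    for (i = 0; i < 3; i++) {
        double miss = rates[i];
        double eff = (1.0 - miss) * t_ram_ns + miss * t_swap_ns;

        printf("fault rate %.4f -> ~%.0f ns per access (~%.0fx RAM)\n",
            miss, eff, eff / t_ram_ns);
    }
    return (0);
}

With those assumed numbers, a 0.1% fault rate already makes the average
access about 11x slower than RAM, and a 1% fault rate about 100x slower.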

It has more to do with counting attempts at moving out dirty RAM pages
that have not been used recently, in the hope of freeing those RAM
pages once they are written out. (But the RAM pages may become active
again before they are freed.) Clean RAM pages can be freed more
directly, but there is still the issue of such a page being in active
enough use that freeing it would just lead to reading it back in to a
newly allocated RAM page.
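
A minimal sketch of the kind of counting I mean is below. It is not
the actual vm_pageout code, and the names and structure are simplified
and hypothetical, but it shows the shape of it: a running count of
consecutive scans that failed to free enough pages, reset whenever a
scan succeeds, with OOM considered only once the count reaches a
vm.pageout_oom_seq-style limit. That is also why raising the tunable
from 12 to 1024 gives slow swap I/O far more time to make progress
before anything is killed.

#include <stdbool.h>

struct pagedaemon_state {
    int failed_passes;  /* consecutive scans that left a shortage */
    int oom_seq_limit;  /* tunable, e.g. vm.pageout_oom_seq=1024 */
};

/*
 * Called after each page-daemon scan.  "shortage_left" is true when
 * the scan could not free enough pages to meet its target.
 * Conceptual illustration only.
 */
static bool
pageout_should_oom(struct pagedaemon_state *pd, bool shortage_left)
{
    if (!shortage_left) {
        pd->failed_passes = 0;  /* progress was made: reset the count */
        return (false);
    }
    pd->failed_passes++;
    /* Only many consecutive failed scans lead to OOM kills. */
    return (pd->failed_passes >= pd->oom_seq_limit);
}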

This handling can stop making progress at freeing RAM, and so fail to
keep a sufficient amount of RAM free, well before all the swap space
is used. Thrashing does not require that all the swap space be used.

There might be no swap space set up at all. (I've done this on a 128
GiByte RAM configuration.) "OOM" kills still can happen: dirty pages
and clean pages have no place to be written out to in order to increase
free memory. If free RAM gets low, OOM kills start.


> If my picture is not wholly incorrect, it isn't a huge leap to ask for
> swap device-by-device, and accept swap from the device that offers it first.
> In the da0 vs mmcsd0 case, ask for swap on each in turn, first to say yes gets
> the business. The busier one will get beaten in the race by the more
> idle device, relieving the bottleneck to the extent of the faster device's
> capacity. It isn't perfect, but it's an improvement.

See my comments above about needing to reorganize where the RAM
pages go in the swap partitions on the fly over time.


===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)


