Memory management issue on RPi?

Fri Nov 13 18:32:49 UTC 2015

On Fri, Nov 13, 2015 at 08:12:04AM -0700, Warner Losh wrote:
> On Fri, Nov 13, 2015 at 1:23 AM, Michael Tuexen <tuexen at freebsd.org> wrote:
> 
> > > On 12 Nov 2015, at 21:03, Konstantin Belousov <kostikbel at gmail.com> wrote:
> > >
> > > On Thu, Nov 12, 2015 at 08:47:29PM +0100, Michael Tuexen wrote:
> > >>> On 12 Nov 2015, at 19:09, Konstantin Belousov <kostikbel at gmail.com> wrote:
> > >>>
> > >>> On Thu, Nov 12, 2015 at 06:57:03PM +0100, Michael Tuexen wrote:
> > >>>>> On 12 Nov 2015, at 18:12, Konstantin Belousov <kostikbel at gmail.com> wrote:
> > >>>>>
> > >>>>> On Thu, Nov 12, 2015 at 05:25:37PM +0100, Michael Tuexen wrote:
> > >>>>>>> On 12 Nov 2015, at 13:18, Konstantin Belousov <kostikbel at gmail.com> wrote:
> > >>>>>>> This is a known problem with the swap-less OOM.  The following patch
> > >>>>>>> should give you an immediate relief.  You might want to tweak
> > >>>>>>> sysctl vm.pageout_oom_seq if default value is not right, it was selected
> > >>>>>>> by 'try and see' approach on very small (32 or 64MB) i386 VM.
> > >>>>>> It just works... Will do some more testing...
> > >>>>>
> > >>>>> I am more interested in report if OOM was triggered when it should.
> > >>>> How do I know? What output do you want to see?
> > >>>>
> > >>>> Best regards
> > >>>> Michael
> > >>>>>
> > >>>>> Try running several instances of 'sort /dev/zero'.
> > >>> ^^^^^^^^^^^^^ I already answered this.
> > >>> Run sort /dev/zero, and see whether OOM fires.
> > >> OK, now I understand. You want to see if some processes are getting killed.
> > >> (I was thinking that you might want to see some sysctl counters or so).
> > >>
> > >> Results:
> > >> * I'm able to compile/link/install a kernel from source. This was not
> > >>  possible before.
> > >> * When running three instances of sort /dev/zero, two of them get killed
> > >>  after a while (less than a minute). One continued to run, but also got
> > >>  killed eventually. All via ssh login.
> > > Exactly, this is the experiment I want to occur, and even more,
> > > the results are good.
> > Any plans to commit it?
> >
> 
> These changes are good as an experiment. The RPi's CPU is fast
> relative to the extremely slow SD card to which pages are laundered,
> so deferring the calls to the actual OOM a bit is useful. However,
> a simple count won't self-scale. What's good for the RPi is
> likely poor for a CPU connected to faster storage. The OOM won't kill
> things quickly enough in those circumstances. I imagine that there may
> be a more complicated relationship between the rate of page dirtying
> and laundering.
The biggest problem case fixed by this approach is the *swap-less*
setup, where the speed of the slow storage does not matter at all for
the speed of the pagedaemon, since there is no swap.

>
> I'd hope that there'd be some kind of scaling that would take this
> variation into account.
>
> At Netflix, we're developing some patches to do more pro-active
> laundering of pages rather than waiting for the page daemon to kick
> in. We do this primarily to avoid flushing the UMA caches, which has
> performance implications that we need to address to smooth out the
> performance. Perhaps something like this would be a more general way
> to cope with this situation?
Page laundering speed cannot be a factor in deciding to trigger OOM.
If you can clean up something, then OOM must not be fired.

The patch does not trigger OOM immediately when no progress is made,
because it expects that some delay might indeed allow the async I/O
to finish and provide some pages to cover the deficit. Only when
progress stalls completely does the ticking for OOM start.

Several iterations are performed before the deadlock is claimed. There
is no good heuristic that I could formulate to provide a suitable
iteration count. But the current value was tested on both small
(32-64M) and large (32GB) machines and found satisfactory. Even then,
it is run-time tunable, so the operator can set a better-suited
value.
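
As a rough illustration of the counting described above (a sketch only,
not the patch itself): a counter of consecutive passes without progress
is reset whenever a pass reclaims anything, and OOM is declared only
once the counter reaches the threshold. The names oom_ticker,
oom_ticker_init and oom_ticker_pass are hypothetical and do not appear
in the kernel; the oom_seq field merely plays the role of the
vm.pageout_oom_seq tunable.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model, not the FreeBSD code: counts consecutive
 * pagedaemon passes that made no progress.
 */
struct oom_ticker {
	int stalled_passes;	/* consecutive passes with no progress */
	int oom_seq;		/* threshold, in the role of vm.pageout_oom_seq */
};

static void
oom_ticker_init(struct oom_ticker *t, int oom_seq)
{
	assert(oom_seq > 0);
	t->stalled_passes = 0;
	t->oom_seq = oom_seq;
}

/*
 * Record one pass.  pages_freed is the number of pages the pass
 * managed to reclaim.  Returns true when OOM should be declared.
 */
static bool
oom_ticker_pass(struct oom_ticker *t, int pages_freed)
{
	if (pages_freed > 0) {
		/* Any progress at all restarts the count. */
		t->stalled_passes = 0;
		return (false);
	}
	t->stalled_passes++;
	if (t->stalled_passes < t->oom_seq)
		return (false);
	/* Stalled for oom_seq passes in a row: claim the deadlock. */
	t->stalled_passes = 0;
	return (true);
}
```

With a threshold of 3, two stalled passes followed by a pass that frees
even a single page start the count over, so three more stalled passes
are needed before the function returns true.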

OOM means that user data is lost. Netflix might not care, due to
the specifics of the load, but I and most other users do care about
our data. I always prefer the kernel flushing the caches (not only UMA
caches, but also pv entries, UFS dirhashes, GPU unpinned buffers etc)
over deadlocking or killing the browser where I had filled in a long
form, or a text editor, or any other unrecoverable state.  If OOM is not
fatal for your data, you can reduce the value of the tunable to prefer
kernel caches over the user data.

And, to make it clear, the current code which triggers OOM does not make
much sense. It mostly takes the count of free pages as the indicator
of an OOM condition, which fails to account for the simple fact that
queued pages may be laundered or discarded. As a result, a false OOM is
triggered, and it is easier to get a false trigger on a swap-less system
due to swap being always 'full'.  This is orthogonal to the issue of the
pagedaemon performance.