Memory management issue on RPi?

Warner Losh imp at bsdimp.com
Fri Nov 13 19:58:11 UTC 2015


On Fri, Nov 13, 2015 at 11:32 AM, Konstantin Belousov <kostikbel at gmail.com>
wrote:

> On Fri, Nov 13, 2015 at 08:12:04AM -0700, Warner Losh wrote:
> > > On Fri, Nov 13, 2015 at 1:23 AM, Michael Tuexen <tuexen at freebsd.org> wrote:
> >
> > > > On 12 Nov 2015, at 21:03, Konstantin Belousov <kostikbel at gmail.com> wrote:
> > > >
> > > > On Thu, Nov 12, 2015 at 08:47:29PM +0100, Michael Tuexen wrote:
> > > >>> On 12 Nov 2015, at 19:09, Konstantin Belousov <kostikbel at gmail.com> wrote:
> > > >>>
> > > >>> On Thu, Nov 12, 2015 at 06:57:03PM +0100, Michael Tuexen wrote:
> > > >>>>> On 12 Nov 2015, at 18:12, Konstantin Belousov <kostikbel at gmail.com> wrote:
> > > >>>>>
> > > >>>>> On Thu, Nov 12, 2015 at 05:25:37PM +0100, Michael Tuexen wrote:
> > > >>>>>>> On 12 Nov 2015, at 13:18, Konstantin Belousov <kostikbel at gmail.com> wrote:
> > > >>>>>>> This is a known problem with the swap-less OOM.  The following patch
> > > >>>>>>> should give you immediate relief.  You might want to tweak the
> > > >>>>>>> sysctl vm.pageout_oom_seq if the default value is not right; it was
> > > >>>>>>> selected by a 'try and see' approach on a very small (32 or 64MB) i386 VM.
> > > >>>>>> It just works... Will do some more testing...
> > > >>>>>
> > > >>>>> I am more interested in a report on whether OOM was triggered when it should be.
> > > >>>> How do I know? What output do you want to see?
> > > >>>>
> > > >>>> Best regards
> > > >>>> Michael
> > > >>>>>
> > > >>>>> Try running several instances of 'sort /dev/zero'.
> > > >>> ^^^^^^^^^^^^^ I already answered this.
> > > >>> Run sort /dev/zero, and see whether OOM fires.
> > > >> OK, now I understand. You want to see if some processes are getting killed.
> > > >> (I was thinking that you might want to see some sysctl counters or so).
> > > >>
> > > >> Results:
> > > >> * I'm able to compile/link/install a kernel from source. This was
> > > >>   not possible before.
> > > >> * When running three instances of sort /dev/zero, two of them got
> > > >>   killed after a while (less than a minute). One continued to run,
> > > >>   but also got killed eventually. All via ssh login.
> > > > Exactly, this is the experiment I wanted to see, and what's more, the
> > > > results are good.
> > > Any plans to commit it?
> > >
> >
> > These changes are good as an experiment. On the RPi, the CPU is fast
> > relative to the extremely slow SD card that pages are laundered to,
> > so deferring the calls to the actual OOM a bit is useful. However,
> > a simple count won't self-scale. What's good for the RPi is likely
> > poor for a CPU connected to faster storage: the OOM won't kill
> > things quickly enough in those circumstances. I imagine there may
> > be a more complicated relationship between the rate of page dirtying
> > and the rate of laundering.
> The biggest problematic case fixed by this approach is the *swap-less*
> setup, where the speed of the slow storage does not matter at all for
> the speed of the pagedaemon, since there is no swap.
>
> >
> > I'd hope that there'd be some kind of scaling that would take this
> > variation into account.
> >
> > At Netflix, we're developing some patches to do more proactive
> > laundering of pages rather than waiting for the page daemon to kick
> > in. We do this primarily to avoid flushing the UMA caches, which has
> > performance implications we need to address to smooth out
> > performance. Perhaps something like this would be a more general way
> > to cope with this situation?
> Page laundering speed cannot be a factor in deciding to trigger OOM.
> If you can clean up something, then OOM must not be fired.
>
> The patch does not trigger OOM immediately when no progress is made,
> because it expects that some delay might indeed allow the async I/O
> to finish and provide some pages to cover the deficit. Only when
> progress stalls completely does the ticking toward OOM start.
>
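
(Aside, for anyone who wants to reproduce Michael's test without sort(1):
a trivial memory hog exercises the same path. This is purely illustrative
C, not part of any patch; adjust the chunk size to taste.)

/*
 * Illustrative memory hog: allocate and touch anonymous memory until the
 * OOM killer (or a malloc failure) stops us.  Roughly the same pressure
 * that several "sort /dev/zero" instances put on the VM system.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    const size_t chunk = 16 * 1024 * 1024;    /* 16MB per step */
    size_t total = 0;
    char *p;

    for (;;) {
        p = malloc(chunk);
        if (p == NULL) {
            printf("malloc failed after %zu MB\n", total >> 20);
            break;
        }
        memset(p, 0xa5, chunk);        /* force the pages to be dirtied */
        total += chunk;
        printf("touched %zu MB\n", total >> 20);
    }
    return (0);
}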

I don't understand how the argument above that the speed at which we
launder pages doesn't matter squares with the argument here that waiting
for async I/O to complete helps avoid declaring OOM.

My main concern with a counting heuristic is that it doesn't take into
account how long the I/O actually takes; it relies on the count to stand
in for that. It's cool that the count is a sysctl, but it seems that
scaling automatically to the I/O speed or rate would be better.
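
(For concreteness, here is my reading of the counting approach, as a toy
sketch rather than the actual patch: count consecutive page daemon passes
that reclaim nothing, reset on any progress, and declare OOM only once
the count exceeds the tunable. The names and the default of 12 below are
made up for illustration.)

/*
 * Toy paraphrase of the "count stalled passes before OOM" idea.
 * Not the actual FreeBSD patch; all identifiers are illustrative.
 */
#include <stdbool.h>
#include <stdio.h>

static int pageout_oom_seq = 12;    /* no-progress passes before OOM */

/* Returns true when the page daemon should declare OOM. */
static bool
pagedaemon_pass(int pages_freed, int *oom_ticks)
{
    if (pages_freed > 0) {
        /*
         * Any progress (async I/O completed, pages laundered or
         * discarded) resets the deadlock counter.
         */
        *oom_ticks = 0;
        return (false);
    }
    /*
     * No progress this pass: tick toward OOM, giving pending async
     * I/O a few more passes to complete before giving up.
     */
    return (++*oom_ticks > pageout_oom_seq);
}

int
main(void)
{
    /* Simulated reclaim results per pass: progress, then a long stall. */
    int freed[] = { 32, 5, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
    int ticks = 0;

    for (size_t i = 0; i < sizeof(freed) / sizeof(freed[0]); i++) {
        if (pagedaemon_pass(freed[i], &ticks)) {
            printf("pass %zu: OOM declared after %d stalled passes\n",
                i, ticks);
            break;
        }
    }
    return (0);
}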


> Several iterations are performed before the deadlock is claimed. There
> is no good heuristic I could formulate to provide a suitable iteration
> count, but the current value was tested on both small (32-64MB) and
> large (32GB) machines and found satisfactory. Even then, it is run-time
> tunable so that the operator can set a better-suited value.
>
> OOM means that user data is lost. Netflix might not care, due to the
> specifics of its load, but I and most other users do care about our
> data. I always prefer the kernel flushing its caches (not only UMA
> caches, but also pv entries, UFS dirhashes, unpinned GPU buffers, etc.)
> over deadlocking or killing the browser where I filled in a long form,
> or a text editor, or any other unrecoverable state.  If OOM is not fatal
> for your data, you can reduce the value of the tunable to prefer kernel
> caches over the user data.
>
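
(For anyone who wants to experiment with the tunable mentioned above, it
can be read and set at run time with sysctlbyname(3); something like the
following should work. The value 120 is only an example, not a
recommendation, and setting it requires root.)

/*
 * Read and (optionally) raise vm.pageout_oom_seq at run time.
 * Purely illustrative; the right value is workload-dependent.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
    int oldval, newval = 120;    /* example value only */
    size_t len = sizeof(oldval);

    if (sysctlbyname("vm.pageout_oom_seq", &oldval, &len, NULL, 0) == -1) {
        perror("sysctlbyname(read)");
        return (1);
    }
    printf("vm.pageout_oom_seq is currently %d\n", oldval);

    /* Larger values mean more stalled passes before the OOM kill fires. */
    if (sysctlbyname("vm.pageout_oom_seq", NULL, NULL, &newval,
        sizeof(newval)) == -1)
        perror("sysctlbyname(write)");
    return (0);
}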

I agree that preserving user data is desirable. The changes we made keep
more pages available to the system so we don't hit the low memory
situation in the page daemon as often. That's what keeps the UMA caches
and the other low memory handlers in the system from firing on our
systems. It's not that we force them not to fire; rather, pages become
available more quickly than on a system where the page daemon alone is
triggering scans. Netflix's workload shuffles a lot of pages into and out
of memory, so proactive laundering before there's a big shortage helps a
lot.
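
(To illustrate what I mean by "low memory handlers": kernel subsystems
hook the vm_lowmem event and drain their caches when the page daemon
signals a shortage, roughly like the sketch below. The "mycache" consumer
is hypothetical; only the eventhandler plumbing is real.)

/*
 * Sketch of a vm_lowmem consumer.  "mycache" is hypothetical; the point
 * is only that handlers like this fire when the page daemon decides
 * memory is short, which proactive laundering makes a rarer event.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/eventhandler.h>

static void
mycache_lowmem(void *arg __unused, int flags __unused)
{
    /* A real consumer would free cached items back to the VM here. */
    printf("mycache: vm_lowmem fired, draining cache\n");
}

static void
mycache_init(void *arg __unused)
{
    EVENTHANDLER_REGISTER(vm_lowmem, mycache_lowmem, NULL,
        EVENTHANDLER_PRI_ANY);
}
SYSINIT(mycache, SI_SUB_KLD, SI_ORDER_ANY, mycache_init, NULL);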


> And, to make it clear, the current code which triggers OOM does not make
> much sense. It mostly takes the count of free pages as the indicator of
> an OOM condition, which fails to account for the simple fact that queued
> pages may be laundered or discarded. As a result, false OOM is
> triggered, and it is easier to get a false trigger on a swap-less system
> because the swap is always 'full'.  This is orthogonal to the issue of
> pagedaemon performance.
>

My thinking was that the proactive laundering patches we have push things
out to disk sooner and return pages to the free pool more quickly than
waiting for the page daemon to do it. That keeps more pages free than the
current setup does, which should help avoid triggering OOM in the first
place.
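
(Speaking loosely, the idea is a dirty-page watermark: start laundering
when the number of dirty pages crosses a high threshold and stop at a low
one, instead of waiting for a free-page shortage. A toy userland
illustration follows; the numbers are arbitrary and this is not our
actual patch set.)

/*
 * Toy illustration of watermark-driven ("proactive") laundering: start
 * writing dirty pages back once a dirty threshold is crossed, rather
 * than waiting until free pages run short.
 */
#include <stdio.h>

#define DIRTY_HIGH      200    /* start laundering above this */
#define DIRTY_LOW       50     /* stop laundering below this */
#define LAUNDER_BATCH   40     /* pages cleaned per pass */

int
main(void)
{
    int dirty = 0, laundering = 0;

    for (int tick = 0; tick < 30; tick++) {
        dirty += 25;                /* workload dirties pages */
        if (dirty > DIRTY_HIGH)
            laundering = 1;         /* crossed the high watermark */
        if (laundering) {
            dirty -= LAUNDER_BATCH; /* clean a batch, freeing pages */
            if (dirty <= DIRTY_LOW)
                laundering = 0;
        }
        printf("tick %2d: dirty=%3d laundering=%d\n",
            tick, dirty, laundering);
    }
    return (0);
}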

I also agree that the current OOM triggers aren't the best.

To be clear, I'm also not trying to stand in the way of committing the
code. I'm trying to ask questions to see if there's a better way to
accomplish the same thing.

Warner

