The out-of-swap killer makes poor choices
    Konstantin Belousov 
    kostikbel at gmail.com
       
    Wed Feb 24 11:02:38 UTC 2021
    
    
  
On Tue, Feb 23, 2021 at 04:29:46PM -0700, Alan Somers wrote:
> On Tue, Feb 23, 2021 at 3:36 PM Konstantin Belousov <kostikbel at gmail.com> wrote:
> 
> > On Tue, Feb 23, 2021 at 02:20:21PM -0700, Alan Somers wrote:
> > > On Tue, Feb 23, 2021 at 2:11 PM Konstantin Belousov <kostikbel at gmail.com> wrote:
> > >
> > > > On Tue, Feb 23, 2021 at 01:49:49PM -0700, Alan Somers wrote:
> > > > > To me it's always seemed like the out-of-swap killer kills the wrong
> > > > > process.  Oh, it does the right thing with a trivial while(1)
> > > > > {malloc()} test program, but not with real workloads.  To summarize
> > > > > the logic in vm_pageout_oom:
> > > > >
> > > > > * Don't kill system, protected, or killed processes
> > > > > * Don't kill processes with a thread that isn't running or suspended
> > > > > * Kill whichever process is using the most swap or swap + ram,
> > > > > depending on the shortage variable.  On ties, kill the newest one.
> > > > >
> > > > > This algorithm probably made sense in the days when computers had
> > > > > much more swap than RAM.  But now it leads to several problems:
> > > > >
> > > > > * It's almost guaranteed to do the wrong thing when shortage ==
> > > > > VM_OOM_SWAPZ and there is little or no swap configured.  If no swap
> > > > > is configured, it will kill the newest running or suspended process.
> > > > > If a little bit is configured, it will probably kill some idle
> > > > > process, like zfsd, that is swapped out because it doesn't run very
> > > > > often.
> > > > >
> > > > > * Even if multiple GB of swap are configured, the OOM killer is still
> > > > > biased towards killing idle processes when shortage == VM_OOM_SWAPZ.
> > > > > Most often, the process responsible for an out-of-memory condition is
> > > > > not idle, and is consuming large amounts of RAM.
> > > > >
> > > > > * It ignores RLIMIT_RSS.  We consider that rlimit when deciding
> > > > > whether to move a process from RAM to swap.
> > > > >
> > > > > * The "out of swap space" kernel message doesn't specify whether the
> > > > > process was killed because of insufficient swap or RAM (the shortage
> > > > > variable)
> > > > >
> > > > > I propose the following changes:
> > > > >
> > > > > * Incorporate shortage into the "out of swap space" message.
> > > > ok with me, not sure if users could make any action based on discretion
> > > >
> > > > > * When walking the process list, if any process exceeds its
> > > > > RLIMIT_RSS, choose it immediately, without bothering to compare it to
> > > > > older processes.
> > > > RSS was never supposed to be a limit on how many pages are resident.
> > > > It only provided some preference for more aggressive paging out
> > > > process' pages.
> > > >
> > > > Or put it differently, RSS is not supposed to be the working set size
> > > > in VMS/NT sense.
> > > >
> > >
> > > Sure, but given that we must kill _something_, preferentially killing a
> > > process that was specifically limited sounds better than killing a
> > > process that wasn't, won't you agree?
> > Semantic of RLIMIT_RSS is not to limit, but to give preference for pageout.
> > Changing it to the semantic of 'preference for OOM' would give the similar
> > complaint.
> >
> > >
> > >
> > > >
> > > > > * Always consider the sum of a process's RAM + swap, regardless of
> > > > > the shortage variable.
> > > > >
> > > > > Does this make sense?  Am I missing something about shortage ==
> > > > > VM_OOM_SWAPZ?  I don't understand why you would ever want to exclude
> > > > > processes' RAM usage.  That logic was added in revision
> > > > > 2025d69ba7a68a5af173007a8072c45ad797ea23, but I don't understand the
> > > > > rationale.
> > > >
> > > > SWAPZ means that swap zone is exhausted.  In this case, killing a
> > > > process that does not use swap, would not free any space in the zone.
> > > > Similarly, we should select a process with largest swap (== metadata
> > > > kept in swap zone) use to free something in swap zone.
> > > >
> > >
> > > But killing a process that does not use swap could reduce the need for
> > > more swap by other processes.  How many cases are there where a process
> > > needs more SWAP and won't settle for RAM instead?
> > Both choices are somewhat random.  The goal is to get more swap zone slack,
> > and this is what the code tried to target.
> >
> > In fact, if OOM kills largest RAM+swap consumer, then with the small swap
> > there is huge chance that swap is not freed, and then on the next nearby
> > pageout attempt some more process would be killed, perhaps innocently.
> >
> > OOM purpose is not to smoother operation of over-committed system, but
> > to have it survive (avoid low resources deadlock) to the state where it
> > can be examined and possibly corrected.
> >
> > >
> > >
> > > >
> > > > In other words, such kill could be not enough and really require more
> > > > and more rounds of OOM, esp. on machine with very small swap
> > > > configured.
> >
> >
> Ok, I'll abandon this idea.
No OOM algorithm would ever satisfy everybody.
I explained the reasoning for the current design, even if it actually
evolved this way instead of being written as a whole with the stated goal.
I do not object to adding something that would help it fit different
goals as well, but the current idea of making the system survive should
be kept.
I remember Linux has more advanced controls to guide OOM decisions.
We only have the 'protected' flag, which should prevent the killer from
ever touching a specific process, like sshd.
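
For readers following the thread, the selection rule as Alan summarized it
can be sketched roughly as below. This is a simplified model in Python, not
the kernel code: the Proc fields, the Shortage names other than SWAPZ, and
pick_victim itself are all assumptions made up for illustration, and the
real vm_pageout_oom certainly differs in detail.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Iterable, Optional


class Shortage(Enum):
    MEM = "mem"      # hypothetical name: short on pages, count swap + RAM
    SWAPZ = "swapz"  # swap zone exhausted (VM_OOM_SWAPZ in the thread)


@dataclass(frozen=True)
class Proc:
    # All fields are illustrative, not kernel structures.
    name: str
    start_order: int          # larger value means a newer process
    swap_pages: int
    resident_pages: int
    protected: bool = False   # system, protected, or already killed
    threads_ok: bool = True   # every thread is running or suspended


def pick_victim(procs: Iterable[Proc], shortage: Shortage) -> Optional[Proc]:
    """Return the process the summarized rule would kill, or None."""
    victim: Optional[Proc] = None
    best = 0
    for p in procs:
        if p.protected or not p.threads_ok:
            continue
        # SWAPZ counts swap only; otherwise count swap plus resident pages.
        size = p.swap_pages
        if shortage is Shortage.MEM:
            size += p.resident_pages
        newer_tie = (
            victim is not None
            and size == best
            and p.start_order > victim.start_order
        )
        if victim is None or size > best or newer_tie:
            victim, best = p, size
    return victim


procs = [
    Proc("zfsd", 1, swap_pages=50, resident_pages=1),
    Proc("hog", 2, swap_pages=0, resident_pages=10000),
    Proc("sshd", 3, swap_pages=500, resident_pages=5, protected=True),
    Proc("sh", 4, swap_pages=0, resident_pages=10),
]
no_swap = [replace(p, swap_pages=0) for p in procs]

print(pick_victim(procs, Shortage.SWAPZ).name)    # idle, swapped-out zfsd
print(pick_victim(procs, Shortage.MEM).name)      # the RAM hog
print(pick_victim(no_swap, Shortage.SWAPZ).name)  # newest eligible process
```

With a little swap configured, the SWAPZ case picks the idle process that
happens to be swapped out, which is Alan's complaint. It is also the only
eligible process here whose death would free anything in the swap zone,
which is the point made above.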
    
    