OOM killer and kernel cache reclamation rate limit in vm_pageout_scan()

Fri Oct 17 14:32:10 UTC 2014

On Wed, Oct 15, 2014 at 11:56:33PM -0600, Justin T. Gibbs wrote:
> avg pointed out the rate limiting code in vm_pageout_scan() during discussion about PR 187594.  While it certainly can contribute to the problems discussed in that PR, a bigger problem is that it can allow the OOM killer to be triggered even though there is plenty of reclaimable memory available in the system.  Any load that can consume enough pages within the polling interval to hit the v_free_min threshold (e.g. multiple 'dd if=/dev/zero of=/file/on/zfs') can make this happen.
> 
> The product I'm working on does not have swap configured and treats any OOM trigger as fatal, so it is very obvious when this happens. :-)
> 
> I've tried several things to mitigate the problem.  The first was to ignore rate limiting for pass 2.  However, even though ZFS is guaranteed to receive some feedback prior to OOM being declared, my testing showed that a trivial load (a couple dd operations) could still consume enough of the reclaimed space to leave the system below its target at the end of pass 2.  After removing the rate limiting entirely, I've so far been unable to kill the system via a ZFS-induced load.
> 
> I understand the motivation behind the rate limiting, but the current implementation seems too simplistic to be safe.  The documentation for the Solaris slab allocator provides good motivation for their approach of using a "sliding average" to rein in temporary bursts of usage without unduly harming efficient service for the recorded steady-state memory demand.  Regardless of the approach taken, I believe that the OOM killer must be a last resort and shouldn't be called when there are caches that can be culled.
> 
> One other thing I've noticed in my testing with ZFS is that it needs feedback and a little time to react to memory pressure.  Calling its lowmem handler just once isn't enough for it to limit in-flight writes so it can avoid reuse of pages that it just freed up.  But, it doesn't take too long to react (> 1sec in the profiling I've done).  Is there a way in vm_pageout_scan() that we can better record that progress is being made (pages were freed in the pass, even if some/all of them were consumed again) and allow more passes before the OOM killer is invoked in this case?
> 
> --
> Justin
https://docs.freebsd.org/cgi/getmsg.cgi?fetch=103436+0+/usr/local/www/db/text/2014/freebsd-hackers/20141012.freebsd-hackers
might have some relevance.