[SOLVED] Re: Strange behavior after running under high load

Sun Apr 4 19:24:04 UTC 2021

On Sun, Apr 04, 2021 at 07:01:44PM +0000, Poul-Henning Kamp wrote:
> --------
> Konstantin Belousov writes:
> 
> > But what would you provide as the input for PID controller, and what would be the targets?
> 
> Viewing this purely as a vnode related issue is wrong, this is about memory allocation in general.
> 
> We may or may not want a PID regulator, but putting it on counts of vnodes would not improve things, precisely, as you point out, because the amount of memory a vnode ties up has enormous variance.
> 
Yes

> 
> We should focus on the end goal: To ensure "sufficient" memory can always be allocated for any purpose "without major delay".
> 
and no

> 
> Architecturally there are three major problems:
> 
> A) While each subsystem generally has a good idea about memory that can be released "without major delay", the information does not trickle up through a summarizing NUMA aware tree.
> 
> B) We lack a nuanced call-back to tell the subsystems to release some of their memory "without major delay".
Delay in the wall-clock sense does not drive the issue.
We cannot expect any I/O to proceed while we are low on memory, in the sense
that allocators cannot respond right now.  More and more, our I/O subsystem
requires allocating memory to make any progress with I/O.  This is already
quite bad with GEOM, although some hacks make it less noticeable.

It is very bad with ZFS, where swap on zvols causes deadlocks almost
immediately.

> 
> C) We have never attempted to enlist userland, where jemalloc often hangs on to a lot of unused VM pages.
> 
Userland does not add to this problem, because the pagedaemon typically has
enough processing power to convert user-allocated pages into usable clean
or free pages.  Of course, if there is no swap and dirty anonymous pages
cannot be laundered, the issue would accumulate.

But normally the operating system does not have an issue with user pages.

> 
> As far as vnodes go:
> 
> 
> It used to be that "without major delay" meant "without disk-I/O" which again led to the "dirty buffers/VM pages" heuristic.
> 
> With microsecond SSD backing store, that heuristic is not only invalid, it is downright harmful in many cases.
> 
> GEOM maintains estimates of per-provider latency and VM+VFS should use that to schedule write-back so that more of it happens outside rush-hour, in order to increase the amount of memory which can be released "without major delay".
> 
> Today that happens largely as a side effect of the periodic syncer, which does a really bad job at it, because it still expects VAX-era hardware performance and workloads.
> 
I/O latency is not the factor there.  We must avoid situations where
instantiating a vnode stalls waiting for KVA to appear; similarly, we
must avoid a system state where vnode allocation has consumed so much
kmem that other allocations stall.

It is quite indicative that we do not shrink the vnode list on low-memory
events.  Vnlru also does not account for memory pressure.
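
To make concrete what shrinking on a low-memory event could look like,
here is a minimal userland sketch.  It is a toy model only: struct
toy_cache, toy_cache_trim() and toy_cache_lowmem() are hypothetical names,
the halving policy and the floor are assumptions picked for illustration,
and nothing here is existing vnlru or VM code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy cache; these names do not exist in the kernel. */
struct toy_cache {
	size_t count;	/* objects currently cached */
	size_t target;	/* soft limit the recycler trims toward */
	size_t floor;	/* the target is never lowered below this */
};

/* Recycler pass: drop entries until count <= target; return number freed. */
static size_t
toy_cache_trim(struct toy_cache *c)
{
	size_t freed = 0;

	if (c->count > c->target) {
		freed = c->count - c->target;
		c->count = c->target;
	}
	return (freed);
}

/*
 * Low-memory callback (assumed policy): halve the target, clamp it to
 * the floor, never raise it, then trim immediately.
 */
static size_t
toy_cache_lowmem(struct toy_cache *c)
{
	size_t t = c->target / 2;
	size_t freed;

	if (t < c->floor)
		t = c->floor;
	if (t < c->target)
		c->target = t;
	freed = toy_cache_trim(c);
	assert(c->count <= c->target);
	return (freed);
}
```

The sketch counts objects, which is exactly the weak point noted earlier
in the thread: the memory tied up per vnode varies a lot, so a real policy
would need to be driven by allocator state rather than by a fixed fraction
of a count.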

The problem is that it is not clear how to express the relation between
a safe allocator state and our desire to cache file system data, which is
bound to vnode identity.