Re: swap_pager: cannot allocate bio

From: Mark Johnston <markj_at_freebsd.org>
Date: Sat, 20 Nov 2021 18:09:46 UTC
On Mon, Nov 15, 2021 at 05:08:29PM +0200, Andriy Gapon wrote:
> On 15/11/2021 16:50, Mark Johnston wrote:
> > On Mon, Nov 15, 2021 at 04:20:26PM +0200, Andriy Gapon wrote:
> >> On 15/11/2021 05:26, Chris Ross wrote:
> >>> A procstat -kka output is available (208 KB of text, 1441 lines) at
> >>> https://pastebin.com/SvDcvRvb
> >>
> >>      67 100542 pagedaemon          dom0                mi_switch+0xc1
> >> _cv_wait+0xf2 arc_wait_for_eviction+0x1df arc_lowmem+0xca
> >> vm_pageout_worker+0x3c4 vm_pageout+0x1d7 fork_exit+0x8a fork_trampoline+0xe
> >>
> >> I was always of the opinion that waiting for ARC reclaim in arc_lowmem() was
> >> wrong.  This is an example of why.
> >>
> >>> A top command run over ssh completed and shows:
> >>>
> >>> last pid: 91551;  load averages:  0.00,  0.02,  0.30  up 2+00:19:33    22:23:15
> >>> 40 processes:  1 running, 38 sleeping, 1 zombie
> >>> CPU:  3.9% user,  0.0% nice,  0.9% system,  0.0% interrupt, 95.2% idle
> >>> Mem: 58G Active, 210M Inact, 1989M Laundry, 52G Wired, 1427M Buf, 12G Free
> >>
> >> To me it looks like there is still plenty of free memory.
> >>
> >> I am not sure why vm_wait_domain (called by vm_page_alloc_noobj_domain) is not
> >> waking up.
> > 
> > It's a deadlock: the page daemon is sleeping on the arc evict thread,
> > and the arc evict thread is waiting for memory:
> 
> My point was that waiting for free memory was not strictly necessary yet 
> given the 12G free, but that's kind of obvious.
> 
> >   2561 100722 zfskern             arc_evict
> >   mi_switch+0xc1 _sleep+0x1cb vm_wait_doms+0xe2 vm_wait_domain+0x51
> >   vm_page_alloc_noobj_domain+0x184 uma_small_alloc+0x62 keg_alloc_slab+0xb0
> >   zone_import+0xee zone_alloc_item+0x6f arc_evict_state+0x81 arc_evict_cb+0x483
> >   zthr_procedure+0xba fork_exit+0x8a fork_trampoline+0xe
> > 
> > I presume this is from the marker allocations in arc_evict_state().
> > 
> > The second problem is that UMA is refusing to try to allocate from the
> > "wrong" NUMA domain, but that policy seems overly strict.  Fixing that
> > alone would make the problem harder to hit, but I think it wouldn't
> > solve it completely.
> 
> Yes, I propose to remove the wait for ARC evictions from arc_lowmem().

The problem with this is that the page daemon won't account for ARC
evictions when reclaiming memory from the page queues.  We also need to
generalize the vm_lowmem eventhandler so that either
- caches can promise to free N pages and then shrink themselves
  asynchronously, or
- the page daemon can ask the cache to free N pages, where N is derived
  from the ratio of the cache size to the total amount of RAM, and the
  cache can be shrunk asynchronously.

I'm not sure how easy it is to get this information from the ARC.
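
Roughly, the second option might look like the sketch below.  To be
clear, this is just a userland mock-up to illustrate the shape of the
interface; none of these names (lowmem_cache, lowmem_shortage, and so
on) exist in the tree today, and the sizes are made up:

#include <stdio.h>
#include <stddef.h>

struct lowmem_cache {
	const char	*lc_name;
	size_t		(*lc_size)(void *);	/* cache size in pages */
	void		(*lc_shrink_async)(void *, size_t); /* start eviction */
	void		*lc_arg;
};

/* An example machine: 64GB of RAM in 4KB pages. */
static const size_t total_ram_pages = 16UL * 1024 * 1024;

/*
 * Ask each registered cache for a share of the shortage proportional
 * to the cache's share of RAM.  Crucially, this never sleeps waiting
 * for the eviction to complete.
 */
static void
lowmem_shortage(struct lowmem_cache *caches, int ncaches, size_t shortage)
{
	size_t target;
	int i;

	for (i = 0; i < ncaches; i++) {
		target = shortage * caches[i].lc_size(caches[i].lc_arg) /
		    total_ram_pages;
		if (target != 0)
			caches[i].lc_shrink_async(caches[i].lc_arg, target);
	}
}

/* A stand-in for the ARC; 52GB is an arbitrary example size. */
static size_t
arc_size(void *arg)
{
	(void)arg;
	return (13UL * 1024 * 1024);
}

static void
arc_shrink_async(void *arg, size_t npages)
{
	(void)arg;
	/* In the kernel this would signal the arc evict thread and return. */
	printf("arc: asked to evict %zu pages asynchronously\n", npages);
}

int
main(void)
{
	struct lowmem_cache caches[] = {
		{ "arc", arc_size, arc_shrink_async, NULL },
	};

	/* A 128MB (32768-page) shortage; the ARC gets ~13/16 of it. */
	lowmem_shortage(caches, 1, 32768);
	return (0);
}

The key point is that lowmem_shortage() never sleeps on the eviction
itself, so the page daemon can't deadlock against the evicting thread
the way it does in the stacks above.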

> Another thing that may help a bit is having a greater "slack" between a 
> threshold where the page daemon starts paging out and a threshold where memory 
> allocations start to wait (via vm_wait_domain).
> 
> Also, I think that for a long time we had a problem (though I'm not sure if 
> it's still present) where allocations succeeded without waiting until the free 
> memory went below a certain threshold M, but once a thread started waiting in 
> vm_wait it would not be woken up until the free memory went above another 
> threshold N.  And the problem was that N >> M.  In other words, a lot of 
> memory had to be freed (and not grabbed by other threads) before the waiting 
> thread would be woken up.

This is perhaps still an issue, though maybe not as noticeable now that
the page daemon runs more frequently and sets its target based on the
recent history of the page shortage, rather than on static high/low
watermark thresholds.