Re: swap_pager: cannot allocate bio

From: Andriy Gapon <avg_at_freebsd.org>
Date: Mon, 15 Nov 2021 15:08:29 UTC
On 15/11/2021 16:50, Mark Johnston wrote:
> On Mon, Nov 15, 2021 at 04:20:26PM +0200, Andriy Gapon wrote:
>> On 15/11/2021 05:26, Chris Ross wrote:
>>> A procstat -kka output is available (208 KB of text, 1441 lines) at
>>> https://pastebin.com/SvDcvRvb
>>
>>      67 100542 pagedaemon          dom0                mi_switch+0xc1
>> _cv_wait+0xf2 arc_wait_for_eviction+0x1df arc_lowmem+0xca
>> vm_pageout_worker+0x3c4 vm_pageout+0x1d7 fork_exit+0x8a fork_trampoline+0xe
>>
>> I was always of the opinion that waiting for ARC reclaim in arc_lowmem was
>> wrong.  This shows an example of why.
>>
>>> A top command run over ssh completed and shows:
>>>
>>> last pid: 91551;  load averages:  0.00,  0.02,  0.30  up 2+00:19:33    22:23:15
>>> 40 processes:  1 running, 38 sleeping, 1 zombie
>>> CPU:  3.9% user,  0.0% nice,  0.9% system,  0.0% interrupt, 95.2% idle
>>> Mem: 58G Active, 210M Inact, 1989M Laundry, 52G Wired, 1427M Buf, 12G Free
>>
>> To me it looks like there is still plenty of free memory.
>>
>> I am not sure why vm_wait_domain (called by vm_page_alloc_noobj_domain) is not
>> waking up.
> 
> It's a deadlock: the page daemon is sleeping on the arc evict thread,
> and the arc evict thread is waiting for memory:

My point was that, with 12G free, waiting for free memory was not strictly 
needed yet, but that's kind of obvious.

>   2561 100722 zfskern             arc_evict
>   mi_switch+0xc1 _sleep+0x1cb vm_wait_doms+0xe2 vm_wait_domain+0x51
>   vm_page_alloc_noobj_domain+0x184 uma_small_alloc+0x62 keg_alloc_slab+0xb0
>   zone_import+0xee zone_alloc_item+0x6f arc_evict_state+0x81 arc_evict_cb+0x483
>   zthr_procedure+0xba fork_exit+0x8a fork_trampoline+0xe
> 
> I presume this is from the marker allocations in arc_evict_state().
> 
> The second problem is that UMA is refusing to try to allocate from the
> "wrong" NUMA domain, but that policy seems overly strict.  Fixing that
> alone would make the problem harder to hit, but I think it wouldn't
> solve it completely.

Yes, I propose to remove the wait for ARC evictions from arc_lowmem().
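
For illustration, roughly what I have in mind (a sketch from memory, not a 
patch; the helper names are the ones I recall from the FreeBSD arc_os.c and 
may be off):

static void
arc_lowmem(void *arg __unused, int howto __unused)
{
	int64_t to_free;

	/* Shrink the ARC target and kick the evict thread... */
	arc_no_grow = B_TRUE;
	to_free = arc_c >> arc_shrink_shift;
	arc_reduce_target_size(to_free);
	zthr_wakeup(arc_evict_zthr);

	/*
	 * ...but never sleep here.  The arc_wait_for_eviction() call is
	 * what lets the page daemon deadlock against arc_evict, which
	 * may itself wait for free pages.
	 */
}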

Another thing that may help a bit is having greater "slack" between the 
threshold where the page daemon starts paging out and the threshold where 
memory allocations start to wait (via vm_wait_domain).
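
In terms of the per-domain counters that would be roughly (I am using the 
struct vm_domain field names as I remember them; treat the exact relation as 
an assumption):

	struct vm_domain *vmd = VM_DOMAIN(domain);

	/*
	 * The page daemon starts reclaiming when the free count drops
	 * below vmd_free_target; allocations block in vm_wait_domain()
	 * only when it drops below vmd_free_min.  The "slack" is the
	 * gap between the two:
	 */
	u_int slack = vmd->vmd_free_target - vmd->vmd_free_min;

	/* The proposal is to widen that gap, giving the page daemon
	 * more headroom to free pages before any allocator sleeps. */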

Also, I think that for a long time we had a problem (though I am not sure if 
it's still present) where allocations succeeded without waiting until the free 
memory went below a certain threshold M, but once a thread started waiting in 
vm_wait it would not be woken up until the free memory went above another 
threshold N.  And the problem was that N >> M.  In other words, a lot of memory 
had to be freed (and not grabbed by other threads) before the waiting thread 
would be woken up.
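
Schematically (M and N here are placeholders, not the real kernel symbols):

	/* Allocation path: no sleep while the free count is above M. */
	if (vmd->vmd_free_count > M)
		return (vm_page_alloc(...));

	/*
	 * Wakeup path: threads sleeping in vm_wait() are woken only
	 * once the free count climbs above N, and N >> M, so a lot of
	 * memory must be freed (and not immediately taken by others)
	 * before a waiter runs again.
	 */
	if (vmd->vmd_free_count > N)
		wakeup(&vmd->vmd_free_count);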

>> Perhaps this is some sort of a NUMA-related issue where one memory domain is
>> exhausted while other(s) still have a lot of memory.
>> Or maybe it's something else, but it must be some sort of a bug.
>>
>>> ARC: 48G Total, 10G MFU, 38G MRU, 128K Anon, 106M Header, 23M Other
>>>        46G Compressed, 46G Uncompressed, 1.00:1 Ratio
>>> Swap: 425G Total, 3487M Used, 422G Free

-- 
Andriy Gapon