From: Andriy Gapon <avg@freebsd.org>
To: Mark Johnston
Cc: Chris Ross, freebsd-fs
Subject: Re: swap_pager: cannot allocate bio
Date: Mon, 15 Nov 2021 17:08:29 +0200
List-Archive: https://lists.freebsd.org/archives/freebsd-fs

On 15/11/2021 16:50, Mark Johnston wrote:
> On Mon, Nov 15, 2021 at 04:20:26PM +0200, Andriy Gapon wrote:
>> On 15/11/2021 05:26, Chris Ross wrote:
>>> A procstat -kka output is available (208kb of text, 1441 lines) at
>>> https://pastebin.com/SvDcvRvb
>>
>>    67 100542 pagedaemon       dom0             mi_switch+0xc1
>> _cv_wait+0xf2 arc_wait_for_eviction+0x1df arc_lowmem+0xca
>> vm_pageout_worker+0x3c4 vm_pageout+0x1d7 fork_exit+0x8a fork_trampoline+0xe
>>
>> I was always of the opinion that waiting for ARC reclaim in arc_lowmem()
>> was wrong.  This shows an example of why.
>>
>>> An ssh of a top command completed and shows:
>>>
>>> last pid: 91551;  load averages: 0.00, 0.02, 0.30  up 2+00:19:33  22:23:15
>>> 40 processes:  1 running, 38 sleeping, 1 zombie
>>> CPU:  3.9% user,  0.0% nice,  0.9% system,  0.0% interrupt, 95.2% idle
>>> Mem: 58G Active, 210M Inact, 1989M Laundry, 52G Wired, 1427M Buf, 12G Free
>>
>> To me it looks like there is still plenty of free memory.
>>
>> I am not sure why vm_wait_domain (called by vm_page_alloc_noobj_domain)
>> is not waking up.
>
> It's a deadlock: the page daemon is sleeping on the arc evict thread,
> and the arc evict thread is waiting for memory:

My point was that waiting for free memory was not strictly needed yet,
given 12G free, but that's kind of obvious.

>  2561 100722 zfskern          arc_evict
> mi_switch+0xc1 _sleep+0x1cb vm_wait_doms+0xe2 vm_wait_domain+0x51
> vm_page_alloc_noobj_domain+0x184 uma_small_alloc+0x62 keg_alloc_slab+0xb0
> zone_import+0xee zone_alloc_item+0x6f arc_evict_state+0x81 arc_evict_cb+0x483
> zthr_procedure+0xba fork_exit+0x8a fork_trampoline+0xe
>
> I presume this is from the marker allocations in arc_evict_state().
>
> The second problem is that UMA is refusing to try to allocate from the
> "wrong" NUMA domain, but that policy seems overly strict.  Fixing that
> alone would make the problem harder to hit, but I think it wouldn't
> solve it completely.

Yes, I propose to remove the wait for ARC evictions from arc_lowmem()
(a rough sketch is below, after the other two points).

Another thing that may help a bit is having greater "slack" between the
threshold at which the page daemon starts paging out and the threshold
at which memory allocations start to wait (via vm_wait_domain).

Also, I think that for a long time we had a problem (I am not sure
whether it is still present) where allocations succeeded without waiting
until free memory dropped below a certain threshold M, but once a thread
started waiting in vm_wait it would not be woken up until free memory
rose above another threshold N, with N >> M.  In other words, a lot of
memory had to be freed (and not grabbed by other threads) before the
waiting thread would be woken up.
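The arc_lowmem() change would look roughly like this (an untested
sketch with simplified signatures; how_much_to_free() is a made-up
stand-in for the existing to_free computation, which I am eliding):

    static void
    arc_lowmem(void *arg __unused, int howto __unused)
    {
            int64_t to_free;

            arc_no_grow = B_TRUE;
            arc_warm = B_TRUE;

            /* how_much_to_free() is hypothetical, not the real helper. */
            to_free = how_much_to_free();
            arc_reduce_target_size(to_free);

            /*
             * Deliberately no arc_wait_for_eviction() here, not even for
             * the page daemon: the evict thread picks up the lowered
             * target asynchronously, so the page daemon can never end up
             * sleeping on a thread that is itself waiting for pages.
             */
    }

The cost is that allocations can overshoot a bit while eviction catches
up, but that seems much preferable to the deadlock above.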
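And to illustrate the M/N problem, a contrived sketch (the numbers, the
struct, and the domain_sleep()/domain_wakeup() primitives are all made
up for illustration; this is not the actual vm_wait code):

    struct vmdomain {
            long    free_count;     /* free pages in this domain */
    };

    #define FREE_MIN        4096L   /* M: allocators sleep below this */
    #define FREE_TARGET    65536L   /* N: waiters are woken above this */

    static void domain_sleep(struct vmdomain *);    /* ~ vm_wait_domain() */
    static void domain_wakeup(struct vmdomain *);   /* ~ wakeup() */

    static void
    alloc_page(struct vmdomain *vmd)
    {
            while (vmd->free_count < FREE_MIN)
                    domain_sleep(vmd);
            vmd->free_count--;
    }

    static void
    pages_freed(struct vmdomain *vmd, long n)
    {
            vmd->free_count += n;
            /*
             * The problem: with FREE_TARGET >> FREE_MIN, roughly N - M
             * pages must be freed, and stay free, before the wakeup
             * fires, even though the sleeping thread may need only a
             * single page.
             */
            if (vmd->free_count >= FREE_TARGET)
                    domain_wakeup(vmd);
    }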
>> Perhaps this is some sort of a NUMA-related issue where one memory
>> domain is exhausted while other(s) still have a lot of memory.
>> Or maybe it's something else, but it must be some sort of a bug.
>>
>>> ARC: 48G Total, 10G MFU, 38G MRU, 128K Anon, 106M Header, 23M Other
>>>      46G Compressed, 46G Uncompressed, 1.00:1 Ratio
>>> Swap: 425G Total, 3487M Used, 422G Free

-- 
Andriy Gapon