From nobody Mon Feb 21 15:13:29 2022
Date: Mon, 21 Feb 2022 15:13:29 GMT
Message-Id: <202202211513.21LFDTtH013896@gitrepo.freebsd.org>
To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org,
    dev-commits-src-branches@FreeBSD.org
From: Mark Johnston
Subject: git: 005452c350f5 - stable/13 - Avoid memory allocations in the ARC eviction thread
List-Id: Commits to the stable branches of the FreeBSD src repository
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-branches
Sender: owner-dev-commits-src-branches@freebsd.org
X-BeenThere: dev-commits-src-branches@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Git-Committer: markj
X-Git-Repository: src
X-Git-Refname: refs/heads/stable/13
X-Git-Reftype: branch
X-Git-Commit: 005452c350f5f2f3e16267557f3365a198336868
Auto-Submitted: auto-generated

The branch stable/13 has been updated by markj:

URL: https://cgit.FreeBSD.org/src/commit/?id=005452c350f5f2f3e16267557f3365a198336868

commit 005452c350f5f2f3e16267557f3365a198336868
Author:     Mark Johnston
AuthorDate: 2022-01-21 18:28:13 +0000
Commit:     Mark Johnston
CommitDate: 2022-02-21 14:59:23 +0000

    Avoid memory allocations in the ARC eviction thread

    When the eviction thread goes to shrink an ARC state, it allocates a set
    of marker buffers used to hold its place in the state's sublists. This
    can be problematic in low memory conditions, since
    1) the allocation can be substantial, as we allocate NCPU markers;
    2) on at least FreeBSD, page reclamation can block in
       arc_wait_for_eviction()

    In particular, in stress tests it's possible to hit a deadlock on
    FreeBSD when the number of free pages is very low, wherein the system is
    waiting for the page daemon to reclaim memory, the page daemon is
    waiting for the ARC eviction thread to finish, and the ARC eviction
    thread is blocked waiting for more memory.

    Try to reduce the likelihood of such deadlocks by pre-allocating markers
    for the eviction thread at ARC initialization time. When evicting
    buffers from an ARC state, check to see if the current thread is the ARC
    eviction thread, and use the pre-allocated markers for that purpose
    rather than dynamically allocating them.
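The pattern described in the commit message — pre-allocate resources at init time, detect whether the caller is the dedicated thread, and fall back to dynamic allocation for everyone else — can be sketched in userspace C. This is a minimal illustration under assumed names; `pool_init`, `get_markers`, and `put_markers` are hypothetical helpers, not part of the OpenZFS API:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NMARKERS 8

/* Pool of markers allocated once at init, reserved for the eviction thread. */
static void *marker_pool[NMARKERS];
static pthread_t evictor_tid;

static void pool_init(pthread_t evictor) {
    evictor_tid = evictor;
    for (int i = 0; i < NMARKERS; i++)
        marker_pool[i] = calloc(1, 64); /* stand-in for hdr_full_cache objects */
}

/*
 * Hand out markers: the eviction thread gets the pre-allocated pool (no
 * allocation, so it cannot block waiting for memory it is itself supposed
 * to free); any other caller allocates a private set.
 */
static void **get_markers(int count, int *dynamic) {
    if (pthread_equal(pthread_self(), evictor_tid)) {
        assert(count <= NMARKERS);
        *dynamic = 0;
        return marker_pool;
    }
    *dynamic = 1;
    void **m = calloc(count, sizeof(*m));
    for (int i = 0; i < count; i++)
        m[i] = calloc(1, 64);
    return m;
}

static void put_markers(void **m, int count, int dynamic) {
    if (!dynamic)
        return; /* pool markers persist until teardown */
    for (int i = 0; i < count; i++)
        free(m[i]);
    free(m);
}
```

The key property, mirrored by the real change below, is that the reclaim path itself never calls the allocator; only non-critical callers pay for (and must release) their own markers.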
    Reviewed-by: Brian Behlendorf
    Reviewed-by: Alexander Motin
    Reviewed-by: George Amanakis
    Signed-off-by: Mark Johnston
    Closes #12985

    (cherry picked from commit 6e2a59181e286a397d260fa9f140b58688d60c58)
---
 sys/contrib/openzfs/include/sys/zthr.h |   1 +
 sys/contrib/openzfs/module/zfs/arc.c   | 138 ++++++++++++++++++++-------------
 sys/contrib/openzfs/module/zfs/zthr.c  |   6 ++
 3 files changed, 92 insertions(+), 53 deletions(-)

diff --git a/sys/contrib/openzfs/include/sys/zthr.h b/sys/contrib/openzfs/include/sys/zthr.h
index 19be89eeebe5..4881b4572641 100644
--- a/sys/contrib/openzfs/include/sys/zthr.h
+++ b/sys/contrib/openzfs/include/sys/zthr.h
@@ -38,6 +38,7 @@ extern void zthr_resume(zthr_t *t);
 extern void zthr_wait_cycle_done(zthr_t *t);
 
 extern boolean_t zthr_iscancelled(zthr_t *t);
+extern boolean_t zthr_iscurthread(zthr_t *t);
 extern boolean_t zthr_has_waiters(zthr_t *t);
 
 #endif /* _SYS_ZTHR_H */
diff --git a/sys/contrib/openzfs/module/zfs/arc.c b/sys/contrib/openzfs/module/zfs/arc.c
index 875986de99f0..149ceffc25b3 100644
--- a/sys/contrib/openzfs/module/zfs/arc.c
+++ b/sys/contrib/openzfs/module/zfs/arc.c
@@ -328,6 +328,8 @@ static zthr_t *arc_reap_zthr;
 * arc_evict(), which improves arc_is_overflowing().
 */
 static zthr_t *arc_evict_zthr;
+static arc_buf_hdr_t **arc_state_evict_markers;
+static int arc_state_evict_marker_count;
 
 static kmutex_t arc_evict_lock;
 static boolean_t arc_evict_needed = B_FALSE;
@@ -4149,6 +4151,38 @@ arc_evict_state_impl(multilist_t *ml, int idx, arc_buf_hdr_t *marker,
 	return (bytes_evicted);
 }
 
+/*
+ * Allocate an array of buffer headers used as placeholders during arc state
+ * eviction.
+ */
+static arc_buf_hdr_t **
+arc_state_alloc_markers(int count)
+{
+	arc_buf_hdr_t **markers;
+
+	markers = kmem_zalloc(sizeof (*markers) * count, KM_SLEEP);
+	for (int i = 0; i < count; i++) {
+		markers[i] = kmem_cache_alloc(hdr_full_cache, KM_SLEEP);
+
+		/*
+		 * A b_spa of 0 is used to indicate that this header is
+		 * a marker. This fact is used in arc_evict_type() and
+		 * arc_evict_state_impl().
+		 */
+		markers[i]->b_spa = 0;
+
+	}
+	return (markers);
+}
+
+static void
+arc_state_free_markers(arc_buf_hdr_t **markers, int count)
+{
+	for (int i = 0; i < count; i++)
+		kmem_cache_free(hdr_full_cache, markers[i]);
+	kmem_free(markers, sizeof (*markers) * count);
+}
+
 /*
 * Evict buffers from the given arc state, until we've removed the
 * specified number of bytes. Move the removed buffers to the
@@ -4180,19 +4214,15 @@ arc_evict_state(arc_state_t *state, uint64_t spa, uint64_t bytes,
	 * pick up where we left off for each individual sublist, rather
	 * than starting from the tail each time.
	 */
-	markers = kmem_zalloc(sizeof (*markers) * num_sublists, KM_SLEEP);
+	if (zthr_iscurthread(arc_evict_zthr)) {
+		markers = arc_state_evict_markers;
+		ASSERT3S(num_sublists, <=, arc_state_evict_marker_count);
+	} else {
+		markers = arc_state_alloc_markers(num_sublists);
+	}
 	for (int i = 0; i < num_sublists; i++) {
 		multilist_sublist_t *mls;
 
-		markers[i] = kmem_cache_alloc(hdr_full_cache, KM_SLEEP);
-
-		/*
-		 * A b_spa of 0 is used to indicate that this header is
-		 * a marker. This fact is used in arc_evict_type() and
-		 * arc_evict_state_impl().
-		 */
-		markers[i]->b_spa = 0;
-
 		mls = multilist_sublist_lock(ml, i);
 		multilist_sublist_insert_tail(mls, markers[i]);
 		multilist_sublist_unlock(mls);
@@ -4274,10 +4304,9 @@ arc_evict_state(arc_state_t *state, uint64_t spa, uint64_t bytes,
 		multilist_sublist_t *mls = multilist_sublist_lock(ml, i);
 		multilist_sublist_remove(mls, markers[i]);
 		multilist_sublist_unlock(mls);
-
-		kmem_cache_free(hdr_full_cache, markers[i]);
 	}
-	kmem_free(markers, sizeof (*markers) * num_sublists);
+	if (markers != arc_state_evict_markers)
+		arc_state_free_markers(markers, num_sublists);
 
 	return (total_evicted);
 }
@@ -7594,53 +7623,52 @@ arc_tuning_update(boolean_t verbose)
 	WARN_IF_TUNING_IGNORED(zfs_arc_sys_free, arc_sys_free, verbose);
 }
 
+static void
+arc_state_multilist_init(multilist_t *ml,
+    multilist_sublist_index_func_t *index_func, int *maxcountp)
+{
+	multilist_create(ml, sizeof (arc_buf_hdr_t),
+	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), index_func);
+	*maxcountp = MAX(*maxcountp, multilist_get_num_sublists(ml));
+}
+
 static void
 arc_state_init(void)
 {
-	multilist_create(&arc_mru->arcs_list[ARC_BUFC_METADATA],
-	    sizeof (arc_buf_hdr_t),
-	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
-	    arc_state_multilist_index_func);
-	multilist_create(&arc_mru->arcs_list[ARC_BUFC_DATA],
-	    sizeof (arc_buf_hdr_t),
-	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
-	    arc_state_multilist_index_func);
-	multilist_create(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA],
-	    sizeof (arc_buf_hdr_t),
-	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
-	    arc_state_multilist_index_func);
-	multilist_create(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA],
-	    sizeof (arc_buf_hdr_t),
-	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
-	    arc_state_multilist_index_func);
-	multilist_create(&arc_mfu->arcs_list[ARC_BUFC_METADATA],
-	    sizeof (arc_buf_hdr_t),
-	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
-	    arc_state_multilist_index_func);
-	multilist_create(&arc_mfu->arcs_list[ARC_BUFC_DATA],
-	    sizeof (arc_buf_hdr_t),
-	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
-	    arc_state_multilist_index_func);
-	multilist_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA],
-	    sizeof (arc_buf_hdr_t),
-	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
-	    arc_state_multilist_index_func);
-	multilist_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA],
-	    sizeof (arc_buf_hdr_t),
-	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
-	    arc_state_multilist_index_func);
+	int num_sublists = 0;
+
+	arc_state_multilist_init(&arc_mru->arcs_list[ARC_BUFC_METADATA],
+	    arc_state_multilist_index_func, &num_sublists);
+	arc_state_multilist_init(&arc_mru->arcs_list[ARC_BUFC_DATA],
+	    arc_state_multilist_index_func, &num_sublists);
+	arc_state_multilist_init(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA],
+	    arc_state_multilist_index_func, &num_sublists);
+	arc_state_multilist_init(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA],
+	    arc_state_multilist_index_func, &num_sublists);
+	arc_state_multilist_init(&arc_mfu->arcs_list[ARC_BUFC_METADATA],
+	    arc_state_multilist_index_func, &num_sublists);
+	arc_state_multilist_init(&arc_mfu->arcs_list[ARC_BUFC_DATA],
+	    arc_state_multilist_index_func, &num_sublists);
+	arc_state_multilist_init(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA],
+	    arc_state_multilist_index_func, &num_sublists);
+	arc_state_multilist_init(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA],
+	    arc_state_multilist_index_func, &num_sublists);
+
 	/*
	 * L2 headers should never be on the L2 state list since they don't
	 * have L1 headers allocated. Special index function asserts that.
	 */
-	multilist_create(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA],
-	    sizeof (arc_buf_hdr_t),
-	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
-	    arc_state_l2c_multilist_index_func);
-	multilist_create(&arc_l2c_only->arcs_list[ARC_BUFC_DATA],
-	    sizeof (arc_buf_hdr_t),
-	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
-	    arc_state_l2c_multilist_index_func);
+	arc_state_multilist_init(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA],
+	    arc_state_l2c_multilist_index_func, &num_sublists);
+	arc_state_multilist_init(&arc_l2c_only->arcs_list[ARC_BUFC_DATA],
+	    arc_state_l2c_multilist_index_func, &num_sublists);
+
+	/*
+	 * Keep track of the number of markers needed to reclaim buffers from
+	 * any ARC state. The markers will be pre-allocated so as to minimize
+	 * the number of memory allocations performed by the eviction thread.
+	 */
+	arc_state_evict_marker_count = num_sublists;
 
 	zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
 	zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
@@ -7985,6 +8013,8 @@ arc_init(void)
 		kstat_install(arc_ksp);
 	}
 
+	arc_state_evict_markers =
+	    arc_state_alloc_markers(arc_state_evict_marker_count);
 	arc_evict_zthr = zthr_create("arc_evict",
 	    arc_evict_cb_check, arc_evict_cb, NULL, defclsyspri);
 	arc_reap_zthr = zthr_create_timer("arc_reap",
@@ -8052,6 +8082,8 @@ arc_fini(void)
 	(void) zthr_cancel(arc_evict_zthr);
 	(void) zthr_cancel(arc_reap_zthr);
 
+	arc_state_free_markers(arc_state_evict_markers,
+	    arc_state_evict_marker_count);
 	mutex_destroy(&arc_evict_lock);
 	list_destroy(&arc_evict_waiters);
diff --git a/sys/contrib/openzfs/module/zfs/zthr.c b/sys/contrib/openzfs/module/zfs/zthr.c
index 33fdda7b68d1..52ddffae7aaa 100644
--- a/sys/contrib/openzfs/module/zfs/zthr.c
+++ b/sys/contrib/openzfs/module/zfs/zthr.c
@@ -469,6 +469,12 @@ zthr_iscancelled(zthr_t *t)
 	return (cancelled);
 }
 
+boolean_t
+zthr_iscurthread(zthr_t *t)
+{
+	return (t->zthr_thread == curthread);
+}
+
 /*
 * Wait for the zthr to finish its current function. Similar to
 * zthr_iscancelled, you can use zthr_has_waiters to have the zthr_func end