From: Alexander Leidinger <Alexander@Leidinger.net>
To: Mateusz Guzik
Cc: Konstantin Belousov, current@freebsd.org
Date: Mon, 04 Sep 2023 08:19:47 +0200
Subject: Re: Speed improvements in ZFS
Message-ID: <1d0d37f27e4898f1604c6ddc6ad3e831@Leidinger.net>
In-Reply-To: <076f09cc0b99643072d8b80a6ec5b03b@Leidinger.net>
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
On 2023-08-28 22:33, Alexander Leidinger wrote:
> On 2023-08-22 18:59, Mateusz Guzik wrote:
>> On 8/22/23, Alexander Leidinger wrote:
>>> On 2023-08-21 10:53, Konstantin Belousov wrote:
>>>> On Mon, Aug 21, 2023 at 08:19:28AM +0200, Alexander Leidinger wrote:
>>>>> On 2023-08-20 23:17, Konstantin Belousov wrote:
>>>>> > On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:
>>>>> > > On 8/20/23, Alexander Leidinger wrote:
>>>>> > > > On 2023-08-20 22:02, Mateusz Guzik wrote:
>>>>> > > >> On 8/20/23, Alexander Leidinger wrote:
>>>>> > > >>> On 2023-08-20 19:10, Mateusz Guzik wrote:
>>>>> > > >>>> On 8/18/23, Alexander Leidinger wrote:
>>>>> > > >>>
>>>>> > > >>>>> I have a 51MB text file, compressed to about 1MB. Are you
>>>>> > > >>>>> interested in getting it?
>>>>> > > >>>>>
>>>>> > > >>>>
>>>>> > > >>>> Your problem is not the vnode limit, but nullfs.
>>>>> > > >>>>
>>>>> > > >>>> https://people.freebsd.org/~mjg/netchild-periodic-find.svg
>>>>> > > >>>
>>>>> > > >>> 122 nullfs mounts on this system, and every jail I set up has
>>>>> > > >>> several null mounts: one base system mounted into every jail,
>>>>> > > >>> and then shared ports (packages/distfiles/ccache) across all
>>>>> > > >>> of them.
>>>>> > > >>>
>>>>> > > >>>> First, some of the contention is the notorious VI_LOCK needed
>>>>> > > >>>> in order to do anything.
>>>>> > > >>>>
>>>>> > > >>>> But more importantly, the mind-boggling off-CPU time comes
>>>>> > > >>>> from exclusive locking which should not be there to begin
>>>>> > > >>>> with -- as in, that xlock in stat should be an slock.
>>>>> > > >>>>
>>>>> > > >>>> Maybe I'm going to look into it later.
>>>>> > > >>>
>>>>> > > >>> That would be fantastic.
>>>>> > > >>>
>>>>> > > >>
>>>>> > > >> I did a quick test; things are shared-locked as expected.
>>>>> > > >>
>>>>> > > >> However, I found the following:
>>>>> > > >>     if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
>>>>> > > >>         mp->mnt_kern_flag |= lowerrootvp->v_mount->mnt_kern_flag &
>>>>> > > >>             (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
>>>>> > > >>             MNTK_EXTENDED_SHARED);
>>>>> > > >>     }
>>>>> > > >>
>>>>> > > >> Are you using the "nocache" option? It has a side effect of
>>>>> > > >> xlocking.
>>>>> > > >
>>>>> > > > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
>>>>> > > >
>>>>> > >
>>>>> > > If you don't have "nocache" on null mounts, then I don't see how
>>>>> > > this could happen.
>>>>> >
>>>>> > There is also MNTK_NULL_NOCACHE on the lower fs, which is currently
>>>>> > set for fuse and nfs at least.
>>>>>
>>>>> 11 of those 122 nullfs mounts are ZFS datasets which are also NFS
>>>>> exported. 6 of those nullfs mounts are also exported via Samba. The
>>>>> NFS exports shouldn't be needed anymore; I will remove them.
>>>> By nfs I meant the nfs client, not nfs exports.
>>>
>>> No NFS client mounts anywhere on this system. So where is this
>>> exclusive lock coming from then...
>>> This is a ZFS system, 2 pools: one for the root, one for anything I
>>> need space for. Both pools reside on the same disks. The root pool is
>>> a 3-way mirror, the "space-pool" is a 5-disk raidz2. All jails are on
>>> the space-pool. The jails are all basejail-style jails.
>>>
>>
>> While I don't see why xlocking happens, you should be able to dtrace
>> or printf your way into finding out.
>
> dtrace looks to me like a faster approach to get to the root cause than
> printf... my first naive try is to detect exclusive locks.
> I'm not 100% sure I got it right, but at least dtrace doesn't complain
> about it:
> ---snip---
> #pragma D option dynvarsize=32m
>
> fbt:nullfs:null_lock:entry
> /(args[0]->a_flags & 0x080000) != 0/
> {
>     stack();
> }
> ---snip---
>
> In which direction should I look with dtrace if this works in tonight's
> run of periodic? I don't have enough knowledge about VFS to come up
> with some immediate ideas.

After your sysctl fix for maxvnodes I increased the number of vnodes
10-fold compared to the initial report. This has increased the speed of
the operation: the find runs in all those jails finished today after
~5h (at ~8am) instead of in the afternoon as before. Could this suggest
that some null_reclaim() is running in parallel, which takes the
exclusive locks and slows down the entire operation?

Bye,
Alexander.

-- 
http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netchild@FreeBSD.org : PGP 0x8F31830F9F2772BF