From: Mateusz Guzik <mjguzik@gmail.com>
Date: Mon, 4 Sep 2023 14:26:05 +0200
Subject: Re: Speed improvements in ZFS
To: Alexander Leidinger
Cc: Konstantin Belousov, current@freebsd.org
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
On 9/4/23, Alexander Leidinger wrote:
> Am 2023-08-28 22:33, schrieb Alexander Leidinger:
>> Am 2023-08-22 18:59, schrieb Mateusz Guzik:
>>> On 8/22/23, Alexander Leidinger wrote:
>>>> Am 2023-08-21 10:53, schrieb Konstantin Belousov:
>>>>> On Mon, Aug 21, 2023 at 08:19:28AM +0200, Alexander Leidinger wrote:
>>>>>> Am 2023-08-20 23:17, schrieb Konstantin Belousov:
>>>>>> > On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:
>>>>>> > > On 8/20/23, Alexander Leidinger wrote:
>>>>>> > > > Am 2023-08-20 22:02, schrieb Mateusz Guzik:
>>>>>> > > >> On 8/20/23, Alexander Leidinger wrote:
>>>>>> > > >>> Am 2023-08-20 19:10, schrieb Mateusz Guzik:
>>>>>> > > >>>> On 8/18/23, Alexander Leidinger wrote:
>>>>>> > > >>>>> I have a 51MB text file, compressed to about 1MB. Are you
>>>>>> > > >>>>> interested to get it?
>>>>>> > > >>>>
>>>>>> > > >>>> Your problem is not the vnode limit, but nullfs.
>>>>>> > > >>>>
>>>>>> > > >>>> https://people.freebsd.org/~mjg/netchild-periodic-find.svg
>>>>>> > > >>>
>>>>>> > > >>> 122 nullfs mounts on this system. And every jail I set up has
>>>>>> > > >>> several null mounts. One base system mounted into every jail,
>>>>>> > > >>> and then shared ports (packages/distfiles/ccache) across all
>>>>>> > > >>> of them.
>>>>>> > > >>>
>>>>>> > > >>>> First, some of the contention is the notorious VI_LOCK needed
>>>>>> > > >>>> in order to do anything.
>>>>>> > > >>>>
>>>>>> > > >>>> But more importantly, the mind-boggling off-cpu time comes
>>>>>> > > >>>> from exclusive locking which should not be there to begin
>>>>>> > > >>>> with -- as in, that xlock in stat should be a slock.
>>>>>> > > >>>>
>>>>>> > > >>>> Maybe I'm going to look into it later.
>>>>>> > > >>>
>>>>>> > > >>> That would be fantastic.
>>>>>> > > >>
>>>>>> > > >> I did a quick test, things are shared locked as expected.
>>>>>> > > >>
>>>>>> > > >> However, I found the following:
>>>>>> > > >>         if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
>>>>>> > > >>                 mp->mnt_kern_flag |=
>>>>>> > > >>                     lowerrootvp->v_mount->mnt_kern_flag &
>>>>>> > > >>                     (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
>>>>>> > > >>                      MNTK_EXTENDED_SHARED);
>>>>>> > > >>         }
>>>>>> > > >>
>>>>>> > > >> Are you using the "nocache" option? It has a side effect of
>>>>>> > > >> xlocking.
>>>>>> > > >
>>>>>> > > > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
>>>>>> > >
>>>>>> > > If you don't have "nocache" on null mounts, then I don't see how
>>>>>> > > this could happen.
>>>>>> >
>>>>>> > There is also MNTK_NULL_NOCACHE on the lower fs, which is currently
>>>>>> > set for fuse and nfs at least.
>>>>>>
>>>>>> 11 of those 122 nullfs mounts are ZFS datasets which are also NFS
>>>>>> exported. 6 of those nullfs mounts are also exported via Samba. The
>>>>>> NFS exports shouldn't be needed anymore, I will remove them.
>>>>> By nfs I meant the nfs client, not nfs exports.
>>>>
>>>> No NFS client mounts anywhere on this system. So where is this
>>>> exclusive lock coming from then...
>>>> This is a ZFS system, 2 pools: one for the root, one for anything I
>>>> need space for. Both pools reside on the same disks. The root pool is
>>>> a 3-way mirror, the "space-pool" is a 5-disk raidz2. All jails are on
>>>> the space-pool. The jails are all basejail-style jails.
>>>
>>> While I don't see why xlocking happens, you should be able to dtrace
>>> or printf your way into finding out.
>>
>> dtrace looks to me like a faster approach to get to the root cause than
>> printf... My first naive try is to detect exclusive locks. I'm not 100%
>> sure I got it right, but at least dtrace doesn't complain about it:
>> ---snip---
>> #pragma D option dynvarsize=32m
>>
>> fbt:nullfs:null_lock:entry
>> /(args[0]->a_flags & 0x080000) != 0/
>> {
>>         stack();
>> }
>> ---snip---
>>
>> In which direction should I look with dtrace if this works in tonight's
>> run of periodic? I don't have enough knowledge about VFS to come up
>> with some immediate ideas.
>
> After your sysctl fix for maxvnodes I increased the number of vnodes
> tenfold compared to the initial report. This has increased the speed of
> the operation: the find runs in all those jails finished today after ~5h
> (at ~8am) instead of in the afternoon as before. Could this suggest that
> in parallel some null_reclaim() is running which takes the exclusive
> locks and slows down the entire operation?

That may be a slowdown to some extent, but the primary problem is
exclusive vnode locking for stat lookups, which should not be happening.

--
Mateusz Guzik
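
[Editor's note: one possible next step for the "in which direction should I
look with dtrace" question above is to aggregate the kernel stacks behind
the exclusive lock requests instead of printing each one. The following is a
minimal, untested sketch along those lines, not part of the original thread.
It assumes 0x080000 corresponds to LK_EXCLUSIVE (sys/sys/lockmgr.h) and that
the fbt:nullfs:null_lock:entry probe resolves on the system in question, as
in the quoted script; note the parentheses around the mask test, since "&"
binds weaker than "!=" in D, as in C.]

---snip---
#pragma D option aggsize=32m

/* Count kernel stacks for exclusive lock requests hitting null_lock(). */
fbt:nullfs:null_lock:entry
/(args[0]->a_flags & 0x080000) != 0/
{
        @xstacks[stack()] = count();
}

END
{
        /* Keep only the 20 most frequent stacks and print them. */
        trunc(@xstacks, 20);
        printa(@xstacks);
}
---snip---

[Run for the duration of the periodic job and interrupt with Ctrl-C once the
find runs finish; the dominating stacks should point at the code path that
takes the exclusive lock.]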