Re: Speed improvements in ZFS

From: Alexander Leidinger <Alexander_at_Leidinger.net>
Date: Fri, 15 Sep 2023 10:09:29 UTC
On 2023-09-04 14:26, Mateusz Guzik wrote:
> On 9/4/23, Alexander Leidinger <Alexander@leidinger.net> wrote:
>>> On 2023-08-28 22:33, Alexander Leidinger wrote:
>>>> On 2023-08-22 18:59, Mateusz Guzik wrote:
>>>> On 8/22/23, Alexander Leidinger <Alexander@leidinger.net> wrote:
>>>>>> On 2023-08-21 10:53, Konstantin Belousov wrote:
>>>>>> On Mon, Aug 21, 2023 at 08:19:28AM +0200, Alexander Leidinger 
>>>>>> wrote:
>>>>>>> On 2023-08-20 23:17, Konstantin Belousov wrote:
>>>>>>> > On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:
>>>>>>> > > On 8/20/23, Alexander Leidinger <Alexander@leidinger.net> wrote:
>>>>>>> > > > On 2023-08-20 22:02, Mateusz Guzik wrote:
>>>>>>> > > >> On 8/20/23, Alexander Leidinger <Alexander@leidinger.net>
>>>>>>> > > >> wrote:
>>>>>>> > > >>> On 2023-08-20 19:10, Mateusz Guzik wrote:
>>>>>>> > > >>>> On 8/18/23, Alexander Leidinger <Alexander@leidinger.net>
>>>>>>> > > >>>> wrote:
>>>>>>> > > >>>
>>>>>>> > > >>>>> I have a 51MB text file, compressed to about 1MB. Are you
>>>>>>> > > >>>>> interested in getting it?
>>>>>>> > > >>>>>
>>>>>>> > > >>>>
>>>>>>> > > >>>> Your problem is not the vnode limit, but nullfs.
>>>>>>> > > >>>>
>>>>>>> > > >>>> https://people.freebsd.org/~mjg/netchild-periodic-find.svg
>>>>>>> > > >>>
>>>>>>> > > >>> 122 nullfs mounts on this system. And every jail I set up has
>>>>>>> > > >>> several null mounts: one base system mounted into every jail, and
>>>>>>> > > >>> then shared ports (packages/distfiles/ccache) across all of them.
>>>>>>> > > >>>
>>>>>>> > > >>>> First, some of the contention is the notorious VI_LOCK needed to
>>>>>>> > > >>>> do anything.
>>>>>>> > > >>>>
>>>>>>> > > >>>> But more importantly, the mind-boggling off-CPU time comes from
>>>>>>> > > >>>> exclusive locking which should not be there to begin with -- as
>>>>>>> > > >>>> in, that xlock in stat should be a slock.
>>>>>>> > > >>>>
>>>>>>> > > >>>> Maybe I'm going to look into it later.
>>>>>>> > > >>>
>>>>>>> > > >>> That would be fantastic.
>>>>>>> > > >>>
>>>>>>> > > >>
>>>>>>> > > >> I did a quick test; things are shared-locked as expected.
>>>>>>> > > >>
>>>>>>> > > >> However, I found the following:
>>>>>>> > > >>         if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
>>>>>>> > > >>                 mp->mnt_kern_flag |=
>>>>>>> > > >> lowerrootvp->v_mount->mnt_kern_flag &
>>>>>>> > > >>                     (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
>>>>>>> > > >>                     MNTK_EXTENDED_SHARED);
>>>>>>> > > >>         }
>>>>>>> > > >>
>>>>>>> > > >> Are you using the "nocache" option? It has a side effect of
>>>>>>> > > >> xlocking.
>>>>>>> > > >
>>>>>>> > > > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
>>>>>>> > > >
>>>>>>> > >
>>>>>>> > > If you don't have "nocache" on null mounts, then I don't see how
>>>>>>> > > this could happen.
>>>>>>> >
>>>>>>> > There is also MNTK_NULL_NOCACHE on the lower fs, which is currently
>>>>>>> > set for fuse and nfs at least.
>>>>>>> 
>>>>>>> 11 of those 122 nullfs mounts are ZFS datasets which are also NFS
>>>>>>> exported. 6 of those nullfs mounts are also exported via Samba. The NFS
>>>>>>> exports shouldn't be needed anymore; I will remove them.
>>>>>> By nfs I meant nfs client, not nfs exports.
>>>>> 
>>>>> No NFS client mounts anywhere on this system. So where is this
>>>>> exclusive lock coming from then...
>>>>> This is a ZFS system with 2 pools: one for the root, one for anything I
>>>>> need space for. Both pools reside on the same disks. The root pool is a
>>>>> 3-way mirror, the "space-pool" is a 5-disk raidz2. All jails are on the
>>>>> space-pool. The jails are all basejail-style jails.
>>>>> 
>>>> 
>>>> While I don't see why xlocking happens, you should be able to dtrace
>>>> or printf your way into finding out.
>>> 
>>> dtrace looks to me like a faster approach to get to the root cause than
>>> printf... my first naive try is to detect exclusive locks. I'm not 100%
>>> sure I got it right, but at least dtrace doesn't complain about it:
>>> ---snip---
>>> #pragma D option dynvarsize=32m
>>> 
>>> /* fires for every lock request on a nullfs vnode */
>>> fbt:nullfs:null_lock:entry
>>> /(args[0]->a_flags & 0x080000) != 0/  /* 0x080000 == LK_EXCLUSIVE */
>>> {
>>>         /* print the kernel stack of the exclusive lock request */
>>>         stack();
>>> }
>>> ---snip---
>>> 
>>> In which direction should I look with dtrace if this works in tonight's
>>> run of periodic? I don't have enough knowledge about VFS to come up
>>> with some immediate ideas.
>> 
>> After your sysctl fix for maxvnodes I increased the number of vnodes 10
>> times compared to the initial report. This has increased the speed of
>> the operation; the find runs in all those jails finished today after ~5h
>> (at ~8am) instead of in the afternoon as before. Could this suggest that
>> in parallel some null_reclaim() is running, which takes the exclusive
>> locks and slows down the entire operation?
>> 
> 
> That may be a slowdown to some extent, but the primary problem is
> exclusive vnode locking for stat lookup, which should not be
> happening.
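
To try to pin down where those exclusive lock requests come from, here is
a sketch of an aggregating variant of my D script above (assuming
LK_EXCLUSIVE is still 0x080000 on this kernel and that fbt resolves
null_lock in the nullfs module). It counts the kernel stacks that ask for
an exclusive null_lock, so the dominant caller paths stand out:
---snip---
#pragma D option quiet

fbt:nullfs:null_lock:entry
/(args[0]->a_flags & 0x080000) != 0/  /* 0x080000 == LK_EXCLUSIVE (assumed unchanged) */
{
        /* count each distinct kernel stack requesting an exclusive lock */
        @xstacks[stack()] = count();
}

END
{
        /* keep only the 20 most frequent stacks */
        trunc(@xstacks, 20);
        printa(@xstacks);
}
---snip---
If the top stacks sit in the reclaim path, that would point at vnode
recycling rather than at the stat lookups themselves.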

With -current as of 2023-09-03 (and right now 2023-09-11), the periodic 
daily runs are down to less than an hour... and this didn't happen 
directly after switching to the 2023-09-03 build. First it went down to 
4h, then down to 1h without any update of the OS. The only thing I did 
was change maxvnodes: first to some huge value after your commit 
affecting that sysctl, then, after noticing way more freevnodes than 
configured, down to 500000000.
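
To check my earlier guess that null_reclaim() running in parallel is what
takes the exclusive locks, here is a minimal sketch along the same lines
(again assuming LK_EXCLUSIVE is 0x080000, and assuming fbt can attach to
null_reclaim, which may not work if it gets inlined). It prints per-second
counts of reclaims and exclusive null_lock requests while the periodic run
is active:
---snip---
#pragma D option quiet

/* may not fire at all if null_reclaim is inlined on this kernel */
fbt:nullfs:null_reclaim:entry
{
        @reclaims = count();
}

fbt:nullfs:null_lock:entry
/(args[0]->a_flags & 0x080000) != 0/  /* 0x080000 == LK_EXCLUSIVE (assumed unchanged) */
{
        @xlocks = count();
}

tick-1s
{
        printa("reclaims/s: %@u  xlocks/s: %@u\n", @reclaims, @xlocks);
        clear(@reclaims);
        clear(@xlocks);
}
---snip---
If the reclaim rate stays near zero while the xlock rate stays high, the
reclaim theory is out and the exclusive locking has to come from somewhere
else.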

Bye,
Alexander.

-- 
http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org    netchild@FreeBSD.org  : PGP 0x8F31830F9F2772BF