From: Mateusz Guzik <mjguzik@gmail.com>
Date: Wed, 20 Apr 2022 11:43:10 +0200
Subject: Re: nullfs and ZFS issues
To: Doug Ambrisko
Cc: freebsd-current@freebsd.org
List-Archive: https://lists.freebsd.org/archives/freebsd-current

On 4/19/22, Doug Ambrisko wrote:
> On Tue, Apr 19, 2022 at 11:47:22AM +0200, Mateusz Guzik wrote:
> | Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff
> |
> | this is not committable but should validate whether it works fine
>
> As a POC it's working.  I see the vnode count for the nullfs and
> ZFS go up.  The ARC cache also goes up until it exceeds the ARC max.
> size, then the vnodes for nullfs and ZFS go down.  The ARC cache goes
> down as well.  This all repeats over and over.  The system seems
> healthy.  No excessive running of arc_prune or arc_evict.
>
> My only comment is that the vnode freeing seems a bit aggressive.
> Going from ~15,000 to ~200 vnodes for nullfs and the same for ZFS.
> The ARC drops from 70M to 7M (max is set at 64M) for this unit
> test.
>

Can you check what kind of shrinking is requested by the ARC to begin
with? I imagine encountering a nullfs vnode may end up recycling 2
vnodes instead of 1, but even repeated a lot that does not explain the
above.

>
> | On 4/19/22, Mateusz Guzik wrote:
> | > On 4/19/22, Mateusz Guzik wrote:
> | >> On 4/19/22, Doug Ambrisko wrote:
> | >>> I've switched my laptop to use nullfs and ZFS.  Previously, I used
> | >>> localhost NFS mounts instead of nullfs when nullfs would complain
> | >>> that it couldn't mount.  Since that check has been removed, I've
> | >>> switched to nullfs only.  However, every so often my laptop would
> | >>> get slow and the ARC evict and prune threads would consume two
> | >>> cores at 100% until I rebooted.  I had a 1G max. ARC and have
> | >>> increased it to 2G now.  Looking into this has uncovered some issues:
> | >>>   - nullfs would prevent vnlru_free_vfsops from doing anything
> | >>>     when called from ZFS arc_prune_task
> | >>>   - nullfs would hang onto a bunch of vnodes unless mounted with
> | >>>     nocache
> | >>>   - nullfs and nocache would break untar.  This has been fixed now.
> | >>>
> | >>> With nullfs, nocache and setting max vnodes to a low number I can
> | >>> keep the ARC around the max. without evict and prune consuming
> | >>> 100% of 2 cores.  This doesn't seem like the best solution but it's
> | >>> better than when the ARC starts spinning.
> | >>>
> | >>> Looking into this issue with bhyve and an md drive for testing, I
> | >>> create a brand new zpool mounted as /test and then nullfs mount
> | >>> /test to /mnt.  I loop through untarring the Linux kernel into the
> | >>> nullfs mount, rm -rf it and repeat.  I set the ARC to the smallest
> | >>> value I can.  Untarring the Linux kernel was enough to get the ARC
> | >>> evict and prune to spin since they couldn't evict/prune anything.
> | >>>
> | >>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see:
> | >>>   static int
> | >>>   vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
> | >>>   {
> | >>>           ...
> | >>>           for (;;) {
> | >>>                   ...
> | >>>                   vp = TAILQ_NEXT(vp, v_vnodelist);
> | >>>                   ...
> | >>>
> | >>>                   /*
> | >>>                    * Don't recycle if our vnode is from different type
> | >>>                    * of mount point.  Note that mp is type-safe, the
> | >>>                    * check does not reach unmapped address even if
> | >>>                    * vnode is reclaimed.
> | >>>                    */
> | >>>                   if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
> | >>>                       mp->mnt_op != mnt_op) {
> | >>>                           continue;
> | >>>                   }
> | >>>                   ...
> | >>>
> | >>> The vp ends up being on the nullfs mount and then hits the continue
> | >>> even though the passed-in mvp is on ZFS.  If I do a hack to
> | >>> comment out the continue, then I see the ARC, nullfs vnodes and
> | >>> ZFS vnodes grow.  When the ARC calls arc_prune_task, which calls
> | >>> vnlru_free_vfsops, the vnodes go down for nullfs and ZFS.
> | >>> The ARC cache usage also goes down.  Then they increase again until
> | >>> the ARC gets full and then they go down again.  So with this hack
> | >>> I don't need nocache passed to nullfs and I don't need to limit
> | >>> the max vnodes.  Doing multiple untars in parallel over and over
> | >>> doesn't seem to cause any issues for this test.  I'm not saying
> | >>> commenting out the continue is the fix, just a simple POC test.
> | >>>
> | >>
> | >> I don't see an easy way to say "this is a nullfs vnode holding onto a
> | >> zfs vnode".  Perhaps the routine can be extended with issuing a nullfs
> | >> callback, if the module is loaded.
> | >>
> | >> In the meantime I think a good enough(tm) fix would be to check that
> | >> nothing was freed and fall back to good old regular cleanup without
> | >> filtering by vfsops.  This would be very similar to what you are doing
> | >> with your hack.
> | >>
> | >
> | > Now that I wrote this, perhaps an acceptable hack would be to extend
> | > struct mount with a pointer to the "lower layer" mount (if any) and
> | > patch the vfsops check to also look there.
> | >
> | >>
> | >>> It appears that when ZFS is asking for cached vnodes to be
> | >>> free'd, nullfs also needs to free some up as well so that
> | >>> they are free'd on the VFS level.  It seems that vnlru_free_impl
> | >>> should allow some of the related nullfs vnodes to be free'd so
> | >>> the ZFS ones can be free'd and reduce the size of the ARC.
> | >>>
> | >>> BTW, I also hacked the kernel and mount to show the vnodes used
> | >>> per mount, i.e. mount -v:
> | >>>   test on /test (zfs, NFS exported, local, nfsv4acls, fsid
> | >>>   2b23b2a1de21ed66, vnodes: count 13846 lazy 0)
> | >>>   /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid
> | >>>   11ff002929000000, vnodes: count 13846 lazy 0)
> | >>>
> | >>> Now I can easily see how the vnodes are used without going into ddb.
> | >>> On my laptop I have various vnet jails and nullfs mount my homedir
> | >>> into them, so pretty much everything goes through nullfs to ZFS.  I'm
> | >>> limping along with the nullfs nocache and small number of vnodes but
> | >>> it would be nice to not need that.
> | >>>
> | >>> Thanks,
> | >>>
> | >>> Doug A.
> | >>>
> | >>
> | >> --
> | >> Mateusz Guzik
> | >
> | > --
> | > Mateusz Guzik
> |
> | --
> | Mateusz Guzik
>

--
Mateusz Guzik
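
For concreteness, a rough sketch of the "nothing was freed, fall back"
idea discussed above.  This is not a proposed patch: it assumes
vnlru_free_impl() returns the number of vnodes it recycled (as the int
return type in the quoted excerpt suggests), and the wrapper name
vnlru_free_filtered is made up for illustration only.

  static int
  vnlru_free_filtered(int count, struct vfsops *mnt_op, struct vnode *mvp)
  {
          int freed;

          /* First pass: only recycle vnodes on mounts with the requested vfsops. */
          freed = vnlru_free_impl(count, mnt_op, mvp);
          if (freed == 0 && mnt_op != NULL) {
                  /*
                   * Nothing matched, e.g. the free list is dominated by
                   * nullfs vnodes stacked over ZFS.  Retry without
                   * filtering by mount type, as in the old behaviour.
                   */
                  freed = vnlru_free_impl(count, NULL, mvp);
          }
          return (freed);
  }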
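
Similarly, the "lower layer mount" hack could look roughly like the
following change to the filter quoted from vnlru_free_impl().  The
mnt_lowervfs field is hypothetical: struct mount has no such member
today, and a stacked filesystem such as nullfs would have to set it to
its lower mount at mount time.

                  /*
                   * Hypothetical variant of the quoted check: also allow
                   * recycling when the vnode belongs to a stacked mount
                   * (e.g. nullfs) whose lower mount matches the requested
                   * vfsops.  mnt_lowervfs is a made-up name for the
                   * suggested struct mount extension.
                   */
                  if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
                      mp->mnt_op != mnt_op &&
                      (mp->mnt_lowervfs == NULL ||
                      mp->mnt_lowervfs->mnt_op != mnt_op)) {
                          continue;
                  }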