Re: nullfs and ZFS issues

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Wed, 20 Apr 2022 09:43:10 UTC
On 4/19/22, Doug Ambrisko <ambrisko@ambrisko.com> wrote:
> On Tue, Apr 19, 2022 at 11:47:22AM +0200, Mateusz Guzik wrote:
> | Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff
> |
> | this is not committable but should validate whether it works fine
>
> As a POC it's working.  I see the vnode counts for nullfs and
> ZFS go up.  The ARC cache also goes up until it exceeds the ARC max
> size, then the vnodes for nullfs and ZFS go down.  The ARC cache goes
> down as well.  This all repeats over and over.  The system seems
> healthy.  No excessive running of arc_prune or arc_evict.
>
> My only comment is that the vnode freeing seems a bit aggressive,
> going from ~15,000 to ~200 vnodes for nullfs and the same for ZFS.
> The ARC drops from 70M to 7M (max is set at 64M) for this unit
> test.
>

Can you check what kind of shrinking is requested by the ARC to begin
with? I imagine encountering a nullfs vnode may end up recycling 2
vnodes instead of 1, but even repeated a lot that does not explain the
above.
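
If you have dtrace handy, a one-liner along these lines should show the
distribution of the counts the ARC passes in (assuming fbt can attach to
the symbol, i.e. it did not get inlined):

  dtrace -n 'fbt::vnlru_free_vfsops:entry { @req = quantize(arg0); }'

arg0 is the count argument, so the aggregation shows how big the
individual requests are.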

>
> | On 4/19/22, Mateusz Guzik <mjguzik@gmail.com> wrote:
> | > On 4/19/22, Mateusz Guzik <mjguzik@gmail.com> wrote:
> | >> On 4/19/22, Doug Ambrisko <ambrisko@ambrisko.com> wrote:
> | >>> I've switched my laptop to use nullfs and ZFS.  Previously, I used
> | >>> localhost NFS mounts instead of nullfs when nullfs would complain
> | >>> that it couldn't mount.  Since that check has been removed, I've
> | >>> switched to nullfs only.  However, every so often my laptop would
> | >>> get slow and the ARC evict and prune threads would consume two
> | >>> cores at 100% until I rebooted.  I had a 1G max ARC and have increased
> | >>> it to 2G now.  Looking into this has uncovered some issues:
> | >>>      -	nullfs would prevent vnlru_free_vfsops from doing anything
> | >>> 	when called from ZFS arc_prune_task
> | >>>      -	nullfs would hang onto a bunch of vnodes unless mounted with
> | >>> 	nocache
> | >>>      -	nullfs and nocache would break untar.  This has been
> | >>> 	fixed now.
> | >>>
> | >>> With nullfs, nocache and setting max vnodes to a low number I can
> | >>> keep the ARC around the max without evict and prune consuming
> | >>> 100% of 2 cores.  This doesn't seem like the best solution but it's
> | >>> better than when the ARC starts spinning.
> | >>>
> | >>> Looking into this issue with bhyve and an md drive for testing, I
> | >>> create a brand new zpool mounted as /test and then nullfs mount /test
> | >>> to /mnt.  I loop through untarring the Linux kernel into the nullfs
> | >>> mount, rm -rf it and repeat.  I set the ARC to the smallest value I
> | >>> can.  Untarring the Linux kernel was enough to get the ARC evict and
> | >>> prune to spin since they couldn't evict/prune anything.
> | >>>
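
For reference, I read the repro as roughly the following (md size,
tarball name and the ARC limit are my guesses):

  mdconfig -a -t swap -s 4g -u 0
  sysctl vfs.zfs.arc.max=67108864
  zpool create test md0          # default mountpoint is /test
  mount -t nullfs /test /mnt
  while :; do
          tar -C /mnt -xf /root/linux-5.17.tar.xz
          rm -rf /mnt/linux-5.17
  done
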
> | >>> Looking at vnlru_free_vfsops, called from ZFS arc_prune_task, I see
> | >>> it ends up in:
> | >>>   static int
> | >>>   vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
> | >>>   {
> | >>> 	...
> | >>>
> | >>>         for (;;) {
> | >>> 	...
> | >>>                 vp = TAILQ_NEXT(vp, v_vnodelist);
> | >>> 	...
> | >>>
> | >>>                 /*
> | >>>                  * Don't recycle if our vnode is from different type
> | >>>                  * of mount point.  Note that mp is type-safe, the
> | >>>                  * check does not reach unmapped address even if
> | >>>                  * vnode is reclaimed.
> | >>>                  */
> | >>>                 if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
> | >>>                     mp->mnt_op != mnt_op) {
> | >>>                         continue;
> | >>>                 }
> | >>> 	...
> | >>>
> | >>> The vp ends up being from the nullfs mount and then hits the continue
> | >>> even though the passed-in mvp is on ZFS.  If I do a hack to
> | >>> comment out the continue then I see the ARC, nullfs vnodes and
> | >>> ZFS vnodes grow.  When the ARC calls arc_prune_task, which calls
> | >>> vnlru_free_vfsops, the vnodes now go down for nullfs and ZFS.
> | >>> The ARC cache usage also goes down.  Then they increase again until
> | >>> the ARC gets full and then they go down again.  So with this hack
> | >>> I don't need nocache passed to nullfs and I don't need to limit
> | >>> the max vnodes.  Doing multiple untars in parallel over and over
> | >>> doesn't seem to cause any issues for this test.  I'm not saying
> | >>> commenting out the continue is the fix, but it's a simple POC test.
> | >>>
> | >>
> | >> I don't see an easy way to say "this is a nullfs vnode holding onto a
> | >> zfs vnode".  Perhaps the routine can be extended to issue a nullfs
> | >> callback, if the module is loaded.
> | >>
> | >> In the meantime I think a good enough(tm) fix would be to check that
> | >> nothing was freed and fall back to the good old regular cleanup without
> | >> filtering by vfsops.  This would be very similar to what you are doing
> | >> with your hack.
> | >>
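
To spell this out, I mean something along these lines (untested sketch,
hypothetical function name; it assumes vnlru_free_impl() returns the
number of vnodes it freed and that the caller takes care of the usual
vnode list locking):

  static int
  vnlru_free_filtered(int count, struct vfsops *mnt_op, struct vnode *mvp)
  {
          int freed;

          /* First pass: only recycle vnodes of the requested vfsops. */
          freed = vnlru_free_impl(count, mnt_op, mvp);
          /*
           * If the filter prevented any progress (e.g. the list is
           * dominated by nullfs vnodes stacked on zfs), retry without it.
           */
          if (freed == 0 && mnt_op != NULL)
                  freed = vnlru_free_impl(count, NULL, mvp);
          return (freed);
  }

The second pass only runs when the filtered pass made no progress, so
the common case stays unchanged.
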
> | >
> | > Now that I wrote this, perhaps an acceptable hack would be to extend
> | > struct mount with a pointer to the "lower layer" mount (if any) and patch
> | > the vfsops check to also look there.
> | >
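
Roughly like this, i.e. the existing check would also accept a vnode
whose mount is stacked on top of the requested vfsops (sketch only;
mnt_lowermp is a made-up field which nullfs would set at mount time and
everyone else would leave NULL):

  if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
      mp->mnt_op != mnt_op &&
      (mp->mnt_lowermp == NULL || mp->mnt_lowermp->mnt_op != mnt_op)) {
          continue;
  }

That keeps the filtering for unrelated filesystems while letting
nullfs-over-zfs vnodes be recycled on behalf of the ARC.
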
> | >>
> | >>> It appears that when ZFS is asking for cached vnodes to be
> | >>> free'd, nullfs also needs to free some up so that they are
> | >>> free'd at the VFS level.  It seems that vnlru_free_impl
> | >>> should allow some of the related nullfs vnodes to be free'd so
> | >>> the ZFS ones can be free'd and reduce the size of the ARC.
> | >>>
> | >>> BTW, I also hacked the kernel and mount to show the vnodes used
> | >>> per mount, i.e. mount -v:
> | >>>   test on /test (zfs, NFS exported, local, nfsv4acls,
> | >>>     fsid 2b23b2a1de21ed66, vnodes: count 13846 lazy 0)
> | >>>   /test on /mnt (nullfs, NFS exported, local, nfsv4acls,
> | >>>     fsid 11ff002929000000, vnodes: count 13846 lazy 0)
> | >>>
> | >>> Now I can easily see how the vnodes are used without going into ddb.
> | >>> On my laptop I have various vnet jails and nullfs mount my homedir
> | >>> into them, so pretty much everything goes through nullfs to ZFS.
> | >>> I'm limping along with the nullfs nocache and a small number of
> | >>> vnodes, but it would be nice to not need that.
> | >>>
> | >>> Thanks,
> | >>>
> | >>> Doug A.
> | >>>


-- 
Mateusz Guzik <mjguzik gmail.com>