Re: nullfs and ZFS issues

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Tue, 19 Apr 2022 09:47:22 UTC
Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff

This is not committable, but it should validate whether the approach works fine.

On 4/19/22, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 4/19/22, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> On 4/19/22, Doug Ambrisko <ambrisko@ambrisko.com> wrote:
>>> I've switched my laptop to use nullfs and ZFS.  Previously, I used
>>> localhost NFS mounts instead of nullfs when nullfs would complain
>>> that it couldn't mount.  Since that check has been removed, I've
>>> switched to nullfs only.  However, every so often my laptop would
>>> get slow and the ARC evict and prune threads would consume two
>>> cores at 100% until I rebooted.  I had a 1G max ARC and have
>>> increased it to 2G now.  Looking into this has uncovered some issues:
>>>      -	nullfs would prevent vnlru_free_vfsops from doing anything
>>> 	when called from ZFS arc_prune_task
>>>      -	nullfs would hang onto a bunch of vnodes unless mounted with
>>> 	nocache
>>>      -	nullfs and nocache would break untar.  This has been fixed now.
>>>
>>> With nullfs, nocache and setting max vnodes to a low number, I can
>>> keep the ARC around the max without evict and prune consuming
>>> 100% of 2 cores.  This doesn't seem like the best solution, but it
>>> is better than when the ARC starts spinning.
>>>
>>> Looking into this issue with bhyve and an md drive for testing, I
>>> create a brand new zpool mounted as /test and then nullfs mount
>>> /test to /mnt.  I loop through untarring the Linux kernel into the
>>> nullfs mount, rm -rf it and repeat.  I set the ARC to the smallest
>>> value I can.  Untarring the Linux kernel was enough to get the ARC
>>> evict and prune threads to spin since they couldn't evict/prune
>>> anything.
>>>
>>> Looking at vnlru_free_vfsops, called from the ZFS arc_prune_task, I see:
>>>   static int
>>>   vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
>>>   {
>>> 	...
>>>
>>>         for (;;) {
>>> 	...
>>>                 vp = TAILQ_NEXT(vp, v_vnodelist);
>>> 	...
>>>
>>>                 /*
>>>                  * Don't recycle if our vnode is from different type
>>>                  * of mount point.  Note that mp is type-safe, the
>>>                  * check does not reach unmapped address even if
>>>                  * vnode is reclaimed.
>>>                  */
>>>                 if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
>>>                     mp->mnt_op != mnt_op) {
>>>                         continue;
>>>                 }
>>> 	...
>>>
>>> The vp ends up being on the nullfs mount and so hits the continue
>>> even though the mnt_op passed in is ZFS's.  If I do a hack to
>>> comment out the continue, then I see the ARC, the nullfs vnodes and
>>> the ZFS vnodes grow.  When the ARC calls arc_prune_task, which calls
>>> vnlru_free_vfsops, the vnode counts now go down for both nullfs and
>>> ZFS, and the ARC cache usage goes down too.  Then they increase again
>>> until the ARC gets full, and then they go down again.  So with this
>>> hack I don't need nocache passed to nullfs and I don't need to limit
>>> the max vnodes.  Doing multiple untars in parallel over and over
>>> doesn't seem to cause any issues in this test.  I'm not saying
>>> commenting out the continue is the fix, just a simple POC test.
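>>>
>>> For reference, the hack is nothing more than neutering that filter,
>>> roughly like this (POC only, not a proposed fix):
>>>
>>>                 if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
>>>                     mp->mnt_op != mnt_op) {
>>>                         /*
>>>                          * POC: don't skip the vnode, so nullfs
>>>                          * vnodes sitting on top of ZFS can still
>>>                          * be recycled.
>>>                          */
>>>                         /* continue; */
>>>                 }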
>>>
>>
>> I don't see an easy way to say "this is a nullfs vnode holding onto a
>> zfs vnode". Perhaps the routine can be extended to issue a nullfs
>> callback, if the module is loaded.
>>
>> In the meantime I think a good enough(tm) fix would be to check that
>> nothing was freed and fall back to the good old regular cleanup without
>> filtering by vfsops. This would be very similar to what you are doing
>> with your hack.
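>>
>> Roughly like this (a sketch only, assuming vnlru_free_impl returns the
>> number of vnodes it managed to free):
>>
>>         freed = vnlru_free_impl(count, mnt_op, mvp);
>>         if (freed == 0 && mnt_op != NULL) {
>>                 /*
>>                  * Nothing could be freed for the requested
>>                  * filesystem, most likely because the free list is
>>                  * dominated by nullfs vnodes holding onto the lower
>>                  * vnodes.  Retry without the vfsops filter.
>>                  */
>>                 freed = vnlru_free_impl(count, NULL, mvp);
>>         }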
>>
>
> Now that I wrote this, perhaps an acceptable hack would be to extend
> struct mount with a pointer to the "lower layer" mount (if any) and patch
> the vfsops check to also look there.
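>
> Roughly (a sketch only; mnt_lowermp is a name I just made up):
>
>         /* in struct mount: set by nullfs at mount time, cleared on unmount */
>         struct mount    *mnt_lowermp;
>
> and the filter in vnlru_free_impl would then also accept vnodes whose
> mount is stacked on top of the target filesystem:
>
>                 if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
>                     mp->mnt_op != mnt_op &&
>                     (mp->mnt_lowermp == NULL ||
>                     mp->mnt_lowermp->mnt_op != mnt_op)) {
>                         continue;
>                 }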
>
>>
>>> It appears that when ZFS asks for cached vnodes to be
>>> freed, nullfs also needs to free some up as well so that
>>> they are freed at the VFS level.  It seems that vnlru_free_impl
>>> should allow some of the related nullfs vnodes to be freed so
>>> the ZFS ones can be freed and the size of the ARC reduced.
>>>
>>> BTW, I also hacked the kernel and mount(8) to show the vnodes used
>>> per mount, i.e. mount -v:
>>>   test on /test (zfs, NFS exported, local, nfsv4acls, fsid
>>>     2b23b2a1de21ed66, vnodes: count 13846 lazy 0)
>>>   /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid
>>>     11ff002929000000, vnodes: count 13846 lazy 0)
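>>>
>>> (For what it's worth, struct mount already tracks these numbers in its
>>> mnt_nvnodelistsize and mnt_lazyvnodelistsize fields, so the kernel
>>> side of the hack is essentially just:
>>>
>>>         printf("vnodes: count %d lazy %d",
>>>             mp->mnt_nvnodelistsize, mp->mnt_lazyvnodelistsize);
>>>
>>> plus the plumbing to get those values out to mount(8).)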
>>>
>>> Now I can easily see how the vnodes are used without going into ddb.
>>> On my laptop I have various vnet jails and nullfs mount my homedir into
>>> them, so pretty much everything goes through nullfs to ZFS.  I'm limping
>>> along with nullfs nocache and a small number of vnodes, but it would be
>>> nice to not need that.
>>>
>>> Thanks,
>>>
>>> Doug A.
>>>
>>>
>>
>>
>> --
>> Mateusz Guzik <mjguzik gmail.com>
>>
>
>
> --
> Mateusz Guzik <mjguzik gmail.com>
>


-- 
Mateusz Guzik <mjguzik gmail.com>