From nobody Wed Apr 20 16:11:03 2022
Date: Wed, 20 Apr 2022 09:11:03 -0700
From: Doug Ambrisko <ambrisko@ambrisko.com>
To: Mateusz Guzik
Cc: freebsd-current@freebsd.org
Subject: Re: nullfs and ZFS issues
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
On Wed, Apr 20, 2022 at 11:43:10AM +0200, Mateusz Guzik wrote:
| On 4/19/22, Doug Ambrisko wrote:
| > On Tue, Apr 19, 2022 at 11:47:22AM +0200, Mateusz Guzik wrote:
| > | Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff
| > |
| > | this is not committable but should validate whether it works fine
| >
| > As a POC it's working.  I see the vnode count for the nullfs and
| > ZFS go up.  The ARC cache also goes up until it exceeds the ARC max.
| > size, then the vnodes for nullfs and ZFS go down.  The ARC cache goes
| > down as well.  This all repeats over and over.  The system seems
| > healthy.  No excessive running of arc_prune or arc_evict.
| >
| > My only comment is that the vnode freeing seems a bit aggressive,
| > going from ~15,000 to ~200 vnodes for nullfs and the same for ZFS.
| > The ARC drops from 70M to 7M (max is set at 64M) for this unit
| > test.
| >
|
| Can you check what kind of shrinking is requested by arc to begin
| with? I imagine encountering a nullfs vnode may end up recycling 2
| instead of 1, but even repeated a lot it does not explain the above.

I dug into it a bit more and think there could be a bug in
module/zfs/arc.c, arc_evict_meta_balanced(uint64_t meta_used):

        prune += zfs_arc_meta_prune;
        //arc_prune_async(prune);
        arc_prune_async(zfs_arc_meta_prune);

Since arc_prune_async is queuing up a run of arc_prune_task for each
call, it is actually already accumulating the zfs_arc_meta_prune
amount.  This makes the count passed to vnlru_free_impl get really big
quickly, since it is looping via restart:

   1 HELLO arc_prune_task 164 ticks 2147465958 count 20480000

dmesg | grep arc_prune_task | uniq -c
  14 HELLO arc_prune_task 164 ticks -2147343772 count 100
  50 HELLO arc_prune_task 164 ticks -2147343771 count 100
  46 HELLO arc_prune_task 164 ticks -2147343770 count 100
  49 HELLO arc_prune_task 164 ticks -2147343769 count 100
  44 HELLO arc_prune_task 164 ticks -2147343768 count 100
 116 HELLO arc_prune_task 164 ticks -2147343767 count 100
1541 HELLO arc_prune_task 164 ticks -2147343766 count 100
  53 HELLO arc_prune_task 164 ticks -2147343101 count 100
 100 HELLO arc_prune_task 164 ticks -2147343100 count 100
  75 HELLO arc_prune_task 164 ticks -2147343099 count 100
  52 HELLO arc_prune_task 164 ticks -2147343098 count 100
  50 HELLO arc_prune_task 164 ticks -2147343097 count 100
  51 HELLO arc_prune_task 164 ticks -2147343096 count 100
 783 HELLO arc_prune_task 164 ticks -2147343095 count 100
 884 HELLO arc_prune_task 164 ticks -2147343094 count 100

Note I shrank vfs.zfs.arc.meta_prune to 100 to see how that might help.
Changing it to 1 helps even more!  I see less aggressive swings.  The
debug output comes from a

        printf("HELLO %s %d ticks %d count %ld\n",
            __FUNCTION__, __LINE__, ticks, nr_scan);

that I added to arc_prune_task.  Adjusting both

        sysctl vfs.zfs.arc.meta_adjust_restarts=1
        sysctl vfs.zfs.arc.meta_prune=100

without changing arc_prune_async(prune) helps avoid excessive swings.

Thanks,

Doug A.
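To make the arithmetic concrete, here is a minimal userland sketch of the
accumulation effect described above.  This is not the OpenZFS code; the
tunable values and restart count are made up for illustration.  Each
restart adds zfs_arc_meta_prune to the running prune total, so handing
that total to every queued prune task grows the overall request roughly
quadratically with the number of restarts, while handing each task only
zfs_arc_meta_prune keeps it linear:

  #include <stdio.h>

  /* Illustrative stand-ins for the tunables; not the real defaults. */
  #define META_PRUNE  10000       /* plays the role of zfs_arc_meta_prune */
  #define RESTARTS    2048        /* number of adjust-loop restarts */

  int
  main(void)
  {
          unsigned long long prune = 0;       /* running total kept across restarts */
          unsigned long long total_accum = 0; /* what arc_prune_async(prune) adds up to */
          unsigned long long total_fixed = 0; /* what arc_prune_async(zfs_arc_meta_prune) adds up to */

          for (int i = 0; i < RESTARTS; i++) {
                  prune += META_PRUNE;
                  total_accum += prune;       /* accumulated argument */
                  total_fixed += META_PRUNE;  /* fixed per-call argument */
          }
          printf("vnodes requested, accumulated arg: %llu\n", total_accum);
          printf("vnodes requested, fixed arg:       %llu\n", total_fixed);
          return (0);
  }

With these made-up numbers the accumulated variant requests about a
thousand times as much recycling as the fixed one, and the ratio keeps
growing with the number of restarts.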
| > | On 4/19/22, Mateusz Guzik wrote:
| > | > On 4/19/22, Mateusz Guzik wrote:
| > | >> On 4/19/22, Doug Ambrisko wrote:
| > | >>> I've switched my laptop to use nullfs and ZFS.  Previously, I used
| > | >>> localhost NFS mounts instead of nullfs when nullfs would complain
| > | >>> that it couldn't mount.  Since that check has been removed, I've
| > | >>> switched to nullfs only.  However, every so often my laptop would
| > | >>> get slow and the ARC evict and prune threads would consume two
| > | >>> cores 100% until I rebooted.  I had a 1G max. ARC and have increased
| > | >>> it to 2G now.  Looking into this has uncovered some issues:
| > | >>>   - nullfs would prevent vnlru_free_vfsops from doing anything
| > | >>>     when called from ZFS arc_prune_task
| > | >>>   - nullfs would hang onto a bunch of vnodes unless mounted with
| > | >>>     nocache
| > | >>>   - nullfs and nocache would break untar.  This has been fixed now.
| > | >>>
| > | >>> With nullfs, nocache and setting max vnodes to a low number I can
| > | >>> keep the ARC around the max. without evict and prune consuming
| > | >>> 100% of 2 cores.  This doesn't seem like the best solution, but it
| > | >>> is better than when the ARC starts spinning.
| > | >>>
| > | >>> Looking into this issue with bhyve and an md drive for testing, I
| > | >>> create a brand new zpool mounted as /test and then nullfs mount
| > | >>> /test to /mnt.  I loop through untarring the Linux kernel into the
| > | >>> nullfs mount, rm -rf it and repeat.  I set the ARC to the smallest
| > | >>> value I can.  Untarring the Linux kernel was enough to get the ARC
| > | >>> evict and prune to spin since they couldn't evict/prune anything.
| > | >>>
| > | >>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see:
| > | >>>   static int
| > | >>>   vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
| > | >>>   {
| > | >>>   ...
| > | >>>           for (;;) {
| > | >>>   ...
| > | >>>                   vp = TAILQ_NEXT(vp, v_vnodelist);
| > | >>>   ...
| > | >>>                   /*
| > | >>>                    * Don't recycle if our vnode is from different type
| > | >>>                    * of mount point.  Note that mp is type-safe, the
| > | >>>                    * check does not reach unmapped address even if
| > | >>>                    * vnode is reclaimed.
| > | >>>                    */
| > | >>>                   if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
| > | >>>                       mp->mnt_op != mnt_op) {
| > | >>>                           continue;
| > | >>>                   }
| > | >>>   ...
| > | >>>
| > | >>> The vp ends up being the nullfs mount and then hits the continue
| > | >>> even though the passed-in mvp is on ZFS.  If I do a hack to
| > | >>> comment out the continue then I see the ARC, nullfs vnodes and
| > | >>> ZFS vnodes grow.  When the ARC calls arc_prune_task, which calls
| > | >>> vnlru_free_vfsops, the vnodes now go down for nullfs and ZFS.
| > | >>> The ARC cache usage also goes down.  Then they increase again until
| > | >>> the ARC gets full and then they go down again.  So with this hack
| > | >>> I don't need nocache passed to nullfs and I don't need to limit
| > | >>> the max vnodes.  Doing multiple untars in parallel over and over
| > | >>> doesn't seem to cause any issues for this test.  I'm not saying
| > | >>> commenting out the continue is the fix, just a simple POC test.
| > | >>>
| > | >>
| > | >> I don't see an easy way to say "this is a nullfs vnode holding onto a
| > | >> zfs vnode".  Perhaps the routine can be extended with issuing a nullfs
| > | >> callback, if the module is loaded.
| > | >>
| > | >> In the meantime I think a good enough(tm) fix would be to check that
| > | >> nothing was freed and fallback to good old regular clean up without
| > | >> filtering by vfsops.  This would be very similar to what you are doing
| > | >> with your hack.
| > | >>
| > | >
| > | > Now that I wrote this perhaps an acceptable hack would be to extend
| > | > struct mount with a pointer to "lower layer" mount (if any) and patch
| > | > the vfsops check to also look there.
| > | >
| > | >>
| > | >>> It appears that when ZFS is asking for cached vnodes to be
| > | >>> free'd nullfs also needs to free some up as well so that
| > | >>> they are free'd on the VFS level.  It seems that vnlru_free_impl
| > | >>> should allow some of the related nullfs vnodes to be free'd so
| > | >>> the ZFS ones can be free'd and reduce the size of the ARC.
| > | >>>
| > | >>> BTW, I also hacked the kernel and mount to show the vnodes used
| > | >>> per mount ie. mount -v:
| > | >>>   test on /test (zfs, NFS exported, local, nfsv4acls, fsid
| > | >>>     2b23b2a1de21ed66, vnodes: count 13846 lazy 0)
| > | >>>   /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid
| > | >>>     11ff002929000000, vnodes: count 13846 lazy 0)
| > | >>>
| > | >>> Now I can easily see how the vnodes are used without going into ddb.
| > | >>> On my laptop I have various vnet jails and nullfs mount my homedir
| > | >>> into them so pretty much everything goes through nullfs to ZFS.  I'm
| > | >>> limping along with the nullfs nocache and small number of vnodes but
| > | >>> it would be nice to not need that.
| > | >>>
| > | >>> Thanks,
| > | >>>
| > | >>> Doug A.
| > | >>>
| > | >>
| > | >> --
| > | >> Mateusz Guzik
| > | >>
| > | >
| > | > --
| > | > Mateusz Guzik
| > | >
| > |
| > | --
| > | Mateusz Guzik
| > |
|
| --
| Mateusz Guzik
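
For completeness, here is a rough userland sketch of the "nothing was
freed, so retry without the vfsops filter" idea from the quoted
discussion above.  It is not the real vnlru_free_impl(); the structures,
fields and helpers are simplified stand-ins invented just to show the
shape of the fallback:

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  struct vfsops;                          /* opaque tag, as in the kernel */

  struct vnode {                          /* simplified stand-in, not the real struct vnode */
          const struct vfsops *v_ops;     /* stands in for vp->v_mount->mnt_op */
          struct vnode *v_next;           /* stands in for the global vnode list */
          bool v_recycled;
  };

  /* Try to recycle up to "count" vnodes, optionally only those matching mnt_op. */
  static int
  vnlru_free_sketch(struct vnode *list, int count, const struct vfsops *mnt_op)
  {
          int freed = 0;

          for (struct vnode *vp = list; vp != NULL && count > 0; vp = vp->v_next) {
                  /*
                   * Skip vnodes from a different filesystem type; this is the
                   * filter that starves ZFS when the list is dominated by
                   * nullfs vnodes sitting on top of it.
                   */
                  if (mnt_op != NULL && vp->v_ops != mnt_op)
                          continue;
                  if (vp->v_recycled)
                          continue;
                  vp->v_recycled = true;  /* pretend we recycled it */
                  freed++;
                  count--;
          }
          return (freed);
  }

  /* Filtered attempt first; unfiltered fallback if nothing could be freed. */
  static int
  vnlru_free_with_fallback(struct vnode *list, int count, const struct vfsops *mnt_op)
  {
          int freed = vnlru_free_sketch(list, count, mnt_op);

          if (freed == 0 && mnt_op != NULL)
                  freed = vnlru_free_sketch(list, count, NULL);
          return (freed);
  }

  int
  main(void)
  {
          struct vfsops *zfs_ops = (struct vfsops *)0x1;  /* fake identities */
          struct vfsops *null_ops = (struct vfsops *)0x2;
          struct vnode a = { null_ops, NULL, false };
          struct vnode b = { null_ops, &a, false };       /* list: b -> a */

          /*
           * "ZFS" asks for 2 vnodes but the list holds only nullfs vnodes:
           * the filtered pass frees nothing, the fallback frees both.
           */
          printf("freed %d\n", vnlru_free_with_fallback(&b, 2, zfs_ops));
          return (0);
  }

The alternative mentioned above, giving struct mount a pointer to its
lower-layer mount, would instead let the filtered pass treat a nullfs
vnode stacked on the requesting ZFS mount as a match.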