Re: git: 6468cd8e0ef9 - main - mount: add vnode usage per file system with mount -v

From: Cy Schubert <Cy.Schubert_at_cschubert.com>
Date: Wed, 15 Jun 2022 03:08:33 UTC
In message <CAGudoHHbf3PTt_TXnKAtt5n_WOYpvA0dCb9HAiDLAL+yR3VCFQ@mail.gmail.c
om>
, Mateusz Guzik writes:
> On 6/13/22, Doug Ambrisko <ambrisko@ambrisko.com> wrote:
> > On Mon, Jun 13, 2022 at 06:43:31PM +0200, Mateusz Guzik wrote:
> > | On 6/13/22, Doug Ambrisko <ambrisko@freebsd.org> wrote:
> > | > The branch main has been updated by ambrisko:
> > | >
> > | > URL:
> > | >
> > https://cgit.FreeBSD.org/src/commit/?id=6468cd8e0ef9d1d3331e9de26cd2be59bc7
> 78494
> > | >
> > | > commit 6468cd8e0ef9d1d3331e9de26cd2be59bc778494
> > | > Author:     Doug Ambrisko <ambrisko@FreeBSD.org>
> > | > AuthorDate: 2022-06-13 14:56:38 +0000
> > | > Commit:     Doug Ambrisko <ambrisko@FreeBSD.org>
> > | > CommitDate: 2022-06-13 14:56:38 +0000
> > | >
> > | >     mount: add vnode usage per file system with mount -v
> > | >
> > | >     This avoids the need to drop into the ddb to figure out vnode
> > | >     usage per file system.  It helps to see if they are or are not
> > | >     being freed.  Suggestion to report active vnode count was from
> > | >     kib@
> > | >
> > | >     Reviewed by:    kib
> > | >     Differential Revision: https://reviews.freebsd.org/D35436
> > | > ---
> > | >  sbin/mount/mount.c   |  7 +++++++
> > | >  sys/kern/vfs_mount.c | 12 ++++++++++++
> > | >  sys/sys/mount.h      |  4 +++-
> > | >  3 files changed, 22 insertions(+), 1 deletion(-)
> > | >
> > | > diff --git a/sbin/mount/mount.c b/sbin/mount/mount.c
> > | > index 79d9d6cb0caf..bd3d0073c474 100644
> > | > --- a/sbin/mount/mount.c
> > | > +++ b/sbin/mount/mount.c
> > | > @@ -692,6 +692,13 @@ prmount(struct statfs *sfp)
> > | >  			xo_emit("{D:, }{Lw:fsid}{:fsid}", fsidbuf);
> > | >  			free(fsidbuf);
> > | >  		}
> > | > +		if (sfp->f_nvnodelistsize != 0 || sfp->f_avnodecount !=
>  0) {
> > | > +			xo_open_container("vnodes");
> > | > +			xo_emit("{D:,
> > | > }{Lwc:vnodes}{Lw:count}{w:count/%ju}{Lw:active}{:active/%ju}",
> > | > +			    (uintmax_t)sfp->f_nvnodelistsize,
> > | > +			    (uintmax_t)sfp->f_avnodecount);
> > | > +			xo_close_container("vnodes");
> > | > +		}
> > | >  	}
> > | >  	xo_emit("{D:)}\n");
> > | >  }
> > | > diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
> > | > index 71a40fd97a9c..e3818b67e841 100644
> > | > --- a/sys/kern/vfs_mount.c
> > | > +++ b/sys/kern/vfs_mount.c
> > | > @@ -2610,6 +2610,8 @@ vfs_copyopt(struct vfsoptlist *opts, const char
> > *name,
> > | > void *dest, int len)
> > | >  int
> > | >  __vfs_statfs(struct mount *mp, struct statfs *sbp)
> > | >  {
> > | > +	struct vnode *vp;
> > | > +	uint32_t count;
> > | >
> > | >  	/*
> > | >  	 * Filesystems only fill in part of the structure for updates, 
> we
> > | > @@ -2624,6 +2626,16 @@ __vfs_statfs(struct mount *mp, struct statfs
> > *sbp)
> > | >  	sbp->f_version = STATFS_VERSION;
> > | >  	sbp->f_namemax = NAME_MAX;
> > | >  	sbp->f_flags = mp->mnt_flag & MNT_VISFLAGMASK;
> > | > +	sbp->f_nvnodelistsize = mp->mnt_nvnodelistsize;
> > | > +
> > | > +	count = 0;
> > | > +	MNT_ILOCK(mp);
> > | > +	TAILQ_FOREACH(vp, &mp->mnt_nvnodelist, v_nmntvnodes) {
> > | > +		if (vrefcnt(vp) > 0) /* racy but does not matter */
> > | > +			count++;
> > | > +	}
> > | > +	MNT_IUNLOCK(mp);
> > | > +	sbp->f_avnodecount = count;
> > | >
> > |
> > | libc uses statfs for dir walk (see gen/fts.c), most notably find
> > | immediately runs into it. As such the linear scan by default is a
> > | non-starter.
> > |
> > | I don't know if mount is the right place to dump this kind of info to
> > | begin with, but even so, it should only happen with a dedicated flag.
> > |
> > | As statfs does not take any flags on its own, there is no way to
> > | prevent it from doing the above walk. Perhaps a dedicated sysctl which
> > | takes mount point id could do the walk instead, when asked.
> > |
> > | Short of making the walk optional I'm afraid this will have to be
> > reverted.
> >
> > Just to be clear, this isn't breaking things but is not optimal for
> > things that don't need this extra info.
> >
>
> It's not "not optimal", it's a significant overhead which taxes
> frequent users which don't benefit from it.

Indeed this is not optimal. Since this revision NFSv4 performance has 
tanked. An installworld over NFSv4 which used to take approximately 15 
minutes took all night, not even finishing by morning. The NFS server, a 4 
core machine, had a load average of 18 with three NFS clients attempting 
installworld with nfsd using over 90% of the cycles on the machine. I was 
able to reproduce the problem by running a series of tar cf /dev/null 
/usr/obj in parallel, on a single NFSv4 client to essentially DoS the NFSv4 
server.

The workaround was to fall back to NFSv3, which was unaffected by this 
revision.

I reached out to our resident NFS person (rmacklem@) who suggested 
reverting this revision, restoring the network to pre-regression state.

>
> For more data I plugged dtrace -n 'fbt::__vfs_statfs:entry {
> @[execname] = count(); }' while package building, then i got tons of
> hits:
> [snip]
>   expr                                                          13992
>   install                                                       14090
>   dirname                                                       14921
>   mv                                                            17404
>   ghc-stage1                                                    17577
>   grep                                                          18998
>   xgcc                                                          23832
>   cpp                                                           29282
>   cc1                                                           36961
>   sh                                                            70575
>   rm                                                            73904
>   ld.lld                                                        87784
>   sed                                                           88803
>   c++                                                           98175
>   cat                                                          115811
>   cc                                                           449725
>
[...]
> -- 
> Mateusz Guzik <mjguzik gmail.com>
>


-- 
Cheers,
Cy Schubert <Cy.Schubert@cschubert.com>
FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  http://www.FreeBSD.org
NTP:           <cy@nwtime.org>    Web:  https://nwtime.org

			e**(i*pi)+1=0