Re: nfs client's OpenOwner count increases without bounds

From: Rick Macklem <rmacklem_at_uoguelph.ca>
Date: Thu, 05 May 2022 00:56:05 UTC
Alan Somers <asomers@freebsd.org> wrote:
> On Wed, May 4, 2022 at 5:23 PM Rick Macklem <rmacklem@uoguelph.ca> wrote:
> >
> > Alan Somers <asomers@freebsd.org> wrote:
> > > I have a FreeBSD 13 (tested on both 13.0-RELEASE and 13.1-RC5) desktop
> > > mounting /usr/home over NFS 4.2 from a 13.0-RELEASE server.  It
> > > worked fine until a few weeks ago.  Now, the desktop's performance
> > > slowly degrades.  It becomes less and less responsive until I restart
> > > X after 2-3 days.  /var/log/Xorg.0.log shows plenty of entries like
> > > "AT keyboard: client bug: event processing lagging behind by 112ms,
> > > your system is too slow".  "top -S" shows that the busiest process is
> > > nfscl.  A dtrace profile shows that nfscl is spending most of its time
> > > in nfscl_cleanup_common, in the loop over all nfsclowner objects.
> > > Running "nfsdumpstate" on the server shows thousands of OpenOwners for
> > > that client, and < 10 for any other NFS client.  The OpenOwners
> > > increases by about 3000 per day.  And yet, "fstat" shows only a couple
> > > hundred open files on the NFS file system.  Why are OpenOwners so
> > > high?  Killing most of my desktop processes doesn't seem to make a
> > > difference.  Restarting X does improve the perceived responsiveness,
> > > though it does not change the number of OpenOwners.
> > >
> > > How can I figure out which process(es) are responsible for the
> > > excessive OpenOwners?
> > An OpenOwner represents a process on the client. The OpenOwner
> > name is an encoding of pid + process startup time.
> > However, I can't think of an easy way to get at the OpenOwner name.
> >
> > Now, why aren't they going away, hmm..
> >
> > I'm assuming the # of Opens is not large?
> > (Openowners cannot go away until all associated opens
> >  are closed.)
> 
> Oh, I didn't mention that yes the number of Opens is large.  Right
> now, for example, I have 7950 OpenOwner and 8277 Open.
Well, the openowners cannot go away until the opens go away,
so the problem is that the opens are not getting closed.

Close happens when the v_usecount on the vnode goes to zero.
Something is retaining the v_usecount. One possibility is that most
of the opens are for the same file, but with different openowners.
If that is the case, the "oneopenown" mount option will deal with it.

Another possibility is that something is retaining a v_usecount
reference on a lot of the vnodes. (This used to happen when a nullfs
mount with caching enabled was on top of the nfs mount.)
I don't know what other things might do that?

> >
> > Commit 1cedb4ea1a79 in main changed the semantics of this
> > a little, to avoid a use-after-free bug. However, it is dated
> > Feb. 25, 2022 and is not in 13.0, so I don't think it could
> > be the culprit.
> >
> > Essentially, the function called nfscl_cleanupkext() should call
> > nfscl_procdoesntexist(), which returns true after the process has
> > exited and when that is the case, calls nfscl_cleanup_common().
> > --> nfscl_cleanup_common() will either get rid of the openowner or,
> >       if there are still children with open file descriptors, mark it "defunct"
> >       so it can be free'd once the children close the file.
> >
> > It could be that X is now somehow creating a long chain of processes
> > where the children inherit a file descriptor and that delays the cleanup
> > indefinitely?
> > Even then, everything should get cleaned up once you kill off X?
> > (It might take a couple of seconds after killing all the processes off.)
> >
> > Another possibility is that the "nfscl" thread is wedged somehow.
> > It is the one that will call nfscl_cleanupkext() once/sec. If it never
> > gets called, the openowners will never go away.
> >
> > Being old fashioned, I'd probably try to figure this out by adding
> > some printf()s to nfscl_cleanupkext() and nfscl_cleanup_common().
> 
> dtrace shows that nfscl_cleanupkext() is getting called at about 0.6 Hz.
That sounds ok. Since there are a lot of opens/openowners, it probably
is getting behind.

> >
> > To avoid the problem, you can probably just use the "oneopenown"
> > mount option. With that option, only one openowner is used for
> > all opens. (Having separate openowners for each process was needed
> > for NFSv4.0, but not NFSv4.1/4.2.)
> >
> > > Or is it just a red herring and I shouldn't
> > > worry?
> > Well, you can probably avoid the problem by using the "oneopenown"
> > mount option.
> 
> Ok, I'm trying that now.  After unmounting and remounting NFS,
> "nfsstat -cE" reports 1 OpenOwner and 11 Opens.  But on the server,
> "nfsdumpstate" still reports thousands.  Will those go away
> eventually?
If the opens are gone then, yes, they will go away. They are retained for
a little while so that another Open against the openowner does not need
to recreate the openowner (which also implied an extra RPC to confirm
the openowner in NFSv4.0).

I think they go away after a few minutes, if I recall correctly.
If the server thinks there are still Opens, then they will not go away.

rick

>
> Thanks for reporting this, rick
> ps: And, yes, large numbers of openowners will slow things down,
>       since the code ends up doing linear scans of them all in a linked
>       list in various places.
>
> -Alan
>