Re: nfs client's OpenOwner count increases without bounds

From: Rick Macklem <rmacklem_at_uoguelph.ca>
Date: Wed, 04 May 2022 23:23:00 UTC
Alan Somers <asomers@freebsd.org> wrote:
> I have a FreeBSD 13 (tested on both 13.0-RELEASE and 13.1-RC5) desktop
> mounting /usr/home over NFS 4.2 from a 13.0-RELEASE server.  It
> worked fine until a few weeks ago.  Now, the desktop's performance
> slowly degrades.  It becomes less and less responsive until I restart
> X after 2-3 days.  /var/log/Xorg.0.log shows plenty of entries like
> "AT keyboard: client bug: event processing lagging behind by 112ms,
> your system is too slow".  "top -S" shows that the busiest process is
> nfscl.  A dtrace profile shows that nfscl is spending most of its time
> in nfscl_cleanup_common, in the loop over all nfsclowner objects.
> Running "nfsdumpstate" on the server shows thousands of OpenOwners for
> that client, and < 10 for any other NFS client.  The OpenOwner count
> increases by about 3000 per day.  And yet, "fstat" shows only a couple
> hundred open files on the NFS file system.  Why are OpenOwners so
> high?  Killing most of my desktop processes doesn't seem to make a
> difference.  Restarting X does improve the perceived responsiveness,
> though it does not change the number of OpenOwners.
>
> How can I figure out which process(es) are responsible for the
> excessive OpenOwners?  
An OpenOwner represents a process on the client. The OpenOwner
name is an encoding of pid + process startup time.
However, I can't think of an easy way to get at the OpenOwner name.

Now, why aren't they going away, hmm..

I'm assuming the # of Opens is not large?
(Openowners cannot go away until all associated opens
 are closed.)

Commit 1cedb4ea1a79 in main changed the semantics of this
a little, to avoid a use-after-free bug. However, it is dated
Feb. 25, 2022 and is not in 13.0, so I don't think it could
be the culprit.

Essentially, nfscl_cleanupkext() should call nfscl_procdoesntexist(),
which returns true once the process has exited. When that is the case,
nfscl_cleanupkext() then calls nfscl_cleanup_common().
--> nfscl_cleanup_common() will either get rid of the openowner or,
      if there are still children with open file descriptors, mark it "defunct"
      so it can be free'd once the children close the file.
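
To make that flow concrete, here is a rough userland model of one cleanup
pass. It is only a sketch, not the kernel code: struct owner, add_owner(),
proc_doesnt_exist(), cleanup_pass() and count_owners() are all made-up names,
and a plain array of live pids stands in for the real process lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a per-process openowner. */
struct owner {
	int pid;		/* process this openowner represents */
	int nopens;		/* opens still held, possibly by children */
	bool defunct;		/* process exited but opens remain */
	struct owner *next;
};

/* Push a new owner onto the front of the list and return it. */
struct owner *
add_owner(struct owner **head, int pid, int nopens)
{
	struct owner *o = malloc(sizeof(*o));

	if (o == NULL)
		abort();
	o->pid = pid;
	o->nopens = nopens;
	o->defunct = false;
	o->next = *head;
	*head = o;
	return (o);
}

/* Stand-in for nfscl_procdoesntexist(): true once the pid is gone. */
bool
proc_doesnt_exist(int pid, const int *live, int nlive)
{
	for (int i = 0; i < nlive; i++)
		if (live[i] == pid)
			return (false);
	return (true);
}

/*
 * One pass of the periodic cleanup.  An owner whose process has exited
 * is freed if it has no opens left, otherwise it is marked defunct and
 * revisited on a later pass.  Returns the number of owners freed.
 */
int
cleanup_pass(struct owner **head, const int *live, int nlive)
{
	struct owner **pp = head;
	int freed = 0;

	assert(head != NULL);
	while (*pp != NULL) {
		struct owner *o = *pp;

		if (proc_doesnt_exist(o->pid, live, nlive)) {
			if (o->nopens == 0) {
				*pp = o->next;	/* unlink, then free */
				free(o);
				freed++;
				continue;
			}
			o->defunct = true;
		}
		pp = &o->next;
	}
	return (freed);
}

/* Linear walk, the same kind of scan that gets slow with many owners. */
int
count_owners(const struct owner *head)
{
	int n = 0;

	for (; head != NULL; head = head->next)
		n++;
	return (n);
}
```

In this model, an owner whose pid has exited but which still has opens
stays on the list as defunct, and only goes away on a later pass once its
open count drops to zero. If the pass never runs, nothing is ever freed.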

It could be that X is now somehow creating a long chain of processes
where the children inherit a file descriptor and that delays the cleanup
indefinitely?
Even then, everything should get cleaned up once you kill off X?
(It might take a couple of seconds after killing all the processes off.)

Another possibility is that the "nfscl" thread is wedged somehow.
It is the one that will call nfscl_cleanupkext() once/sec. If it never
gets called, the openowners will never go away.

Being old fashioned, I'd probably try to figure this out by adding
some printf()s to nfscl_cleanupkext() and nfscl_cleanup_common().

To avoid the problem, you can probably just use the "oneopenown"
mount option. With that option, only one openowner is used for
all opens. (Having separate openowners for each process was needed
for NFSv4.0, but not NFSv4.1/4.2.)

> Or is it just a red herring and I shouldn't
> worry?
Well, you can probably avoid the problem by using the "oneopenown"
mount option.

Thanks for reporting this, rick
ps: And, yes, large numbers of openowners will slow things down,
      since the code ends up doing linear scans of them all in a linked
      list in various places.

> -Alan