Major issues with nfsv4

Rick Macklem rmacklem at uoguelph.ca
Thu Jan 14 22:30:50 UTC 2021


J David wrote:
>On Wed, Dec 16, 2020 at 11:25 PM Rick Macklem <rmacklem at uoguelph.ca> wrote:
>> If you can do so when the "Opens" count has gone fairly high,
>> please "sysctl vfs.deferred_inact" and let us know what that
>> returns.
>
>$ sysctl vfs.deferred_inact
>sysctl: unknown oid 'vfs.deferred_inact'
>$ sysctl -a vfs | fgrep defer
>$
Yes. I did not realize how different FreeBSD12 is from FreeBSD13/head in
this area.
At a quick glance, I do not see where the syncer tries to vinactive() vnodes
for which the VOP_INACTIVE() call has been deferred.

--> It is possible that this problem is fixed in FreeBSD13/head.
       Any chance you can test a FreeBSD13/head system?

Kostik, does FreeBSD12 try to vinactive() deferred VOP_INACTIVE() vnodes via the
syncer?

>Sorry for the delay in responding to this.  I got my knuckles rapped
>for allowing this to happen so much.
>
>It happened just now because some of the "use NFSv4.1" config leaked
>out to a production machine, but not all of it. As a result, only the
>read-only "job binary" filesystems were mounted with nullfs+nfsv4.1.
>So it is unlikely to be related to creating files. Hopefully, that
>narrows things down.
>
>$ sudo nfsstat -E -c
>[...]
>  OpenOwner    Opens  LockOwner    Locks   Delegs  LocalOwn
>      37473   303469          0        0        1         0
>[...]
>
>"nfscl: never fnd open" continues to appear regularly on
>console/dmesg, even at the end of the reboot:
Not sure what this implies. The message means that the client cannot find
an NFSv4 Open to Close.
It may indicate something is broken in the client, but it is not, by itself,
serious.
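If it would help to gauge how often the message occurs, a rough check
(just a sketch; the buffer may have rolled over) is to count the
occurrences still in the kernel message buffer:

$ dmesg | grep -c 'never fnd open'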

>Waiting (max 60 seconds) for system thread `bufspacedaemon-2' to stop... done
>Waiting (max 60 seconds) for system thread `bufspacedaemon-5' to stop... done
>Waiting (max 60 seconds) for system thread `bufspacedaemon-1' to stop... done
>Waiting (max 60 seconds) for system thread `bufspacedaemon-6' to stop... done
>All buffers synced.
>nfscl: never fnd open
>nfscl: never fnd open
>nfscl: never fnd open
>nfscl: never fnd open
>nfscl: never fnd open
>nfscl: never fnd open
>Uptime: 4d13h59m27s
>Rebooting...
>cpu_reset: Stopping other CPUs
>---<<BOOT>>---
>
>It did not appear 300,000 times, though.  More like a few times a day.
>
>Also, I set up an idle system with the NFSv4.1+nullfs config, as
>requested. It has been up for 32 days and appears not to have leaked
>anything. But it does also have a fistful of those "nfscl: never fnd
>open" messages.
>
>There is also a third system in a test environment with the
>nullfs+nfsv4.1 config. That system is up 34 days, has no exhibited
>problems, and shows this:
>
>  OpenOwner    Opens  LockOwner    Locks   Delegs  LocalOwn
>        342    15098          2        0        0         0
>
>That machine shows one "nfscl: never fnd open" in the dmesg.
>
>A fourth system has the NFSv4.1-no-nullfs config in production with
>net.inet.ip.portrange.lowlast tweaked and a limit on simultaneous
>jobs.  That system had issues requiring a restart 18 days ago. It also
>occasionally gets "nfscl: never fnd open" in the dmesg and has
>relatively large Open numbers:
>
>As of right now:
>  OpenOwner    Opens  LockOwner    Locks   Delegs  LocalOwn
>      23214    46304          0        0        0         0
>
>The "OpenOwner" value on that system seems to swing dramatically,
>ranging between 45,000 to 10,000 in just a few minutes. It appears to
>correlate well to simultaneous jobs.
This sounds normal, since an OpenOwner refers to a process on the client
doing a file Open.
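Since each OpenOwner corresponds to a client process doing Opens, one
rough sanity check (only a sketch) is to compare the count against the
number of processes on the client:

$ nfsstat -E -c | grep -A 1 OpenOwner
$ ps -ax | wc -l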

> The "Opens" value goes up and
>down a bit, but trends upward over time. However, when I found and
>killed one long-running job and unmounted its filesystems, "Opens"
>dropped 90% to around 4600. Note there are *no* nullfs mounts on that
>system.  So nullfs may not be a necessary component of the problem.
This also sounds reasonable. The NFSv4 Opens can only be closed once
the process doing the Open, plus all of its child processes, have closed
the file.
--> If a program is "lazy" and doesn't do closes, they won't
       happen until the process exits. And then any child processes
       will also need to exit before it leaves the zombie state.
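As a simple illustration (the mount point and file name here are made
up), a process that holds a file descriptor open on an NFSv4 mount pins
the Open until it exits:

$ ( exec 3</mnt/nfs4/somefile; sleep 300 ) &
$ nfsstat -E -c | grep -A 1 OpenOwner

The "Opens" count stays elevated until the background subshell exits.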

One thing to try (other than a FreeBSD13/head system, if possible)
is the "oneopenown" mount option; see the example below.
--> It can only be used on NFSv4.1 mounts (not NFSv4.0) and
       makes the mount use only one OpenOwner for all Opens,
       instead of a different one for each process doing an Open.
       --> This would reduce the number of Opens for the case
              where multiple processes open the same file.
       --> It also simplifies the search for an Open, since there
              is only one for each file.
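For example (the server name and paths here are placeholders), such a
mount would look like:

# mount -t nfs -o nfsv4,minorversion=1,oneopenown nfs-server:/export /mnt

or, as an /etc/fstab entry:

nfs-server:/export  /mnt  nfs  rw,nfsv4,minorversion=1,oneopenown  0  0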

rick

>As a next step, I will try to create a fake job that opens a ton of
>files.  Then I'll test it on the binary read-only nullfs+nfsv4.1
>mounts and on the system that runs nfsv4.1 directly.
>
>Thanks!

