Major issues with nfsv4

Rick Macklem rmacklem at uoguelph.ca
Fri Dec 11 23:28:32 UTC 2020


J David wrote:
>Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not
>resolve our issue.  But I've narrowed down the problem to a harmful
>interaction between NFSv4 and nullfs.
I am afraid I know nothing about nullfs and jails. I suspect it will be
something related to when file descriptors in the NFS client mount
get closed.

The NFSv4 Open is a Windows Open lock and has nothing to do with
a POSIX open. Since only one of these can exist for each
<client process, file> tuple, the NFSv4 Close must be delayed until
all POSIX Opens on the file have been closed, including open file
descriptors inherited by child processes.
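
To make the descriptor-inheritance point concrete, here is a minimal
shell sketch (the path is made up); the client cannot issue the NFSv4
Close until the background child, which inherited descriptor 3, exits:

    # open fd 3 on a file inside an NFSv4 mount (hypothetical path)
    exec 3< /mnt/nfs4/somefile
    # a background child inherits fd 3
    sleep 60 &
    # closing fd 3 in the parent is not enough; the NFSv4 Close is
    # deferred until the child exits and its copy of the fd goes away
    exec 3<&-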

Someone else recently reported problems using nullfs and vnet jails.

>These FreeBSD NFS clients form a pool of application servers that run
>jobs for the application.  A given job needs read-write access to its
>data and read-only access to the set of binaries it needs to run.
>
>The job data is horizontally partitioned across a set of directory
>trees spread over one set of NFS servers.  A separate set of NFS
>servers store the read-only binary roots.
>
>The jobs are assigned to these machines by a scheduler.  A job might
>take five milliseconds or five days.
>
>Historically, we have mounted the job data trees and the various
>binary roots on each application server over NFSv3.  When a job
>starts, its setup binds the needed data and binaries into a jail via
>nullfs, then runs the job in the jail.  This approach has worked
>perfectly for 10+ years.
Well, NFSv3 is not going away any time soon, so if you don't need
any of the additional features NFSv4 offers...

>After I switched a server to NFSv4.1 to test that recommendation, it
>started having the same load problems as NFSv4.  As a test, I altered
>it to mount NFS directly in the jails for both the data and the
>binaries.  As "nullfs-NFS" jobs finished and "direct NFS" jobs
>started, the load and CPU usage started to fall dramatically.
Good work isolating the problem. I may try playing with NFSv4/nullfs
someday soon and see if I can break it.
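
If anyone wants to poke at it before then, a rough repro along the
lines described above might look like this (the server, export and
jail paths are all hypothetical, and the minorversion option is from
memory, so check mount_nfs(8)):

    # NFSv4.1 mount on the host
    mount -t nfs -o nfsv4,minorversion=1 nfssrv:/export/jobdata /nfs/jobdata
    # nullfs-bind one job's subtree into the jail's root, read-write
    mount -t nullfs /nfs/jobdata/job123 /jails/worker/root/data
    # ... run the job inside the jail, then tear down in reverse order
    umount /jails/worker/root/data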

>The critical problem with this approach is that privileged TCP ports
>are a finite resource.  At two per job, this creates two issues.
>
>First, there's a hard limit on simultaneous jobs per server that is
>inconsistent with the hardware's capabilities.  Second, due to
>TIME_WAIT, it places a hard limit on job throughput.  In practice,
>these limits also interfere with each other; the more long jobs are
>running simultaneously, the more impact TIME_WAIT has on short-job
>throughput.
>
>While it's certainly possible to configure NFS not to require reserved
>ports, the slightest possibility of a non-root user establishing a
>session to the NFS server kills that as an option.
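
For rough scale, assuming the client draws its source ports from the
512-1023 range mentioned further down and the default 60 second
TIME_WAIT:

    # ~512 reserved ports available, two per job:
    echo $(( 512 / 2 ))      # at most ~256 simultaneous jobs
    # each port sits in TIME_WAIT for 2*MSL (60 s by default):
    echo $(( 512 / 60 ))     # ~8 new connections/s, i.e. ~4 short jobs/s
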
Personally, I've never thought the reserved port# requirement provided
any real security for most situations. Unless you set "vfs.usermount=1",
only root can do the mount. For non-root to mount the NFS server
when "vfs.usermount=0", a user would have to run their own custom hacked
userland NFS client. Although doable, I have never heard of it being done.
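
For what it's worth (and from memory, so double-check on your version),
the server-side check on a FreeBSD NFS server is a sysctl, I believe
vfs.nfsd.nfs_privport:

    # on the server: 1 = require client source ports < 1024,
    # 0 = accept requests from any port (I believe 0 is the default)
    sysctl vfs.nfsd.nfs_privport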

rick

>Turning down TIME_WAIT helps, though the ability to do that only on
>the interface facing the NFS server would be more palatable than doing
>it globally.
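
For reference, the global knob is the 2*MSL timer; something like the
following halves TIME_WAIT from its 60 s default, but as noted above it
applies to every TCP connection on the box, not just the ones facing
the NFS server:

    # MSL in milliseconds; TIME_WAIT lasts 2*MSL, so the default
    # 30000 gives 60 s and 15000 gives 30 s
    sysctl net.inet.tcp.msl=15000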

>Adjusting net.inet.ip.portrange.lowlast does not seem to help.  The
>code at sys/nfs/krpc_subr.c correctly uses ports between
>IPPORT_RESERVED and IPPORT_RESERVED/2 instead of ipport_lowfirstauto
>and ipport_lowlastauto.  But is that the correct place to look for
>NFSv4.1?
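
A quick way to watch what is actually being consumed on a client while
jobs churn (source ports below 1024 are the reserved ones at issue):

    # client-side connections stuck in TIME_WAIT
    netstat -an -p tcp | grep TIME_WAIT | wc -l
    # and the low-port range sysctls referred to above
    sysctl net.inet.ip.portrange.lowfirst net.inet.ip.portrange.lowlast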

>How explosive would adding SO_REUSEADDR to the NFS client be?  It's
>not a full solution, but it would handle the TIME_WAIT side of the
>issue.

>Even so, there may be no workaround for the simultaneous mount limit
>as long as reserved ports are required.  Solving the negative
>interaction with nullfs seems like the only long-term fix.

>What would be a good next step there?

>Thanks!

