Major issues with nfsv4

Fri Dec 11 23:08:24 UTC 2020

On Fri, Dec 11, 2020 at 2:52 PM J David <j.david.lists at gmail.com> wrote:

> Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not
> resolve our issue.  But I've narrowed down the problem to a harmful
> interaction between NFSv4 and nullfs.
>
> These FreeBSD NFS clients form a pool of application servers that run
> jobs for the application.  A given job needs read-write access to its
> data and read-only access to the set of binaries it needs to run.
>
> The job data is horizontally partitioned across a set of directory
> trees spread over one set of NFS servers.  A separate set of NFS
> servers store the read-only binary roots.
>
> The jobs are assigned to these machines by a scheduler.  A job might
> take five milliseconds or five days.
>
> Historically, we have mounted the job data trees and the various
> binary roots on each application server over NFSv3.  When a job
> starts, its setup binds the needed data and binaries into a jail via
> nullfs, then runs the job in the jail.  This approach has worked
> perfectly for 10+ years.
>
> After I switched a server to NFSv4.1 to test that recommendation, it
> started having the same load problems as NFSv4.  As a test, I altered
> it to mount NFS directly in the jails for both the data and the
> binaries.  As "nullfs-NFS" jobs finished and "direct NFS" jobs
> started, the load and CPU usage started to fall dramatically.
>
> The critical problem with this approach is that privileged TCP ports
> are a finite resource.  At two per job, this creates two issues.
>
> First, there's a hard limit on simultaneous jobs per server that is
> inconsistent with the hardware's capabilities.  Second, due to
> TIME_WAIT, it places a hard limit on job throughput.  In practice,
> these limits also interfere with each other; the more simultaneous
> long jobs are running, the more impact TIME_WAIT has on short job
> throughput.
>
> While it's certainly possible to configure NFS not to require reserved
> ports, the slightest possibility of a non-root user establishing a
> session to the NFS server kills that as an option.
>
> Turning down TIME_WAIT helps, though the ability to do that only on
> the interface facing the NFS server would be more palatable than doing
> it globally.
>
> Adjusting net.inet.ip.portrange.lowlast does not seem to help.  The
> code at sys/nfs/krpc_subr.c correctly uses ports between
> IPPORT_RESERVED and IPPORT_RESERVED/2 instead of ipport_lowfirstauto
> and ipport_lowlastauto.  But is that the correct place to look for
> NFSv4.1?
>
> How explosive would adding SO_REUSEADDR to the NFS client be?  It's
> not a full solution, but it would handle the TIME_WAIT side of the
> issue.
>
> Even so, there may be no workaround for the simultaneous mount limit
> as long as reserved ports are required.  Solving the negative
> interaction with nullfs seems like the only long-term fix.
>
> What would be a good next step there?
>
> Thanks!
>
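For a sense of scale on the reserved-port limits described in the quoted
message, here is a back-of-the-envelope sketch in Python.  It is only a
rough model under stated assumptions, not a description of what the NFS
client actually does: it assumes the usable range is the 512 ports from
IPPORT_RESERVED/2 up to (but not including) IPPORT_RESERVED, two ports per
job as stated above, and a TIME_WAIT of 60 seconds (twice an assumed MSL of
30 seconds).  The function names are made up for illustration.

```python
# Rough reserved-port budget for one NFS client host.
# All defaults below are assumptions for illustration, not measured values.

IPPORT_RESERVED = 1024  # value of the constant in <netinet/in.h>


def reserved_port_count(low=IPPORT_RESERVED // 2, high=IPPORT_RESERVED):
    """Number of ports in [low, high); 512 with the assumed defaults."""
    return high - low


def max_simultaneous_jobs(ports, ports_per_job=2):
    """Hard ceiling on concurrent jobs if each job holds ports_per_job ports."""
    return ports // ports_per_job


def max_short_job_rate(ports, long_jobs, time_wait_s=60.0, ports_per_job=2):
    """Upper bound on short jobs started per second.

    Assumes long_jobs jobs hold their ports indefinitely, short jobs finish
    almost instantly, and every port a short job used stays unavailable for
    time_wait_s seconds afterwards.  In steady state the ports sitting in
    TIME_WAIT number rate * ports_per_job * time_wait_s, which cannot exceed
    the ports left over by the long jobs.
    """
    free = ports - long_jobs * ports_per_job
    if free <= 0:
        return 0.0
    return free / (ports_per_job * time_wait_s)


if __name__ == "__main__":
    ports = reserved_port_count()
    print("reserved ports:", ports)
    print("max simultaneous jobs:", max_simultaneous_jobs(ports))
    for long_jobs in (0, 64, 128, 192):
        rate = max_short_job_rate(ports, long_jobs)
        print(f"{long_jobs:3d} long jobs -> at most {rate:.2f} short jobs/s")
```

Under these assumptions the ceiling is 256 simultaneous jobs, and the
short-job rate bound shrinks linearly as long-running jobs occupy more of
the range, which matches the interaction between the two limits described
above.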

That's some good information.  However, it must not be the whole story.
I've been nullfs mounting my NFS mounts for years.  For example, right now
on a FreeBSD 12.2-RC2 machine:

> sudo nfsstat -m
Password:
192.168.0.2:/home on /usr/home
nfsv4,minorversion=1,tcp,resvport,soft,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=16777216,timeout=120,retrans=2147483647
> mount | grep home
192.168.0.2:/home on /usr/home (nfs, nfsv4acls)
/usr/home on /iocage/jails/rustup2/root/usr/home (nullfs)

Are you using any mount options with nullfs?  It might be worth trying to
make the read-only mount into read-write, to see if that helps.  And what
does "jls -n" show?
-Alan