Major issues with nfsv4
asomers at freebsd.org
Fri Dec 11 23:08:24 UTC 2020
On Fri, Dec 11, 2020 at 2:52 PM J David <j.david.lists at gmail.com> wrote:
> Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not
> resolve our issue. But I've narrowed down the problem to a harmful
> interaction between NFSv4 and nullfs.
> These FreeBSD NFS clients form a pool of application servers that run
> jobs for the application. A given job needs read-write access to its
> data and read-only access to the set of binaries it needs to run.
> The job data is horizontally partitioned across a set of directory
> trees spread over one set of NFS servers. A separate set of NFS
> servers store the read-only binary roots.
> The jobs are assigned to these machines by a scheduler. A job might
> take five milliseconds or five days.
> Historically, we have mounted the job data trees and the various
> binary roots on each application server over NFSv3. When a job
> starts, its setup binds the needed data and binaries into a jail via
> nullfs, then runs the job in the jail. This approach has worked
> perfectly for 10+ years.
> After I switched a server to NFSv4.1 to test that recommendation, it
> started having the same load problems as NFSv4. As a test, I altered
> it to mount NFS directly in the jails for both the data and the
> binaries. As "nullfs-NFS" jobs finished and "direct NFS" jobs
> started, the load and CPU usage started to fall dramatically.
> The critical problem with this approach is that privileged TCP ports
> are a finite resource. At two per job, this creates two issues.
> First, it places a hard limit on simultaneous jobs per server that is
> inconsistent with the hardware's capabilities. Second, due to
> TIME_WAIT, it places a hard limit on job throughput. In practice,
> these limits also interfere with each other; the more simultaneous
> long jobs are running, the more impact TIME_WAIT has on short-job
> throughput.
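A quick back-of-the-envelope sketch of why both limits bite. The port range and TIME_WAIT figures below are assumptions based on FreeBSD defaults, not measurements from this system:

```python
# Reserved-port budget per client, assuming the krpc range of
# IPPORT_RESERVED/2 (512) up to IPPORT_RESERVED (1024), exclusive.
ports_available = 1024 - 512          # 512 reserved ports

mounts_per_job = 2                    # data mount + binaries mount
max_simultaneous_jobs = ports_available // mounts_per_job

# Throughput ceiling from TIME_WAIT: each finished connection holds
# its port for 2*MSL (60 s with the default net.inet.tcp.msl=30000 ms).
time_wait_secs = 60
max_jobs_per_sec = ports_available / (mounts_per_job * time_wait_secs)

print(max_simultaneous_jobs)          # 256
print(round(max_jobs_per_sec, 2))     # 4.27
```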
> While it's certainly possible to configure NFS not to require reserved
> ports, the slightest possibility of a non-root user establishing a
> session to the NFS server kills that as an option.
> Turning down TIME_WAIT helps, though the ability to do that only on
> the interface facing the NFS server would be more palatable than doing
> it globally.
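For reference, TIME_WAIT on FreeBSD lasts 2*MSL and is controlled by a single global sysctl; there is no per-interface knob, which is the limitation noted above. A sketch of the knob involved (the values shown are the defaults, not a recommendation):

```shell
# TIME_WAIT lasts 2*MSL; FreeBSD's default MSL is 30000 ms,
# giving the familiar 60-second TIME_WAIT period.
sysctl net.inet.tcp.msl
# net.inet.tcp.msl: 30000

# Lowering it shortens TIME_WAIT globally -- there is no way to scope
# this to only the interface facing the NFS server:
sysctl net.inet.tcp.msl=5000
```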
> Adjusting net.inet.ip.portrange.lowlast does not seem to help. The
> code at sys/nfs/krpc_subr.c correctly uses ports between
> IPPORT_RESERVED and IPPORT_RESERVED/2 instead of ipport_lowfirstauto
> and ipport_lowlastauto. But is that the correct place to look for
> this?
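For illustration, the port selection in question amounts to scanning the privileged range below IPPORT_RESERVED. The following is a userland simulation of that range, not the kernel code itself, and the exact scan order in krpc_subr.c may differ:

```python
IPPORT_RESERVED = 1024  # matches the constant from <netinet/in.h>

def pick_reserved_port(in_use):
    """Simulate choosing a client port from the kernel RPC range:
    between IPPORT_RESERVED/2 and IPPORT_RESERVED, ignoring
    ipport_lowfirstauto/ipport_lowlastauto entirely."""
    for port in range(IPPORT_RESERVED - 1, IPPORT_RESERVED // 2 - 1, -1):
        if port not in in_use:
            return port
    raise OSError("EADDRNOTAVAIL: reserved port range exhausted")

# With nothing in use, a port in the upper half of the range is chosen:
print(pick_reserved_port(set()))            # 1023
# Once all 512 ports are busy or in TIME_WAIT, allocation fails,
# regardless of how net.inet.ip.portrange.* is tuned.
```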
> How explosive would adding SO_REUSEADDR to the NFS client be? It's
> not a full solution, but it would handle the TIME_WAIT side of the
> problem.
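To illustrate what SO_REUSEADDR changes: with the option set, bind() can claim an address that is still held by a connection in TIME_WAIT instead of failing with EADDRINUSE. A small userland sketch (binding a genuinely reserved port requires root, so this uses an unprivileged kernel-chosen port and only demonstrates the socket option, not the proposed NFS client change):

```python
import socket

# Without SO_REUSEADDR, bind() to an address lingering in TIME_WAIT
# fails with EADDRINUSE; with it set, the bind succeeds immediately.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("127.0.0.1", 0))  # port 0: let the kernel pick (no root needed)

assert s.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
print(s.getsockname()[1] > 0)  # True
s.close()
```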
> Even so, there may be no workaround for the simultaneous mount limit
> as long as reserved ports are required. Solving the negative
> interaction with nullfs seems like the only long-term fix.
> What would be a good next step there?
That's some good information. However, it must not be the whole story.
I've been nullfs mounting my NFS mounts for years. For example, right now
on a FreeBSD 12.2-RC2 machine:
> sudo nfsstat -m
192.168.0.2:/home on /usr/home
> mount | grep home
192.168.0.2:/home on /usr/home (nfs, nfsv4acls)
/usr/home on /iocage/jails/rustup2/root/usr/home (nullfs)
Are you using any mount options with nullfs? It might be worth trying to
make the read-only mount into read-write, to see if that helps. And what
does "jls -n" show?
More information about the freebsd-fs mailing list