Major issues with NFSv4

Alan Somers asomers at freebsd.org
Fri Dec 11 23:35:35 UTC 2020


On Fri, Dec 11, 2020 at 4:28 PM Rick Macklem <rmacklem at uoguelph.ca> wrote:

> J David wrote:
> >Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not
> >resolve our issue.  But I've narrowed down the problem to a harmful
> >interaction between NFSv4 and nullfs.
> I am afraid I know nothing about nullfs and jails. I suspect it will be
> something related to when file descriptors in the NFS client mount
> get closed.
>
> The NFSv4 Open is a Windows-style Open lock and has nothing to do with
> a POSIX open. Since only one of these can exist for each
> <client process, file> tuple, the NFSv4 Close must be delayed until
> all POSIX Opens on the file have been closed, including open file
> descriptors inherited by child processes.
>

Does it make a difference whether the files are opened read-only or
read-write?  My longstanding practice has been to never use NFS to store
object files while compiling.  I do that for performance reasons, and I
didn't think that nullfs had anything to do with it (but maybe it does).
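
To make the point about inherited descriptors concrete, here is a minimal C
sketch (my own illustration; the path is hypothetical): the parent's close()
does not drop the last reference to the open file, so the client cannot issue
the NFSv4 Close until the child also closes its copy or exits.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fd;
        pid_t pid;

        /* Hypothetical path on an NFSv4 mount. */
        fd = open("/mnt/nfs/jobdata/somefile", O_RDWR);
        if (fd == -1) {
            perror("open");
            exit(1);
        }

        pid = fork();
        if (pid == 0) {
            /* The child inherits fd, so the file stays open. */
            sleep(30);
            close(fd);      /* only now can the client's NFSv4 Close go out */
            _exit(0);
        }

        close(fd);          /* the parent's close does not release the file */
        waitpid(pid, NULL, 0);
        return (0);
    }

NFSv3 carries no Open/Close state at all, which is presumably part of why the
same job pattern never surfaced anything like this over ten-plus years.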


>
> Someone else recently reported problems using nullfs and vnet jails.
>
> >These FreeBSD NFS clients form a pool of application servers that run
> >jobs for the application.  A given job needs read-write access to its
> >data and read-only access to the set of binaries it needs to run.
> >
> >The job data is horizontally partitioned across a set of directory
> >trees spread over one set of NFS servers.  A separate set of NFS
> >servers store the read-only binary roots.
> >
> >The jobs are assigned to these machines by a scheduler.  A job might
> >take five milliseconds or five days.
> >
> >Historically, we have mounted the job data trees and the various
> >binary roots on each application server over NFSv3.  When a job
> >starts, its setup binds the needed data and binaries into a jail via
> >nullfs, then runs the job in the jail.  This approach has worked
> >perfectly for 10+ years.
> Well, NFSv3 is not going away any time soon, so if you don't need
> any of the additional features it offers...
>
> >After I switched a server to NFSv4.1 to test that recommendation, it
> >started having the same load problems as NFSv4.  As a test, I altered
> >it to mount NFS directly in the jails for both the data and the
> >binaries.  As "nullfs-NFS" jobs finished and "direct NFS" jobs
> >started, the load and CPU usage started to fall dramatically.
> Good work isolating the problem. I may try playing with NFSv4/nullfs
> someday soon and see if I can break it.
>
> >The critical problem with this approach is that privileged TCP ports
> >are a finite resource.  At two per job, this creates two issues.
> >
> >First, there's a hard limit on simultaneous jobs per server that is
> >inconsistent with the hardware's capabilities.  Second, due to
> >TIME_WAIT, it places a hard limit on job throughput.  In practice,
> >these limits also interfere with each other; the more simultaneous
> >long jobs are running, the more impact TIME_WAIT has on short job
> >throughput.
> >
> >While it's certainly possible to configure NFS not to require reserved
> >ports, the slightest possibility of a non-root user establishing a
> >session to the NFS server kills that as an option.
> Personally, I've never thought the reserved port# requirement provided
> any real security for most situations. Unless you set "vfs.usermount=1"
> only root can do the mount. For non-root to mount the NFS server
> when "vfs.usermount=0", a user would have to run their own custom hacked
> userland NFS client. Although doable, I have never heard of it being done.
>

There are a few out there.  For example, https://github.com/sahlberg/libnfs.
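
On the reserved-port question, it may help to spell out what the check does
and does not prevent.  An unprivileged process can always reach nfsd's TCP
port; the only thing it cannot do is bind a source port below 1024.  A short
C sketch (the server address is hypothetical):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct sockaddr_in sin;
        int s;

        if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1) {
            perror("socket");
            return (1);
        }

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(2049);                      /* nfsd */
        inet_pton(AF_INET, "192.0.2.10", &sin.sin_addr); /* hypothetical server */

        /*
         * No root and no bind() to a low port: the kernel picks an
         * ephemeral (>1023) source port.  If the export does not insist on
         * reserved ports, a userland client built on a socket like this
         * (libnfs, for instance) can speak NFS as an ordinary user.  With
         * the requirement on, only non-root users on otherwise-trusted
         * client hosts are kept out; root on any other machine on the
         * network can still bind a low port.
         */
        if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == 0)
            printf("reached nfsd from an unprivileged source port\n");
        close(s);
        return (0);
    }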


>
> rick
>
> Turning down TIME_WAIT helps, though the ability to do that only on
> the interface facing the NFS server would be more palatable than doing
> it globally.
>
> Adjusting net.inet.ip.portrange.lowlast does not seem to help.  The
> code at sys/nfs/krpc_subr.c correctly uses ports between
> IPPORT_RESERVED and IPPORT_RESERVED/2 instead of ipport_lowfirstauto
> and ipport_lowlastauto.  But is that the correct place to look for
> NFSv4.1?
>
> How explosive would adding SO_REUSEADDR to the NFS client be?  It's
> not a full solution, but it would handle the TIME_WAIT side of the
> issue.
>
> Even so, there may be no workaround for the simultaneous mount limit
> as long as reserved ports are required.  Solving the negative
> interaction with nullfs seems like the only long-term fix.
>
> What would be a good next step there?
>
> Thanks!
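
Some rough numbers for the quoted points above, assuming FreeBSD defaults (my
arithmetic, not from the thread): the range between IPPORT_RESERVED/2 and
IPPORT_RESERVED is ports 512-1023, about 512 in all.  At two connections per
job that allows on the order of 256 simultaneous jobs per client, and with
TIME_WAIT holding each closed connection for 2*MSL = 60 seconds
(net.inet.tcp.msl defaults to 30000 ms), turnover toward a given server is
capped at roughly 512/60, call it 8 or 9 new connections per second.  As for
SO_REUSEADDR, the sketch below shows the general idea in userland terms; a
kernel-side client would do the equivalent through sosetopt()/sobind() rather
than the socket syscalls, so treat this as an illustration of the strategy,
not the code in question.  Even with the option set, whether the stack will
let a new connection reuse the exact four-tuple still sitting in TIME_WAIT is
a separate question, which is presumably why it is described above as only a
partial solution.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /*
     * Bind s to a port in the range discussed above (IPPORT_RESERVED/2 up
     * to IPPORT_RESERVED-1), setting SO_REUSEADDR first so that local
     * ports lingering in TIME_WAIT are candidates again.  Binding below
     * 1024 still needs privilege, as it does for the in-kernel client.
     * Returns 0 on success, -1 if the whole range is busy.
     */
    static int
    bind_resv_reuse(int s)
    {
        struct sockaddr_in sin;
        in_port_t port;
        int on = 1;

        if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1)
            return (-1);

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_ANY);

        for (port = IPPORT_RESERVED - 1; port >= IPPORT_RESERVED / 2; port--) {
            sin.sin_port = htons(port);
            if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == 0)
                return (0);     /* got a reserved port */
        }
        return (-1);            /* range exhausted */
    }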

