Unstable NFS on recent CURRENT

Rick Macklem rmacklem at uoguelph.ca
Fri Mar 11 01:08:20 UTC 2016


Paul Mather wrote:
> On Mar 9, 2016, at 8:59 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> 
> > Paul Mather wrote:
> >> On Mar 8, 2016, at 7:49 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> >> 
> >>> Paul Mather wrote:
> >>>> On Mar 7, 2016, at 9:55 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> >>>> 
> >>>>> Paul Mather (forwarded by Ronald Klop) wrote:
> >>>>>> On Sun, 06 Mar 2016 02:57:03 +0100, Paul Mather
> >>>>>> <paul at gromit.dlib.vt.edu>
> >>>>>> wrote:
> >>>>>> 
> >>>>>>> On my BeagleBone Black running 11-CURRENT (r296162) lately I have
> >>>>>>> been having trouble with NFS.  I have been doing a buildworld and
> >>>>>>> buildkernel with /usr/src and /usr/obj mounted via NFS.  Recently,
> >>>>>>> this process has resulted in the buildworld failing at some point,
> >>>>>>> with a variety of errors (Segmentation fault; Permission denied;
> >>>>>>> etc.).  Even a "ls -alR" of /usr/src doesn't manage to complete.  It
> >>>>>>> errors out thus:
> >>>>>>> 
> >>>>>>> =====
> >>>>>>> [[...]]
> >>>>>>> total 0
> >>>>>>> ls: ./.svn/pristine/fe: Permission denied
> >>>>>>> 
> >>>>>>> ./.svn/pristine/ff:
> >>>>>>> total 0
> >>>>>>> ls: ./.svn/pristine/ff: Permission denied
> >>>>>>> ls: fts_read: Permission denied
> >>>>>>> =====
> >>>>>>> 
> >>>>>>> On the console, I get the following:
> >>>>>>> 
> >>>>>>> newnfs: server 'chumby.chumby.lan' error: fileid changed. fsid
> >>>>>>> 94790777:a4385de: expected fileid 0x4, got 0x2. (BROKEN NFS SERVER OR
> >>>>>>> MIDDLEWARE)
> >>>>>>> 
> >>> Oh, I had forgotten this. Here's the comment related to this error.
> >>> (about line#445 in sys/fs/nfsclient/nfs_clport.c):
> >>>  * BROKEN NFS SERVER OR MIDDLEWARE
> >>>  *
> >>>  * Certain NFS servers (certain old proprietary filers ca.
> >>>  * 2006) or broken middleboxes (e.g. WAN accelerator products)
> >>>  * will respond to GETATTR requests with results for a
> >>>  * different fileid.
> >>>  *
> >>>  * The WAN accelerator we've observed not only serves stale
> >>>  * cache results for a given file, it also occasionally serves
> >>>  * results for wholly different files.  This causes surprising
> >>>  * problems; for example the cached size attribute of a file
> >>>  * may truncate down and then back up, resulting in zero
> >>>  * regions in file contents read by applications.  We observed
> >>>  * this reliably with Clang and .c files during parallel build.
> >>>  * A pcap revealed packet fragmentation and GETATTR RPC
> >>>  * responses with wholly wrong fileids.
> >>> 
> >>> If you can connect the client->server with a simple switch (or just an
> >>> RJ45 cable), it might be worth testing that way. (I don't recall the
> >>> name of the middleware product, but I think it was shipped by one of
> >>> the major switch vendors. I also don't know if the product supports
> >>> NFSv4?)
> >>> 
> >>> rick
> >> 
> >> 
> >> Currently, the client is connected to the server via a dumb gigabit
> >> switch, so it is already fairly direct.
> >> 
> >> As for the above error, it appeared on the console only once.  (Sorry if I
> >> made it sound like it appears every time.)
> >> 
> >> I just tried another buildworld attempt via NFS and it failed again.  This
> >> time, I get this on the BeagleBone Black console:
> >> 
> >> 	nfs_getpages: error 13
> >> 	vm_fault: pager read error, pid 5401 (install)
> >> 
> > 13 is EACCES and could be caused by what I mention below. (Any mount of a
> > file system on the server unless "-S" is specified as a flag for mountd.)
> > 
> >> 
> >> The other thing I have noticed is that if I induce heavy load on the NFS
> >> server---e.g., by starting a Poudriere bulk build---then that provokes the
> >> client to crash much more readily.  For example, I started a NFS
> >> buildworld on the BeagleBone Black, and it seemed to be chugging along
> >> nicely.  The moment I kicked off a Poudriere build update of my packages
> >> on the NFS server, it crashed the buildworld on the NFS client.
> >> 
> > Try adding "-S" to mountd_flags on the server. Any time file systems are
> > mounted (and Poudriere likes to do that, I am told), mount sends a SIGHUP
> > to mountd to reload /etc/exports. While /etc/exports is being reloaded,
> > there will be access errors for mounts (that are temporarily not exported)
> > unless you specify "-S" (which makes mountd suspend the nfsd threads during
> > the reload of /etc/exports).
> > 
> > rick
> 
> 
> Bingo!  I think we may have a winner.  I added that flag to mountd_flags on
> the server and the "instability" appears to have gone away.
> 
> It may be that all along the NFS problems on the client just coincided with
> Poudriere runs on the server.  I build custom packages for my local machines
> using Poudriere so I use it quite a lot.  Maybe the Poudriere port should
> come with a warning at install to those using NFS that it may provoke
> disruption and suggest the addition of "-S"?  (Alternatively, maybe "-S"
> could become a default for mountd_flags?  Is there a downside from using it
> that means making it a default option is unsuitable?)
> 
Well, the first time I proposed "-S", the "collective" felt it wasn't the appropriate
solution to the "export reload" problem. The second time, the "collective" agreed
that it was ok as a non-default option. (Part of this story was an alternative to
mountd called nfse which did update exports atomically, but it never made it into
FreeBSD.) The only downside to making it a default is that it does change behaviour
and some might consider that a POLA violation. Others would consider it just a bug fix.
There was one report of long delays before exports got updated on a very busy server.
(I have a one-line patch that fixes this, but that won't be committed into FreeBSD-current
until April.)

Now that "-S" has been in FreeBSD for a couple of years, I am planning on asking
the "collective" (I usually post these kinds of things on freebsd-fs@) to make it the
default in FreeBSD-current, because this problem seems to crop up fairly frequently.
I will probably post w.r.t. this in April when I can again do svn commits.

I only recently found out that Poudriere does mounts and causes this problem.
I may also commit a man page update (which can be MFC'd) that mentions that if you
are using Poudriere you want this flag.
Having the same thing mentioned in the Poudriere port install might be nice, too.
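
For anyone else hitting this, a minimal sketch of the server-side change
discussed above. It assumes the server's existing mountd_flags is "-r" (that
value is an assumption, not something stated in this thread; check
/etc/defaults/rc.conf and your own /etc/rc.conf) and simply appends "-S":

```shell
# /etc/rc.conf on the NFS server (sketch only).
# "-r" is an assumed existing value; keep whatever flags you already
# have and append -S so mountd suspends the nfsd threads while it
# reloads /etc/exports.
mountd_flags="-r -S"
```

mountd needs to be restarted (for example with "service mountd restart") before
the new flag takes effect.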

Thanks for testing this, rick

> Anyway, many, many thanks for all the help, Rick.  I'll keep monitoring my
> BeagleBone Black, but it looks for now that this has solved the NFS
> "instability."
> 
> Cheers,
> 
> Paul.
> 
> 

