Unstable NFS on recent CURRENT

Paul Mather paul at gromit.dlib.vt.edu
Thu Mar 10 14:29:29 UTC 2016


On Mar 9, 2016, at 8:59 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:

> Paul Mather wrote:
>> On Mar 8, 2016, at 7:49 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>> 
>>> Paul Mather wrote:
>>>> On Mar 7, 2016, at 9:55 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>>>> 
>>>>> Paul Mather (forwarded by Ronald Klop) wrote:
>>>>>> On Sun, 06 Mar 2016 02:57:03 +0100, Paul Mather
>>>>>> <paul at gromit.dlib.vt.edu>
>>>>>> wrote:
>>>>>> 
>>>>>>> On my BeagleBone Black running 11-CURRENT (r296162) lately I have been
>>>>>>> having trouble with NFS.  I have been doing a buildworld and
>>>>>>> buildkernel
>>>>>>> with /usr/src and /usr/obj mounted via NFS.  Recently, this process has
>>>>>>> resulted in the buildworld failing at some point, with a variety of
>>>>>>> errors (Segmentation fault; Permission denied; etc.).  Even a "ls -alR"
>>>>>>> of /usr/src doesn't manage to complete.  It errors out thus:
>>>>>>> 
>>>>>>> =====
>>>>>>> [[...]]
>>>>>>> total 0
>>>>>>> ls: ./.svn/pristine/fe: Permission denied
>>>>>>> 
>>>>>>> ./.svn/pristine/ff:
>>>>>>> total 0
>>>>>>> ls: ./.svn/pristine/ff: Permission denied
>>>>>>> ls: fts_read: Permission denied
>>>>>>> =====
>>>>>>> 
>>>>>>> On the console, I get the following:
>>>>>>> 
>>>>>>> newnfs: server 'chumby.chumby.lan' error: fileid changed. fsid
>>>>>>> 94790777:a4385de: expected fileid 0x4, got 0x2. (BROKEN NFS SERVER OR
>>>>>>> MIDDLEWARE)
>>>>>>> 
>>> Oh, I had forgotten this. Here's the comment related to this error.
>>> (about line#445 in sys/fs/nfsclient/nfs_clport.c):
>>>  * BROKEN NFS SERVER OR MIDDLEWARE
>>>  *
>>>  * Certain NFS servers (certain old proprietary filers ca. 2006)
>>>  * or broken middleboxes (e.g. WAN accelerator products) will
>>>  * respond to GETATTR requests with results for a different fileid.
>>>  *
>>>  * The WAN accelerator we've observed not only serves stale cache
>>>  * results for a given file, it also occasionally serves results
>>>  * for wholly different files.  This causes surprising problems;
>>>  * for example the cached size attribute of a file may truncate
>>>  * down and then back up, resulting in zero regions in file
>>>  * contents read by applications.  We observed this reliably with
>>>  * Clang and .c files during parallel build.  A pcap revealed
>>>  * packet fragmentation and GETATTR RPC responses with wholly
>>>  * wrong fileids.
>>> 
>>> If you can connect the client->server with a simple switch (or just an
>>> RJ45 cable), it might be worth testing that way. (I don't recall the name
>>> of the middleware product, but I think it was shipped by one of the major
>>> switch vendors. I also don't know if the product supports NFSv4?)
>>> 
>>> rick
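
For anyone else chasing a similar "fileid changed" message: given the pcap
mention in the comment above, a packet capture is probably the quickest way to
see whether GETATTR replies really do come back with the wrong fileid.  A
rough, untested sketch, assuming the server name from my console message and
the standard NFS port 2049:

=====
# Run on the client (tcpdump picks a default interface if none is given).
# Writes a full-packet capture for later inspection.
tcpdump -s 0 -w nfs-getattr.pcap host chumby.chumby.lan and port 2049
=====

The capture can then be opened in Wireshark and filtered down to the NFS
GETATTR calls and replies, to compare the fileids returned against what the
client expected (0x4 vs. 0x2 in the console message quoted above).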
>> 
>> 
>> Currently, the client is connected to the server via a dumb gigabit switch,
>> so it is already fairly direct.
>> 
>> As for the above error, it appeared on the console only once.  (Sorry if I
>> made it sound like it appears every time.)
>> 
>> I just tried another buildworld attempt via NFS and it failed again.  This
>> time, I get this on the BeagleBone Black console:
>> 
>> 	nfs_getpages: error 13
>> 	vm_fault: pager read error, pid 5401 (install)
>> 
> 13 is EACCES and could be caused by what I mention below. (Any mount of a file
> system on the server can trigger it, unless "-S" is specified as a flag for mountd.)
> 
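
For anyone wanting to confirm this on their own setup, the window Rick
describes below should be reproducible with something along these lines (an
untested sketch; it assumes mountd's pid file is in the stock location,
/var/run/mountd.pid):

=====
# On the client: start a long read of the NFS mount, e.g.
#   ls -alR /usr/src > /dev/null

# On the server: force repeated reloads of /etc/exports, the same SIGHUP
# that mount sends when a file system is mounted.
while true; do
        kill -HUP $(cat /var/run/mountd.pid)
        sleep 1
done

# Without "-S" in mountd_flags the client is likely to log errors such as
# "Permission denied" or "nfs_getpages: error 13" during the reload window;
# with "-S" the nfsd threads are suspended for the reload and it should not.
=====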
>> 
>> The other thing I have noticed is that if I induce heavy load on the NFS
>> server---e.g., by starting a Poudriere bulk build---then that provokes the
>> client to crash much more readily.  For example, I started a NFS buildworld
>> on the BeagleBone Black, and it seemed to be chugging along nicely.  The
>> moment I kicked off a Poudriere build update of my packages on the NFS
>> server, it crashed the buildworld on the NFS client.
>> 
> Try adding "-S" to mountd_flags on the server. Any time file systems are mounted
> (and Poudriere likes to do that, I am told), mount sends a SIGHUP to mountd to
> reload /etc/exports. While /etc/exports is being reloaded, there will be access
> errors for mounts (that are temporarily not exported) unless you specify "-S"
> (which makes mountd suspend the nfsd threads during the reload of /etc/exports).
> 
> rick


Bingo!  I think we may have a winner.  I added that flag to mountd_flags on the server and the "instability" appears to have gone away.
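
For the record, the change on the server amounted to something like this in
/etc/rc.conf (append "-S" to whatever mountd_flags you already have; I believe
the stock default is just "-r"):

=====
# /etc/rc.conf on the NFS server
mountd_flags="-r -S"
=====

followed by a "service mountd restart" so the new flag takes effect.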

It may be that all along the NFS problems on the client simply coincided with Poudriere runs on the server.  I build custom packages for my local machines using Poudriere, so I use it quite a lot.  Maybe the Poudriere port should come with an install-time warning for those also serving NFS that it may provoke this disruption, and suggest adding "-S"?  (Alternatively, maybe "-S" could become the default for mountd_flags?  Is there a downside to using it that makes it unsuitable as a default?)

Anyway, many, many thanks for all the help, Rick.  I'll keep monitoring my BeagleBone Black, but for now it looks as though this has solved the NFS "instability."

Cheers,

Paul.


