Unstable NFS on recent CURRENT

Paul Mather paul at gromit.dlib.vt.edu
Wed Mar 9 16:12:44 UTC 2016


On Mar 8, 2016, at 7:49 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:

> Paul Mather wrote:
>> On Mar 7, 2016, at 9:55 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>> 
>>> Paul Mather (forwarded by Ronald Klop) wrote:
>>>> On Sun, 06 Mar 2016 02:57:03 +0100, Paul Mather <paul at gromit.dlib.vt.edu>
>>>> wrote:
>>>> 
>>>>> On my BeagleBone Black running 11-CURRENT (r296162) lately I have been
>>>>> having trouble with NFS.  I have been doing a buildworld and buildkernel
>>>>> with /usr/src and /usr/obj mounted via NFS.  Recently, this process has
>>>>> resulted in the buildworld failing at some point, with a variety of
>>>>> errors (Segmentation fault; Permission denied; etc.).  Even a "ls -alR"
>>>>> of /usr/src doesn't manage to complete.  It errors out thus:
>>>>> 
>>>>> =====
>>>>> [[...]]
>>>>> total 0
>>>>> ls: ./.svn/pristine/fe: Permission denied
>>>>> 
>>>>> ./.svn/pristine/ff:
>>>>> total 0
>>>>> ls: ./.svn/pristine/ff: Permission denied
>>>>> ls: fts_read: Permission denied
>>>>> =====
>>>>> 
>>>>> On the console, I get the following:
>>>>> 
>>>>> newnfs: server 'chumby.chumby.lan' error: fileid changed. fsid
>>>>> 94790777:a4385de: expected fileid 0x4, got 0x2. (BROKEN NFS SERVER OR
>>>>> MIDDLEWARE)
>>>>> 
> Oh, I had forgotten this. Here's the comment related to this error.
> (about line#445 in sys/fs/nfsclient/nfs_clport.c):
>      * BROKEN NFS SERVER OR MIDDLEWARE
>      *
>      * Certain NFS servers (certain old proprietary filers ca.
>      * 2006) or broken middleboxes (e.g. WAN accelerator products)
>      * will respond to GETATTR requests with results for a
>      * different fileid.
>      *
>      * The WAN accelerator we've observed not only serves stale
>      * cache results for a given file, it also occasionally serves
>      * results for wholly different files.  This causes surprising
>      * problems; for example the cached size attribute of a file
>      * may truncate down and then back up, resulting in zero
>      * regions in file contents read by applications.  We observed
>      * this reliably with Clang and .c files during parallel build.
>      * A pcap revealed packet fragmentation and GETATTR RPC
>      * responses with wholly wrong fileids.
> 
> If you can connect the client to the server through a simple switch (or just an RJ45 cable), it
> might be worth testing that way. (I don't recall the name of the middleware product, but
> I think it was shipped by one of the major switch vendors. I also don't know whether the product
> supports NFSv4.)
> 
> rick


Currently, the client is connected to the server via a dumb gigabit switch, so it is already fairly direct.

As for the above error, it appeared on the console only once.  (Sorry if I made it sound like it appears every time.)

I just made another attempt at a buildworld via NFS and it failed again.  This time, I got this on the BeagleBone Black console:

	nfs_getpages: error 13
	vm_fault: pager read error, pid 5401 (install)


The other thing I have noticed is that if I induce heavy load on the NFS server (e.g., by starting a Poudriere bulk build), the buildworld on the client fails much more readily.  For example, I started an NFS buildworld on the BeagleBone Black, and it seemed to be chugging along nicely.  The moment I kicked off a Poudriere build update of my packages on the NFS server, the buildworld on the NFS client crashed.

I have had problems with swap on FreeBSD/arm before.  Swapping to a file does not appear to work for me.  As a result, I switched to swapping to a partition on the SD card.  Maybe this is unreliable, too?

Cheers,

Paul.


