NFS unstable with high load on server

William A. Mahaffey III wam at hiwaay.net
Tue Feb 2 14:53:36 UTC 2016


On 02/01/16 08:32, Vick Khera wrote:
> I have a handful of servers at my data center all running FreeBSD 10.2. On
> one of them I have a copy of the FreeBSD sources shared via NFS. When this
> server is running a large poudriere run re-building all the ports I need,
> the clients' NFS mounts become unstable. That is, the clients keep getting
> read failures. The interactive performance of the NFS server is just fine,
> however. The local file system is a ZFS mirror.
>
> What could be causing NFS to be unstable in this situation?
>
> Specifics:
>
> Server "lorax" FreeBSD 10.2-RELEASE-p7 kernel locally compiled, with NFS
> server and ZFS as dynamic kernel modules. 16GB RAM, Xeon 3.1GHz quad
> processor.
>
> The directory /u/lorax1 is a ZFS dataset on a mirrored pool, and is NFS
> exported via the ZFS exports file. I put the FreeBSD sources on this
> dataset and symlink it to /usr/src.
>
>
> Client "bluefish" FreeBSD 10.2-RELEASE-p5 kernel locally compiled, NFS
> client built in to kernel. 32GB RAM, Xeon 3.1GHz quad processor (basically
> same hardware but more RAM).
>
> The directory /n/lorax1 is NFS mounted from lorax via autofs. The NFS
> options are "intr,nolockd". /usr/src is symlinked to the sources in that
> NFS mount.
>
>
> What I observe:
>
> [lorax]~% cd /usr/src
> [lorax]src% svn status
> [lorax]src% w
>   9:12AM  up 12 days, 19:19, 4 users, load averages: 4.43, 4.45, 3.61
> USER       TTY      FROM                      LOGIN@  IDLE WHAT
> vivek      pts/0    vick.int.kcilink.com      8:44AM     - tmux: client
> (/tmp/
> vivek      pts/1    tmux(19747).%0            8:44AM    19 sed
> y%*+%pp%;s%[^_a
> vivek      pts/2    tmux(19747).%1            8:56AM     - w
> vivek      pts/3    tmux(19747).%2            8:56AM     - slogin
> bluefish-prv
> [lorax]src% pwd
> /u/lorax1/usr10/src
>
> So right now the load average is more than 1 per processor on lorax. I can
> quite easily run "svn status" on the source directory, and the interactive
> performance is pretty snappy for editing local files and navigating around
> the file system.
>
>
> On the client:
>
> [bluefish]~% cd /usr/src
> [bluefish]src% pwd
> /n/lorax1/usr10/src
> [bluefish]src% svn status
> svn: E070008: Can't read directory '/n/lorax1/usr10/src/contrib/sqlite3':
> Partial results are valid but processing is incomplete
> [bluefish]src% svn status
> svn: E070008: Can't read directory '/n/lorax1/usr10/src/lib/libfetch':
> Partial results are valid but processing is incomplete
> [bluefish]src% svn status
> svn: E070008: Can't read directory
> '/n/lorax1/usr10/src/release/picobsd/tinyware/msg': Partial results are
> valid but processing is incomplete
> [bluefish]src% w
>   9:14AM  up 93 days, 23:55, 1 user, load averages: 0.10, 0.15, 0.15
> USER       TTY      FROM                      LOGIN@  IDLE WHAT
> vivek      pts/0    lorax-prv.kcilink.com     8:56AM     - w
> [bluefish]src% df .
> Filesystem          1K-blocks    Used     Avail Capacity  Mounted on
> lorax-prv:/u/lorax1 932845181 6090910 926754271     1%    /n/lorax1
>
>
> What I see is more or less random failures to read the NFS volume. When the
> server is not so busy running poudriere builds, the client never has any
> failures.
>
> I also observe this kind of failure doing buildworld or installworld on
> the client when the server is busy -- I get strange random failures reading
> the files causing the build or install to fail.
>
> My workaround is to not do build/installs on client machines when the NFS
> server is busy doing large jobs like building all packages, but there is
> definitely something wrong here I'd like to fix. I observe this on all the
> local NFS clients. I rebooted the server before to try to clear this up but
> it did not fix it.
>
> Any help would be appreciated.
> _______________________________________________
> freebsd-questions at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-questions
> To unsubscribe, send any mail to "freebsd-questions-unsubscribe at freebsd.org"
>
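One thing worth checking (this is a guess, not something I've verified on
your setup) is whether the server's nfsd threads are all tied up during the
poudriere run. The stock rc.conf default is only a few threads, so a heavily
loaded server can starve remote clients even while local interactive use
stays snappy. Something along these lines, with the thread count being an
illustrative value to tune rather than a known-good number:

```shell
# Untested sketch for the NFS server (lorax). The "-n 64" thread count
# is a guess to experiment with, not a recommendation.

# /etc/rc.conf on the server: run more nfsd threads than the default
# so a busy poudriere run is less likely to starve NFS clients.
nfs_server_flags="-u -t -n 64"

# Then restart nfsd and watch the RPC counters on both sides:
#   service nfsd restart
#   nfsstat -s    # server-side RPC counters
#   nfsstat -c    # on a client: rising Retries/TimedOut suggests drops
```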


I notice similar issues. I have some in-house code that I 
native-recompile nightly on a FreeBSD 9.3R box running an unmirrored 
8-HDD ZFS pool, and also recompile across the LAN on a Linux box with 
the Intel compiler suite. I do across-the-LAN nightly backups as well. 
The across-the-LAN compiles & backups both use NFS to access other boxen 
on the LAN, notably including the dev box doing the native compile under 
FreeBSD 9.3R. If these processes overlap (too much, or at all), the 
native compiles often fail and/or the backups barf. I thought it might 
be an NFS/ZFS issue. Earlier boxen didn't use ZFS -- they were Linux or 
SGI (snif) -- and they bogged down mightily if the processes overlapped, 
but they did finish cleanly.
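In case it helps anyone poking at this: on the client side it may also be
worth pinning the mount to TCP and capping the transfer sizes, since UDP
mounts are prone to dropped fragments when the server is swamped. A
hypothetical autofs map entry (the rsize/wsize values are illustrative
guesses, and "lorax1" / "lorax-prv" just match the original poster's setup):

```shell
# Hypothetical indirect autofs map entry for the client (bluefish) --
# pin NFSv3 over TCP alongside the existing intr,nolockd options; TCP
# avoids UDP fragment loss under load, and smaller rsize/wsize can be
# gentler on a busy server. Tune or drop the size caps as needed.
lorax1    -nfsv3,tcp,intr,nolockd,rsize=32768,wsize=32768    lorax-prv:/u/lorax1
```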


-- 

	William A. Mahaffey III

  ----------------------------------------------------------------------

	"The M1 Garand is without doubt the finest implement of war
	 ever devised by man."
                            -- Gen. George S. Patton Jr.


