NFS unstable with high load on server
Vick Khera
vivek at khera.org
Mon Feb 1 14:25:54 UTC 2016
I have a handful of servers at my data center all running FreeBSD 10.2. On
one of them I have a copy of the FreeBSD sources shared via NFS. When this
server is running a large poudriere run re-building all the ports I need,
the clients' NFS mounts become unstable. That is, the clients keep getting
read failures. The interactive performance of the NFS server is just fine,
however. The local file system is a ZFS mirror.
What could be causing NFS to be unstable in this situation?
Specifics:
Server "lorax" FreeBSD 10.2-RELEASE-p7 kernel locally compiled, with NFS
server and ZFS as dynamic kernel modules. 16GB RAM, Xeon 3.1GHz quad
processor.
The directory /u/lorax1 is a ZFS dataset on a mirrored pool, and is NFS
exported via the ZFS exports file. I put the FreeBSD sources on this
dataset and symlinked /usr/src to them.
Client "bluefish" FreeBSD 10.2-RELEASE-p5 kernel locally compiled, NFS
client built in to kernel. 32GB RAM, Xeon 3.1GHz quad processor (basically
same hardware but more RAM).
The directory /n/lorax1 is NFS mounted from lorax via autofs. The NFS
options are "intr,nolockd". /usr/src is symlinked to the sources in that
NFS mount.
What I observe:
[lorax]~% cd /usr/src
[lorax]src% svn status
[lorax]src% w
9:12AM up 12 days, 19:19, 4 users, load averages: 4.43, 4.45, 3.61
USER TTY FROM LOGIN@ IDLE WHAT
vivek pts/0 vick.int.kcilink.com 8:44AM - tmux: client (/tmp/
vivek pts/1 tmux(19747).%0 8:44AM 19 sed y%*+%pp%;s%[^_a
vivek pts/2 tmux(19747).%1 8:56AM - w
vivek pts/3 tmux(19747).%2 8:56AM - slogin bluefish-prv
[lorax]src% pwd
/u/lorax1/usr10/src
So right now the load average is more than 1 per processor on lorax. I can
quite easily run "svn status" on the source directory, and the interactive
performance is pretty snappy for editing local files and navigating around
the file system.
On the client:
[bluefish]~% cd /usr/src
[bluefish]src% pwd
/n/lorax1/usr10/src
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/contrib/sqlite3':
Partial results are valid but processing is incomplete
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/lib/libfetch':
Partial results are valid but processing is incomplete
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/release/picobsd/tinyware/msg':
Partial results are valid but processing is incomplete
[bluefish]src% w
9:14AM up 93 days, 23:55, 1 user, load averages: 0.10, 0.15, 0.15
USER TTY FROM LOGIN@ IDLE WHAT
vivek pts/0 lorax-prv.kcilink.com 8:56AM - w
[bluefish]src% df .
Filesystem 1K-blocks Used Avail Capacity Mounted on
lorax-prv:/u/lorax1 932845181 6090910 926754271 1% /n/lorax1
What I see is more or less random failures to read the NFS volume. When the
server is not so busy running poudriere builds, the client never has any
failures.
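To put a number on how often the client fails, a loop along these lines might help. This is only a sketch: count_failures is a hypothetical helper name, not an existing tool, and it assumes that "svn status" exits non-zero when it hits the E070008 error.

```shell
#!/bin/sh
# count_failures N CMD [ARGS...]
# Hypothetical helper: run CMD N times and print how many runs
# exited with a non-zero status. Output of CMD itself is discarded.
count_failures() {
    n=$1
    shift
    fails=0
    i=0
    while [ "$i" -lt "$n" ]; do
        # Assumption: a failed read makes the command exit non-zero.
        if ! "$@" >/dev/null 2>&1; then
            fails=$((fails + 1))
        fi
        i=$((i + 1))
    done
    echo "$fails"
}

# Example (run on the client, once while the server is idle and once
# while it is busy):
#   count_failures 20 svn status /usr/src
```

Comparing the count from an idle server with the count during a poudriere run would show whether the failures really track server load.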
I also observe this kind of failure doing buildworld or installworld on
the client when the server is busy -- I get strange random failures reading
the files, which cause the build or install to fail.
My workaround is to not do builds/installs on client machines when the NFS
server is busy doing large jobs like building all packages, but there is
definitely something wrong here that I'd like to fix. I observe this on all
the local NFS clients. I previously rebooted the server to try to clear this
up, but that did not fix it.
Any help would be appreciated.