[Bug 206855] NFS errors from ZFS backed file system when server under load
bugzilla-noreply at freebsd.org
Tue Feb 2 17:22:16 UTC 2016
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206855
Bug ID: 206855
Summary: NFS errors from ZFS backed file system when server under load
Product: Base System
Version: 10.2-RELEASE
Hardware: Any
OS: Any
Status: New
Severity: Affects Some People
Priority: ---
Component: kern
Assignee: freebsd-bugs at FreeBSD.org
Reporter: vivek at khera.org
I posted a question about NFS errors (unable to read directories or files) when
the NFS server comes under high load, and at least two other people reported
that they observe the same types of failures. The thread is at
https://lists.freebsd.org/pipermail/freebsd-questions/2016-February/270292.html
This might be related to bug #132068.
It seems that sharing a ZFS dataset over NFS is not stable when the server is
under high load. Here's my original question/bug report:
I have a handful of servers at my data center, all running FreeBSD 10.2. On one
of them I have a copy of the FreeBSD sources shared via NFS. When this server
is running a large poudriere job rebuilding all the ports I need, the clients'
NFS mounts become unstable: the clients keep getting read failures. The
interactive performance of the NFS server itself remains fine, however. The
local file system is a ZFS mirror.
What could be causing NFS to be unstable in this situation?
Specifics:
Server "lorax" FreeBSD 10.2-RELEASE-p7 kernel locally compiled, with NFS server
and ZFS as dynamic kernel modules. 16GB RAM, Xeon 3.1GHz quad processor.
The directory /u/lorax1 a ZFS dataset on a mirrored pool, and is NFS exported
via the ZFS exports file. I put the FreeBSD sources on this dataset and symlink
to /usr/src.
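For reference, the export is done the ZFS way rather than through /etc/exports;
the setup is along these lines (the pool/dataset name and network here are made
up, not my actual values):

  # share the dataset to the private network; mountd then picks the
  # resulting entry up from /etc/zfs/exports
  zfs set sharenfs="-network 192.168.1.0/24" tank/lorax1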
Client "bluefish" FreeBSD 10.2-RELEASE-p5 kernel locally compiled, NFS client
built in to kernel. 32GB RAM, Xeon 3.1GHz quad processor (basically same
hardware but more RAM).
The directory /n/lorax1 is NFS mounted from lorax via autofs. The NFS options
are "intr,nolockd". /usr/src is symlinked to the sources in that NFS mount.
What I observe:
[lorax]~% cd /usr/src
[lorax]src% svn status
[lorax]src% w
9:12AM up 12 days, 19:19, 4 users, load averages: 4.43, 4.45, 3.61
USER TTY FROM LOGIN@ IDLE WHAT
vivek pts/0 vick.int.kcilink.com 8:44AM - tmux: client (/tmp/
vivek pts/1 tmux(19747).%0 8:44AM 19 sed y%*+%pp%;s%[^_a
vivek pts/2 tmux(19747).%1 8:56AM - w
vivek pts/3 tmux(19747).%2 8:56AM - slogin bluefish-prv
[lorax]src% pwd
/u/lorax1/usr10/src
So right now the load average is more than 1 per processor on lorax. I can
quite easily run "svn status" on the source directory, and the interactive
performance is pretty snappy for editing local files and navigating around the
file system.
On the client:
[bluefish]~% cd /usr/src
[bluefish]src% pwd
/n/lorax1/usr10/src
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/contrib/sqlite3':
Partial results are valid but processing is incomplete
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/lib/libfetch': Partial
results are valid but processing is incomplete
[bluefish]src% svn status
svn: E070008: Can't read directory
'/n/lorax1/usr10/src/release/picobsd/tinyware/msg': Partial results are valid
but processing is incomplete
[bluefish]src% w
9:14AM up 93 days, 23:55, 1 user, load averages: 0.10, 0.15, 0.15
USER TTY FROM LOGIN@ IDLE WHAT
vivek pts/0 lorax-prv.kcilink.com 8:56AM - w
[bluefish]src% df .
Filesystem 1K-blocks Used Avail Capacity Mounted on
lorax-prv:/u/lorax1 932845181 6090910 926754271 1% /n/lorax1
What I see is more or less random failures to read the NFS volume. When the
server is not so busy running poudriere builds, the client never has any
failures.
I also observe this kind of failure when doing a buildworld or installworld on
the client while the server is busy -- I get strange random failures reading
files that cause the build or install to fail.
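For anyone trying to reproduce this, the pattern that triggers it for me is
roughly the following (the jail, ports tree, and list file names are
placeholders):

  # on the server: generate sustained ZFS and CPU load
  poudriere bulk -j 10amd64 -p default -f /usr/local/etc/poudriere.list

  # on a client, in sh: loop until the E070008 error appears
  while svn status /usr/src >/dev/null; do sleep 5; done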
My workaround is to avoid doing builds/installs on client machines when the NFS
server is busy with large jobs like building all packages, but there is
definitely something wrong here that I'd like to fix. I observe this on all the
local NFS clients. I have rebooted the server to try to clear this up, but it
did not help.
My intuition points to some sort of race condition between ZFS and NFS, but
digging deeper into that is well beyond my pay grade.
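In case it helps whoever picks this up, this is what I plan to capture on the
server the next time it happens (all stock tools; the nfsd thread count in the
last line is only a guess at a possible mitigation, not something I have
verified):

  # while poudriere is running:
  nfsstat -e -s                 # extended server-side NFS RPC statistics
  sysctl vfs.nfsd               # current nfsd tuning sysctls
  procstat -kk $(pgrep nfsd)    # kernel stacks: where nfsd threads block

  # possible mitigation to test: more nfsd threads, via /etc/rc.conf:
  nfs_server_flags="-u -t -n 64"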