NFS deadlock (unkillable nfsd and no mounts work)

Josh Carroll josh.carroll at gmail.com
Fri Nov 5 06:01:30 UTC 2010


Greetings!

I'm having a problem with nfsd hanging and not serving mount points,
during which time it cannot be killed. This problem started
happening sometime after November 2nd, since a kernel built from 11/2
sources does not exhibit it.

The kernel I'm currently running was built from SVN sources I checked
out this evening (around 5pm PDT on November 4th), but I had the same
problem yesterday around 9pm PDT after a csup (I switched to SVN today
to rule out a stale /usr/src from an out-of-sync cvsup mirror).
Here are the svn details:

Path: /usr/src
URL: svn://svn.freebsd.org/base/stable/8
Repository Root: svn://svn.freebsd.org/base
Repository UUID: ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
Revision: 214807
Node Kind: directory
Schedule: normal
Last Changed Author: jhb
Last Changed Rev: 214791
Last Changed Date: 2010-11-04 10:25:31 -0700 (Thu, 04 Nov 2010)

uname -a:

FreeBSD 8.1-STABLE FreeBSD 8.1-STABLE #0 r214807: Thu Nov  4 17:13:05
PDT 2010     root at pflog.net:/usr/obj/usr/src/sys/PFLOG  amd64

I have a Popcorn Hour, and as soon as I try to connect to my NFS mount
with it, the Popcorn Hour hangs and eventually pops up a message that
says "Request cannot be processed". Likewise, if I try to mount it from
my MacBook, the mount hangs for quite a while and then fails with
"operation timed out" or something like that.

During this hang, there is nothing in /var/log indicating a problem
nor any other indications something is wrong, except that none of my
NFS mounts work and the nfsd process will not die.

When I try to reboot the server, I wind up having to fsck all my
drives (except the ZFS one), since nfsd will not die. Even kill -9
doesn't kill it (it's showing as in the D state):

root 444 0.0 0.0 5812 1384 ?? D   9:30PM  0:00.00 nfsd: server (nfsd)
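
For anyone trying to reproduce this, here is a minimal sketch of how
the stuck process can be picked out of a ps listing. It assumes the
"ps aux" column layout of the line above, where the state is the 8th
field. The function name d_state_pids is made up for illustration.

```shell
# Read `ps aux`-style lines on stdin and print the PID (2nd field) of
# every process whose STAT column (8th field) starts with D, i.e. is in
# uninterruptible sleep. Assumes the default `ps aux` column order.
d_state_pids() {
    awk '$8 ~ /^D/ { print $2 }'
}

# Usage on a live system:
#   ps aux | d_state_pids
```

On FreeBSD, each PID this reports can then be passed to procstat -k
to see which kernel stack the thread is sleeping in, which is probably
the most useful thing to attach to a report like this one.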

And if I run /etc/rc.d/nfsd stop, it just says:

Stopping nfsd.
Waiting for PIDS: 444

And hangs there indefinitely. I tried to run a ktrace on both the
"nfsd: server" and "nfsd: master" processes (ktrace -i -d -f
nfsd_server.ktrace and ktrace -i -d -f nfsd_master.ktrace), but when I
try to connect to the NFS mount, ktrace doesn't capture anything and
the "nfsd: server" process goes to the "D" state and then I can't kill
it. If I try to kill the nfsd process BEFORE I attempt to mount
anything, it stops properly with /etc/rc.d/nfsd stop or with a kill
-TERM. After a single connection attempt, however, it can't be killed.

Hoping it was perhaps related to ZFS, I commented out the one ZFS
mount point in /etc/exports, but it still causes this deadlock in the
nfsd process. I even went as far as to comment out everything in
/etc/exports and create a new export on a different disk. That did
not help either; I get the same nfsd hang.

Another strange thing: if I run truss on the "nfsd: server" process
(the child) before I try to mount anything, the process exits
immediately, along with truss. If I look at what truss captured
for it, I see:

  411: sigprocmask(SIG_BLOCK,SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2,0x0) = 0 (0x0)
  411: sigprocmask(SIG_SETMASK,0x0,0x0)          = 0 (0x0)
  411: process exit, rval = 0

My kernel built from the 11/2 sources works fine and does not have
this nfsd lockup, so whatever broke must have changed sometime after
November 2nd.
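
Since there is a known-good and a known-bad point, the window could in
principle be narrowed by bisecting revisions. Here is a minimal sketch
of the arithmetic. The helper name next_rev and the GOOD variable are
hypothetical; the exact revision of the 11/2 sources is not recorded,
so GOOD stands in for it. Only r214807 (bad) is known.

```shell
# Print the midpoint between a known-good and a known-bad revision,
# using integer division. Both arguments are plain revision numbers.
next_rev() {
    good=$1
    bad=$2
    echo $(( (good + bad) / 2 ))
}

# Usage (GOOD is a placeholder for the last revision known to work):
#   svn update -r "$(next_rev "$GOOD" 214807)" /usr/src
```

After each rebuild and test, the midpoint becomes the new good or bad
bound, and the step is repeated until the two bounds are adjacent.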

My kernel is just GENERIC with a few additions:

include GENERIC

device      pf
device      pflog
device      coretemp
device      uchcom
device      sound
device      snd_hda
option      NETATALK
option      ALTQ
option      ALTQ_CBQ
option      ALTQ_HFSC
option      ALTQ_NOPCC
option      ALTQ_PRIQ
option      ALTQ_RED
option      ALTQ_RIO
option      COMPAT_LINUX32
option      GEOM_MIRROR
option      LIBICONV
option      LIBMCHAIN
option      NETSMB
option      NULLFS
option      SMBFS
option      UDF
nooption    INET6

If any other information is needed, please let me know. What are the
next things I should be doing to diagnose the problem? It seems
specific to nfsd, but I'm not sure how to prove it's that and not
something related or complementary to nfsd. For what it's worth,
rpcbind and mountd both stop fine; it's just the nfsd process that is
locking up.

Thanks in advance. Any advice on troubleshooting or root-causing
the issue would be appreciated.

Regards,
Josh

