NFS hangs (7.3)

Kostik Belousov kostikbel at gmail.com
Thu Nov 18 12:54:27 UTC 2010


On Thu, Nov 18, 2010 at 07:49:41AM -0500, Rick Macklem wrote:
> > I've got a problem on a server farm. Every now and then,
> > some NFS mounts hang. This happens after a few days or
> > after a few weeks. All processes trying to access files
> > from the hanging mount go to state "D" and freeze. The
> > only way to resolve the problem is to reboot the server.
> > 
> > "umount -f" also hangs and does not remove the hanging
> > mount (even though it disappears from the output of the
> > mount(8) command).
> > 
> > Here's one example from an attempt to run df(1) which
> > also hangs:
> > 
> > ps -uww:
> > USER    PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND
> > root  61930  0.0  0.0  5728  1280  p4- D     5:15PM   0:00.01 /bin/df
> > 
> > ps -lww:
> >   UID   PID  PPID CPU PRI NI   VSZ   RSS MWCHAN STAT  TT      TIME COMMAND
> >     0 61930     1   0  -4  0  5728  1280 nfs    D     p4-  0:00.01 /bin/df
> > 
> It would appear that the root vnode for the client mount
> point is locked for some reason. Here are a couple of possible
> explanations:
> 1 - An infrequently executed code path doesn't VOP_UNLOCK()/vput()
>     as it should. This seems relatively unlikely, since others are
>     using the client without difficulties, but it might be an error
>     case that only shows up for your environment.
> 2 - Another thread is holding the lock while stuck waiting for something
>     else. The most obvious "something else" would be an RPC reply from
>     the server. (A locking deadlock as mentioned below w.r.t. the spawning
>     of new nfsiod threads, could be another?)
> 
> I'd suggest a "ps axHl" when this happens, and then look for a thread that
> is waiting for an RPC reply. I'd also suggest "nfsstat -c" done several
> times over a few minutes, to see if any of the counts is changing.
> Also, you can do "tcpdump -w xxx -s 0 host <nfs-server>" on the client
> for a while and then look at "xxx" in wireshark (it knows NFS packets)
> and see if there is any net traffic to/from the server. (This will tell
> you if it is a problem related to an RPC that is in progress vs something
> else.) It will also tell you if it is using TCP (or you can "netstat -a"
> to see if TCP connections are there for the NFS mounts).
> 
> > 
> > The machine is quite busy. The hangs seem to always occur
> > in the night when lots of cron jobs are running. The machine
> > has 221 NFS mounts and 26 nullfs mounts, and it has 26 jails,
> > if that matters. All NFS shares are mounted from a virtual
> > filer running on a NetApp filer. The mounts use the default
> > settings, so they should be v3 TCP (this is the default,
> > right?). The only extra option we use is -L in order to
> > "fake" locking locally.
> > 
> > The machine is running FreeBSD 7.3-PRERELEASE-20100311 amd64.
> > Updating is somewhat complicated in that server farm, so I
> > haven't tried that so far because I'm not sure if it would
> > help.
> > 
> I've only been working with 8/current, so I can't recall if
> there have been any client fixes for 7 since then, except there
> was a very recent change w.r.t. spawning of nfsiod threads to
> avoid LORs (lock order reversals, i.e. potential deadlocks) related to
> creating new kernel threads. I have no idea if one of these deadlocks
> might be involved.
> (Someone familiar with that might be able to comment?)

The changes for nfsiod creation are definitely not in 7.3-prerelease.

To diagnose the issue, we could start with the output of ps axlHww
(already suggested by Rick) and procstat -ka.
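
With 221 NFS mounts the ps listing will be long, so it may help to filter
it down to the threads that are actually blocked. The following is only a
rough sketch, not part of any FreeBSD tool: the function name find_blocked
and the SAMPLE text are made up for illustration (SAMPLE reuses the
"ps -lww" line quoted above). It assumes the column layout shown in that
output and that no column is empty, so that splitting on whitespace lines
up with the header.

```python
def find_blocked(ps_output):
    """Return rows of `ps -l`-style output whose STAT starts with 'D'.

    Assumption: the first non-empty line is the header and every column
    except the last (COMMAND) is a single whitespace-free token.
    """
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    blocked = []
    for line in lines[1:]:
        # Split into as many fields as the header has; the final field
        # (COMMAND) keeps any embedded spaces.
        fields = line.split(None, len(header) - 1)
        if len(fields) != len(header):
            continue  # skip lines that do not match the header layout
        row = dict(zip(header, fields))
        if row.get("STAT", "").startswith("D"):
            blocked.append(row)
    return blocked


# Sample taken from the "ps -lww" output quoted earlier in this thread.
SAMPLE = """\
UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
0 61930 1 0 -4 0 5728 1280 nfs D p4- 0:00.01 /bin/df
"""

for row in find_blocked(SAMPLE):
    # Show the PID, the wait channel, and the command of each blocked row.
    print(row["PID"], row["MWCHAN"], row["COMMAND"])
```

For the sample this prints "61930 nfs /bin/df". To use it on a real
listing, save the ps output to a file and pass its contents to
find_blocked(); the MWCHAN column then shows what each blocked thread is
waiting on.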