NFS hangs (7.3)

Thu Nov 18 12:49:43 UTC 2010

> I've got a problem on a server farm. Every now and then,
> some NFS mounts hang. This happens after a few days or
> after a few weeks. All processes trying to access files
> from the hanging mount go to state "D" and freeze. The
> only way to resolve the problem is to reboot the server.
> 
> "umount -f" also hangs and does not remove the hanging
> mount (even though it disappears from the output of the
> mount(8) command).
> 
> Here's one example from an attempt to run df(1) which
> also hangs:
> 
> ps -uww:
> USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
> root 61930 0.0 0.0 5728 1280 p4- D 5:15PM 0:00.01 /bin/df
> 
> ps -lww:
> UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
> 0 61930 1 0 -4 0 5728 1280 nfs D p4- 0:00.01 /bin/df
> 
It would appear that the root vnode for the client mount
point is locked for some reason. Here are a couple of possible
explanations:
1 - An infrequently executed code path doesn't VOP_UNLOCK()/vput()
    as it should. This seems relatively unlikely, since others are
    using the client without difficulties, but it might be an error
    case that only shows up for your environment.
2 - Another thread is holding the lock while stuck waiting for something
    else. The most obvious "something else" would be an RPC reply from
    the server. (A locking deadlock related to the spawning of new nfsiod
    threads, as mentioned below, could be another?)

I'd suggest running "ps axHl" when this happens and then looking for a
thread that is waiting for an RPC reply. I'd also suggest running
"nfsstat -c" several times over a few minutes, to see if any of the
counts are changing.
Also, you can do "tcpdump -w xxx -s 0 host <nfs-server>" on the client
for a while and then look at "xxx" in wireshark (it knows NFS packets)
and see if there is any net traffic to/from the server. (This will tell
you if it is a problem related to an RPC that is in progress vs something
else.) It will also tell you if it is using TCP (or you can "netstat -a"
to see if TCP connections are there for the NFS mounts).
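
As a rough sketch of how the first two checks above could be scripted:
the helper names find_d_state and sample are made up for illustration,
and the filter assumes the ps output has a header row with PID, MWCHAN
and STAT columns, as in the "ps -lww" output quoted earlier.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of any FreeBSD tool.

# Read "ps -l"-style output on stdin and print "PID MWCHAN STAT" for every
# process or thread whose state starts with "D" (uninterruptible wait).
# Columns are located from the header row instead of being hardcoded.
find_d_state() {
    awk '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "PID")    pid = i
                if ($i == "MWCHAN") wchan = i
                if ($i == "STAT")   stat = i
            }
            next
        }
        stat && $stat ~ /^D/ { print $pid, (wchan ? $wchan : "-"), $stat }
    '
}

# Run a command COUNT times, PAUSE seconds apart, with a marker line
# before each run so successive outputs can be compared by eye.
sample() {
    count=$1
    pause=$2
    shift 2
    i=1
    while [ "$i" -le "$count" ]; do
        echo "=== sample $i: $(date) ==="
        "$@"
        if [ "$i" -lt "$count" ]; then
            sleep "$pause"
        fi
        i=$((i + 1))
    done
}
```

Typical use would be "ps axHl | find_d_state" to list the stuck threads
with their wait channels, and "sample 5 60 nfsstat -c" to take five
snapshots of the client counters a minute apart.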

> 
> The machine is quite busy. The hangs seem to always occur
> in the night when lots of cron jobs are running. The machine
> has 221 NFS mounts and 26 nullfs mounts, and it has 26 jails,
> if that matters. All NFS shares are mounted from a virtual
> filer running on a NetApp filer. The mounts use the default
> settings, so they should be v3 TCP (this is the default,
> right?). The only extra option we use is -L in order to
> "fake" locking locally.
> 
> The machine is running FreeBSD 7.3-PRERELEASE-20100311 amd64.
> Updating is somewhat complicated in that server farm, so I
> haven't tried that so far because I'm not sure if it would
> help.
> 
I've only been working with 8/current, so I can't recall if
there have been any client fixes for 7 since that snapshot, except
that there was a very recent change w.r.t. the spawning of nfsiod
threads to avoid LORs (lock order reversals, i.e. potential deadlocks)
related to creating new kernel threads. I have no idea if one of these
deadlocks might be involved.
(Someone familiar with that might be able to comment?)

rick