Very weird NFS-related hang in 6-beta5.

Frank Mayhar frank at exit.com
Sat Oct 1 22:16:39 PDT 2005


I mount my /usr/ports, /usr/src, et al. from an NFS server.  Everything
seems to work fine except on one system where I've been seeing repeated
hangs.  Of course the system in question is my main desktop one, sigh.

At first I was using gigabit Ethernet (Intel Pro/1000, 82545GM chipset),
but the interface kept wedging hard, again only on this system (_not_ on
the server).  I upgraded the system to 6.0-beta5 to see if the interface
hangs went away.  (I did the upgrade by NFS-mounting /usr/src over my
parallel 100BaseTX network rather than the gigabit network.)  The upgrade
worked fine, but the hangs didn't disappear.  I planned to swap out the
gigabit card to see if the hardware was the problem, but in the interim
(not having a spare card lying around) I decided to do a complete
portupgrade over the 100BaseTX network.

This is where it gets weird.  Because of all the hangs I've run into, at
some point I made all the NFS mounts soft mounts.  I've been watching
these port builds, and from time to time, with no obvious pattern that I
can discern, NFS hangs.  The server seems perfectly healthy, and in fact
the _interface_ seems healthy, but the particular I/O in question just
hangs until it eventually times out because of the soft mount.

After it finally times out, things pick up and keep going again.  NFS
works fine for a while, then it hangs again.

I captured one of the hangs; this is from the client machine:

16:17:53.642822 IP realtime.exit.com.560259720 > jill.exit.com.nfs: 132 read fh 1070,983185/1114384 8192 bytes @ 1925120
16:17:53.643541 IP jill.exit.com.nfs > realtime.exit.com.560259720: reply ok 1472 read
16:18:11.679433 IP realtime.exit.com.560259720 > jill.exit.com.nfs: 132 read fh 1070,983185/1114384 8192 bytes @ 1925120
16:18:11.680142 IP jill.exit.com.nfs > realtime.exit.com.560259720: reply ok 1472 read

So the server gets the read and replies, but the client apparently never
sees the reply (despite the fact that it is coming in on the interface
and gets picked up by tcpdump).
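The pattern in that capture (a request, a reply on the wire, then the identical request resent about 18 seconds later) can be picked out of a longer trace mechanically. The Python sketch below is a hypothetical helper, not part of tcpdump or FreeBSD. It assumes one packet per line, as tcpdump prints it (the archive wrapped the lines), and that the number after the client hostname is the RPC transaction ID (XID), which is how tcpdump labels NFS requests. The function name `find_retransmits` and the 1-second threshold are illustrative choices.

```python
import re
from datetime import datetime

# One packet per line, as tcpdump prints it. For NFS, tcpdump shows the RPC
# transaction ID (XID) where the client port would normally appear,
# e.g. "realtime.exit.com.560259720".
TS = r"(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6})"
REQUEST_RE = re.compile(TS + r" IP \S+\.(?P<xid>\d+) > \S+\.nfs: ")
REPLY_RE = re.compile(TS + r" IP \S+\.nfs > \S+\.(?P<xid>\d+): reply ")


def _parse_ts(text):
    # Time of day only; a trace that crosses midnight is not handled here.
    return datetime.strptime(text, "%H:%M:%S.%f")


def find_retransmits(lines, min_gap=1.0):
    """Report NFS requests whose XID was sent again at least min_gap seconds later.

    Each result is (xid, gap_seconds, replies_seen_before_the_retry).
    """
    first_request = {}  # xid -> time of the first request carrying that xid
    replies = {}        # xid -> list of reply times seen on the wire
    retries = []        # (xid, first request time, retry time), in trace order
    for line in lines:
        req = REQUEST_RE.match(line)
        if req:
            ts, xid = _parse_ts(req.group("ts")), req.group("xid")
            if xid not in first_request:
                first_request[xid] = ts
            elif (ts - first_request[xid]).total_seconds() >= min_gap:
                retries.append((xid, first_request[xid], ts))
            continue
        rep = REPLY_RE.match(line)
        if rep:
            replies.setdefault(rep.group("xid"), []).append(_parse_ts(rep.group("ts")))

    results = []
    for xid, first, retry in retries:
        # Replies captured between the original request and its retransmission:
        # the server answered, yet the client still resent the request.
        answered = sum(1 for t in replies.get(xid, []) if first <= t < retry)
        results.append((xid, (retry - first).total_seconds(), answered))
    return results


TRACE = """\
16:17:53.642822 IP realtime.exit.com.560259720 > jill.exit.com.nfs: 132 read fh 1070,983185/1114384 8192 bytes @ 1925120
16:17:53.643541 IP jill.exit.com.nfs > realtime.exit.com.560259720: reply ok 1472 read
16:18:11.679433 IP realtime.exit.com.560259720 > jill.exit.com.nfs: 132 read fh 1070,983185/1114384 8192 bytes @ 1925120
16:18:11.680142 IP jill.exit.com.nfs > realtime.exit.com.560259720: reply ok 1472 read
"""

if __name__ == "__main__":
    for xid, gap, answered in find_retransmits(TRACE.splitlines()):
        print(f"xid {xid}: {answered} reply(ies) captured, request resent after {gap:.1f} s")
```

Run against a saved `tcpdump` text dump, any XID reported with one or more captured replies is a request the server answered but the client resent anyway, which is the symptom described above.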

I've attached the dmesg from the client in case it helps, but I doubt it
will.  I can't imagine that this is hardware, although I guess it
_might_ be.  It's just very weird.  Any hints as to the cause, or further
steps I can take to diagnose it, would be appreciated.
-- 
Frank Mayhar frank at exit.com     http://www.exit.com/
Exit Consulting                 http://www.gpsclock.com/
                                http://www.exit.com/blog/frank/


More information about the freebsd-hackers mailing list