Still getting NFS client locking up

Mon Nov 10 09:44:01 PST 2003

It seems Robert Watson wrote:
> How fast are your systems, speaking of which?  I live in the world of
> 300-500 mhz machines at work, and 300-800 mhz boxes at home.  If you're
> using multi-ghz boxes, that could well be the distinguishing factor
> between our configurations...

Server is a 533MHz VIA C3, clients are everything from a 300MHz PII to a 2.6GHz P4.

> Ok, here's the strategy I was planning to take once I could reproduce it:
> 
> (1) Attempt to further narrow down responsibility to client/server.  In
>     particular, see if an apparent hang on one client affects the other
>     clients. 

For me it's just the server end that fails; I've not seen the client hang.

> (2) Investigate Soren's report that killing and restarting nfsd on the
>     server would clear the hang.

Yup, that works; in fact I now have that in my crontab, run every minute,
to keep NFS from hosing my setup here.
NOTE: I also still need to ifconfig down/up my interfaces on some
boxes or the netstack will freeze (again done every minute from crontab).
However, when NFS locks up it seems totally unrelated, i.e. all other
network traffic works...

> (3) Look at stack traces of involved processes on both the client and
>     server: in particular, look at traces for any client blocked in NFS,
>     any nfsiod processes on the client, and the nfsd processes on the
>     server.  Also look at the wait channels on clients and servers for
>     these processes.  Particularly interested in whether nfsd processes
>     are blocked trying to grab locks.

Ok, will do..

> (4) Look at netstat information for NFS sockets, in particular, if the
>     buffers are full, or not being drained.  In particular, on the server,
>     is the input queue not being drained by nfsd worker threads? 

Netstat doesn't seem to give any hints or even useful info here;
any special commands you want the output from?

> (5) Try backing out src/sys/nfsserver/nfs_serv.c:1.137, which removed
>     another deadlock problem, but did change locking behavior in the NFS
>     server.

No change, I already tried that.

> (6) Look at packet traces between the client and server with ethereal,
>     which has pretty good NFS decoding.  Is the client retransmitting an
>     RPC to the server and the server just isn't responding, or is the
>     client failing to transmit?  At the point of the hang, what sorts of
>     RPCs are outstanding to the server?  In the past, we've seen "apparent
>     hangs" when some or another more obscure unusual error case on the NFS
>     server fails to respond to an RPC, which causes the client to "wait
>     forever".

I can try that easily, I'll get a trace to you later tonight...

> Things to look for: normally, idle nfsd and nfsiod processes have a WCHAN
> of "-" (ps -lax), which indicates they're blocked waiting for some event
> to kick them off.  If you see nfsd processes "hung" in another state, it's
> a good sign we've identified a server problem.  In the nfs client
> processes, "nfsrcvlk" typically indicates a process has sent out an RPC
> and is now waiting on a response.

I see the idle '-' wchan here when things go bad IIRC...
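
To check that properly instead of relying on memory, a rough sketch like the one below could be run over saved `ps -lax` output. It picks out nfsd/nfsiod processes whose wait channel is anything other than "-". The function name and sample text are made up for illustration, and it assumes the wait channel column is headed WCHAN or MWCHAN and that COMMAND is the last column.

```python
# Made-up sample in the style of `ps -lax` output; not real data.
SAMPLE = """\
  UID   PID  PPID CPU PRI NI   VSZ  RSS MWCHAN STAT  TT       TIME COMMAND
    0   303   301   0   4  0  1172  704 -      I     ??    0:12.34 nfsd: server (nfsd)
    0   304   301   0  -4  0  1172  704 ufs    D     ??    0:10.00 nfsd: server (nfsd)
    0    45     0   0  -1  0     0   12 -      DL    ??    0:00.00 [nfsiod 0]
 1001   812   800   0  96  0  2300 1500 pause  Ss    p0    0:00.10 -tcsh (tcsh)
"""


def find_busy_nfs_procs(ps_output, names=("nfsd", "nfsiod")):
    """Return (pid, wchan, command) for NFS processes not idle on '-'."""
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # The wait channel column is assumed to be called WCHAN or MWCHAN.
    wchan_col = next(
        (i for i, h in enumerate(header) if h in ("WCHAN", "MWCHAN")), None
    )
    if wchan_col is None or "PID" not in header or "COMMAND" not in header:
        raise ValueError("unexpected ps header: %r" % lines[0])
    pid_col = header.index("PID")
    cmd_col = header.index("COMMAND")

    busy = []
    for ln in lines[1:]:
        # Split only up to COMMAND so the command keeps its spaces.
        fields = ln.split(None, cmd_col)
        if len(fields) <= cmd_col:
            continue
        command = fields[cmd_col]
        # "nfsd: server (nfsd)" -> "nfsd", "[nfsiod 0]" -> "nfsiod"
        prog = command.split()[0].strip("[]():").rsplit("/", 1)[-1]
        if prog not in names:
            continue
        wchan = fields[wchan_col]
        if wchan != "-":
            busy.append((int(fields[pid_col]), wchan, command))
    return busy


for pid, wchan, command in find_busy_nfs_procs(SAMPLE):
    print(pid, wchan, command)
```

An empty result during a hang would match what I remember seeing, i.e. the nfsd processes sitting idle rather than blocked on a lock.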

-Søren