nfsd stuck in *rc_lock state

Mon Jan 10 21:06:56 UTC 2011

> Hello Rick,
> 
> On 11.11.2010 23:54, Rick Macklem wrote:
> > That patch is "self contained", so I think it should be fine to
> > apply it to an 8.0 server.
> >
> > You might also want
> >     http://people.freebsd.org/~rmacklem/freebsd8.0-patches/freebsd8-svc-mbufleak.patch
> > which plugged an mbuf leak in the regular FreeBSD8.0 server.
> >
> > Good luck with it, rick
> 
> The patch fixes the 100% CPU utilization, but we have now twice had the
> problem that all boxes lost their connection to the NFS server (/home not
> responding) while nfsd was at about 1% CPU.
> 
> top did not show anything strange here:
> 
>   PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
>   703 root      55    0  4772K  1384K RUN    5 329:12  1.37% {nfsd: service}
>   703 root      56    0  4772K  1384K rpcsvc 0 326:41  0.59% {nfsd: service}
>   703 root      52    0  4772K  1384K rpcsvc 6 326:28  0.29% {nfsd: service}
>   703 root      60    0  4772K  1384K rpcsvc 5 328:42  0.00% {nfsd: master}
>   703 root      54    0  4772K  1384K rpcsvc 0 327:44  0.00% {nfsd: service}
>   703 root      53    0  4772K  1384K rpcsvc 1 327:37  0.00% {nfsd: service}
>   703 root      54    0  4772K  1384K rpcsvc 6 326:51  0.00% {nfsd: service}
>   703 root      57    0  4772K  1384K rpcsvc 2 326:44  0.00% {nfsd: service}
>   703 root      50    0  4772K  1384K rpcsvc 1 326:20  0.00% {nfsd: service}
>   703 root      71    0  4772K  1384K rpcsvc 2 323:11  0.00% {nfsd: service}
>   703 root      47    0  4772K  1384K rpcsvc 7 321:11  0.00% {nfsd: service}
>   703 root      46    0  4772K  1384K tx->tx 2 320:00  0.00% {nfsd: service}
> 
> There was nothing special in the logfiles, either.
> How should we debug such a situation?
> 
First off, I hope you don't mind me adding the mailing
list as a cc. I'd like this stuff captured in the archive
for others to see. (If people don't like the noise, I'll
take the heat:-)

Ok, I'm sure others have better techniques, but here's how
I would start trying to resolve the above. All of these steps
are done while the server is stuck.
1 - Make sure the network is still functioning for other
    things like ssh.
2 - Do a "ps axHlww" and look at all the nfsd threads. I
    am primarily interested in the MWCHAN field.
    If it is:
    rpcsvc - the thread is just waiting for an RPC (normal)
    ufs or zfs - waiting for a vnode lock on the underlying
        file system
    anything else - I need to look in the kernel sources for
        the "sleep" with that argument.
    If I can't easily explain what all the nfsd threads are
    waiting for, wading through a "procstat -ka" is my next
    step. (I find this rather painful, so I tend to delay doing
    this as long as possible.:-)
3 - Do a "nfsstat -s" repeatedly and see if any of the counters
    are increasing.
4 - Fire up a "tcpdump" and see if there is any NFS traffic.
    (If there is, I'll capture it and put it in wireshark.)
5 - Do a "vmstat -z | fgrep mbuf" and look at the mbuf allocation.
    (If the machine is running out of mbufs, all sorts of quirky
     behaviour is possible.)
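
For step 2, eyeballing a long "ps axHlww" listing gets tedious, so
here is a minimal sketch that tallies the nfsd threads by MWCHAN.
It is only an illustration, not something from the patch: the
function names are made up, and it assumes the listing starts with
a header line containing an MWCHAN column, that COMMAND is the last
column, and that every column before COMMAND is a single non-empty
token.

```python
import subprocess
from collections import Counter

# Rough reading of each wait channel, taken from the list above.
KNOWN = {
    "rpcsvc": "idle, waiting for an RPC (normal)",
    "ufs": "waiting for a vnode lock on the underlying file system",
    "zfs": "waiting for a vnode lock on the underlying file system",
}


def nfsd_wait_channels(ps_text):
    """Count nfsd threads per MWCHAN value in `ps axHlww` output.

    Assumption: the first non-blank line is the header, it names an
    MWCHAN column, and COMMAND is the last column.
    """
    lines = [ln for ln in ps_text.splitlines() if ln.strip()]
    if not lines:
        return Counter()
    header = lines[0].split()
    col = header.index("MWCHAN")  # ValueError if the column is missing
    counts = Counter()
    for ln in lines[1:]:
        # Limit the split so COMMAND keeps its embedded spaces.
        fields = ln.split(None, len(header) - 1)
        if len(fields) < len(header):
            continue  # short or malformed row, skip it
        if "nfsd" in fields[-1]:
            counts[fields[col]] += 1
    return counts


def report(counts):
    """Return one text line per wait channel, most common first."""
    unknown = "look up the sleep with this argument in the kernel sources"
    return [
        "%-10s %3d  %s" % (chan, n, KNOWN.get(chan, unknown))
        for chan, n in counts.most_common()
    ]


def live_report():
    """Run ps on the server itself and return the report lines."""
    out = subprocess.run(
        ["ps", "axHlww"], capture_output=True, text=True, check=True
    )
    return report(nfsd_wait_channels(out.stdout))
```

Calling live_report() on the stuck server and printing the lines
gives a quick summary. Anything that is not rpcsvc, ufs or zfs is
the part worth chasing with "procstat -ka".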

The top output above doesn't tell me much, although I do wonder
what mbuf usage looks like. If you haven't applied the mbuf leak
patch mentioned in the quoted message above, you should do that.
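
To make the mbuf check less of a judgment call, here is a rough
sketch that picks the mbuf zones out of "vmstat -z" output and
flags any zone that has recorded allocation failures or is close
to its limit. The function names and the 90% threshold are my own
inventions for illustration, and the parser assumes each zone line
reads "name: SIZE, LIMIT, USED, FREE, REQUESTS, FAILURES", so check
that against the header your release prints.

```python
def mbuf_zones(vmstat_text):
    """Parse the mbuf lines of `vmstat -z` into a dict keyed by zone name.

    Assumption: each zone line looks like
        name: SIZE, LIMIT, USED, FREE, REQUESTS, FAILURES[, ...]
    """
    zones = {}
    for ln in vmstat_text.splitlines():
        name, sep, rest = ln.partition(":")
        if not sep or "mbuf" not in name:
            continue  # header, blank line, or a non-mbuf zone
        try:
            nums = [int(x) for x in rest.split(",")]
        except ValueError:
            continue  # not a line of comma-separated integers
        if len(nums) < 6:
            continue
        zones[name.strip()] = {
            "limit": nums[1],
            "used": nums[2],
            "free": nums[3],
            "failures": nums[5],
        }
    return zones


def suspicious(zones, near_full=0.9):
    """Return names of zones with failures or close to their limit.

    near_full is an arbitrary threshold chosen for this sketch; a
    LIMIT of 0 is treated as "no limit".
    """
    bad = []
    for name, z in sorted(zones.items()):
        full = z["limit"] > 0 and z["used"] >= near_full * z["limit"]
        if z["failures"] > 0 or full:
            bad.append(name)
    return bad
```

Feed it the text captured from "vmstat -z" while the server is
stuck. An empty list from suspicious() doesn't prove mbufs are
fine, but a non-empty one is a good reason to look harder.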

I don't know if this helps, but... rick