Stability issues after upgrading to 7.1 - NFS related?

Brian DeFreitas briandef at rescomp.berkeley.edu
Sat Jul 18 07:11:34 UTC 2009


Hello all,

We recently upgraded an NFS server from 7.0-p6 to 7.1-p6.  The following
Monday morning, we found the server's networking wedged, with console
error messages that strongly resemble those in this post [1].

To try the fixes mentioned there, we upgraded to 7-STABLE. This did not
help; the NFS server still wedges once or twice a day, requiring a soft
reboot (via console) at times and a hard reboot at others. Heavy NFS
load seems to trigger it.

Initially, we thought there might be a problem with rpc.statd because
we started seeing "RPC: Port mapper failure - RPC : Timed out" messages.
All the hosts that timed out were previously-working Linux (CentOS) NFS
clients.

We have IPsec configured in transport mode between all FreeBSD and Linux
NFS clients, but only see the RPC error for CentOS (not RHEL) hosts, and
see no errors from FreeBSD clients. Before the system wedges completely,
`top` reports that most nfsd processes are in the *ipsec state.

These are all the troubleshooting steps we have taken:

    - disabled NFS locking on the Linux NFS clients
        - RPC timed out messages still appear

    - set up RPC to use static ports for NFS on our CentOS clients
      (to work better with our firewalls, which needed no such
      rules before)
        - RPC timed out messages still appear

    - added 'rpc_lockd_enable="NO"' to /etc/rc.conf
        - after rebooting, `rpcinfo -p` showed no lock manager running,
          but the crashes persisted

    - added "nooptions NFSLOCKD" to the kernel configuration
        - this only made things crash faster (within a few minutes of
          boot, with very little NFS load)
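
The `rpcinfo -p` check from the third step could be scripted roughly as
follows. This is only a sketch: `has_service` and `check_lock_services`
are hypothetical helper names, and it assumes the usual `rpcinfo -p`
listing where the last column is the service name (`nlockmgr` for the
lock manager, `status` for rpc.statd).

```shell
#!/bin/sh
# Sketch: report whether the lock manager and statd are registered with
# the portmapper. Helper names are illustrative, not from any tool.

# Succeeds if the given service name appears in the last column of an
# `rpcinfo -p` style listing read from stdin.
has_service() {
    awk -v svc="$1" '$NF == svc { found = 1 } END { exit !found }'
}

# Reads an `rpcinfo -p` listing on stdin and prints one line per service.
check_lock_services() {
    listing=$(cat)
    for svc in nlockmgr status; do
        if printf '%s\n' "$listing" | has_service "$svc"; then
            echo "$svc: registered"
        else
            echo "$svc: not registered"
        fi
    done
}

# Intended use on the server (not executed here):
#   rpcinfo -p | check_lock_services
```

Running it once before and once after the rc.conf change would show
whether the lock manager really went away after the reboot.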

Unfortunately, one of the issues we've run into in debugging this
problem is the lack of useful logs and debugging information. Some info
we have managed to gather:

    - before one reboot, we noticed console messages about mbufs
      filling up.  Running `netstat -m` right before crashes seems to
      confirm this.
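
Since the `netstat -m` readings are lost when the box wedges, one option
is to log them periodically to disk. The following is a minimal sketch,
not something we have deployed: the log path, the interval, and the
`snapshot` helper are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: append timestamped `netstat -m` snapshots to a log file so the
# last readings before a wedge survive the reboot.
# LOGFILE and INTERVAL are illustrative defaults, not from our setup.
LOGFILE="${LOGFILE:-/var/log/mbuf-watch.log}"
INTERVAL="${INTERVAL:-60}"

# Print one snapshot: a UTC timestamp header followed by command output.
# Defaults to `netstat -m`; another command can be passed in instead.
snapshot() {
    if [ "$#" -eq 0 ]; then
        set -- netstat -m
    fi
    echo "=== $(date -u '+%Y-%m-%d %H:%M:%S') UTC ==="
    "$@" 2>&1
}

# Loop only when invoked with "run", so the file can also be sourced.
if [ "${1:-}" = "run" ]; then
    while :; do
        snapshot >> "$LOGFILE"
        sync    # flush to disk in case the machine wedges soon after
        sleep "$INTERVAL"
    done
fi
```

Started as `sh mbuf-watch.sh run &` (the script name is hypothetical), it
would leave a trail of mbuf usage leading up to each crash.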

If anyone could provide some insight into what's happening, or help
us get more debugging information, it would be very helpful.

[1] http://lists.freebsd.org/pipermail/freebsd-current/2009-May/006434.html

-- 
Brian DeFreitas
Lead Unix Systems Administrator
Network Infrastructure, RSSP-IT
UC Berkeley