NFSv3 + 8.1 + rpc.[lockd|statd] issues

Wed Oct 13 13:07:24 UTC 2010

Hey folks,

We have a machine running FreeBSD 8.1-RELEASE-p1 acting as an NFS server for 3 ZFS file systems on an external enclosure. A number of machines running 4.11, 7.1, and 8.x act as NFS clients to this server. Running dmesg on the NFS server shows no errors at all, but the three different clients show different errors. On panther (the 7.1 client below), a 48MB file was reported last night to have taken about 90 minutes to transfer.
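For scale, and assuming "MB" here means 2^20 bytes, that transfer works out to roughly 9 KiB/s:

$$\frac{48 \times 2^{20}\ \text{bytes}}{90 \times 60\ \text{s}} = \frac{50\,331\,648\ \text{bytes}}{5400\ \text{s}} \approx 9321\ \text{bytes/s} \approx 9.1\ \text{KiB/s}$$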

I'm working on upgrading the 7.1 system to 8.1 now, so I'm less concerned about that one, but the rpcbind errors that appear on both the 7.1 and 8.1 clients are causing some of our applications to dump core.

Any help is appreciated.

=== Data below ===

ecrist at jaguar-1:~-> date
Wed Oct 13 07:33:46 CDT 2010
ecrist at jaguar-1:~-> dmesg
ecrist at jaguar-1:~-> uname -a
FreeBSD jaguar-1.claimlynx.com 8.1-RC2 FreeBSD 8.1-RC2 #1: Wed Jul 14 11:34:02 CDT 2010     root at jaguar-1.claimlynx.com:/usr/obj/usr/src/sys/GENERIC-CARP  amd64
ecrist at jaguar-1:~-> uptime
 7:33AM  up 83 days,  8:42, 2 users, load averages: 1.08, 1.25, 0.94
ecrist at jaguar-1:~->

On the clients, however, many are reporting assorted problems. The 7.1 system reports the following:

ecrist at panther:~-> date
Wed Oct 13 07:34:43 CDT 2010
ecrist at panther:~-> dmesg
...
Can't start NLM - unable to contact NSM
NLM: failed to contact remote rpcbind, stat = 5, port = 28416
NLM: failed to contact remote rpcbind, stat = 5, port = 28416
NLM: failed to contact remote rpcbind, stat = 5, port = 28416
nfs server jaguar.stor:/array/production: not responding
nfs server jaguar.stor:/array/production: is alive again
nfs server jaguar.stor:/array/production: not responding
...
ecrist at panther:~-> uname -a
FreeBSD panther.claimlynx.com 7.1-RELEASE-p3 FreeBSD 7.1-RELEASE-p3 #2: Sun Mar 22 08:21:50 CDT 2009     root at cougar.claimlynx.com:/usr/obj/usr/src/sys/SMP-ASR  i386
ecrist at panther:~-> uptime
 7:34AM  up 30 days, 16:13, 4 users, load averages: 0.97, 1.00, 0.91
ecrist at panther:~-> 
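The `stat` and `port` numbers in the NLM lines can be decoded by hand. Assuming `stat` is a Sun RPC `enum clnt_stat` value, 5 is `RPC_TIMEDOUT`. The port 28416 is 0x6F00, which is 111 (0x006F, the standard rpcbind port) with its two bytes swapped. One plausible reading, not confirmed here, is that the kernel logs the port in network byte order, meaning the client was trying the normal rpcbind port on the server and timing out. A small Python sketch of that decoding; the regex, the partial status table (values 0 to 12 only), and the function names are illustrative and not part of any FreeBSD tool:

```python
import re

# Partial map of Sun RPC `enum clnt_stat` values (0-12 only; higher codes omitted).
CLNT_STAT = {
    0: "RPC_SUCCESS",
    1: "RPC_CANTENCODEARGS",
    2: "RPC_CANTDECODERES",
    3: "RPC_CANTSEND",
    4: "RPC_CANTRECV",
    5: "RPC_TIMEDOUT",
    6: "RPC_VERSMISMATCH",
    7: "RPC_AUTHERROR",
    8: "RPC_PROGUNAVAIL",
    9: "RPC_PROGVERSMISMATCH",
    10: "RPC_PROCUNAVAIL",
    11: "RPC_CANTDECODEARGS",
    12: "RPC_SYSTEMERROR",
}

# Matches the NLM kernel message exactly as it appears in the dmesg output above.
NLM_RE = re.compile(r"NLM: failed to contact remote rpcbind, stat = (\d+), port = (\d+)")


def swap16(value: int) -> int:
    """Swap the two bytes of a 16-bit value (network order <-> little-endian host order)."""
    return ((value & 0xFF) << 8) | (value >> 8)


def decode_nlm_line(line: str):
    """Return (stat_name, printed_port, byte_swapped_port), or None if the line doesn't match."""
    m = NLM_RE.search(line)
    if m is None:
        return None
    stat, port = int(m.group(1)), int(m.group(2))
    return CLNT_STAT.get(stat, f"unknown({stat})"), port, swap16(port)


print(decode_nlm_line("NLM: failed to contact remote rpcbind, stat = 5, port = 28416"))
# → ('RPC_TIMEDOUT', 28416, 111)
```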

Our 4.11 system:

ecrist at puma:~-> date
Wed Oct 13 07:38:09 CDT 2010
ecrist at puma:~-> dmesg
got bad cookie vp 0xe93fd240 bp 0xcfa2d2ec
got bad cookie vp 0xe859e740 bp 0xcfa96644
...
nfs server jaguar.stor:/array/production: not responding
nfs server jaguar.stor:/array/production: is alive again
nfs server jaguar.stor:/array/archive: not responding
nfs server jaguar.stor:/array/archive: is alive again
nfs server jaguar.stor:/array/archive: not responding
nfs server jaguar.stor:/array/archive: is alive again
nfs server jaguar.stor:/array/archive: not responding
nfs server jaguar.stor:/array/production: not responding
nfs server jaguar.stor:/array/archive: is alive again
nfs server jaguar.stor:/array/production: is alive again
nfs server jaguar.stor:/array/archive: not responding
nfs server jaguar.stor:/array/archive: is alive again
nfs server jaguar.stor:/array/production: not responding
nfs server jaguar.stor:/array/production: is alive again
nfs server jaguar.stor:/array/production: not responding
...
ecrist at puma:~-> uname -a
FreeBSD puma.claimlynx.com 4.11-RELEASE-p2 FreeBSD 4.11-RELEASE-p2 #1: Wed Apr 13 18:25:25 CDT 2005     drue at puma.claimlynx.com:/usr/obj/usr/src/sys/PUMA  i386
ecrist at puma:~-> uptime
 7:38AM  up 30 days, 15:27, 1 user, load averages: 0.02, 0.02, 0.00
ecrist at puma:~-> 
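To see which export flaps most often on a given client, the "not responding" events can be counted per `server:path` from a saved dmesg dump. This is a hypothetical helper written against the message format quoted above; it assumes one kernel message per line and ignores everything else (such as the "got bad cookie" lines):

```python
import re
from collections import Counter

# Matches the kernel messages quoted above, e.g.
# "nfs server jaguar.stor:/array/archive: not responding"
NFS_RE = re.compile(r"^nfs server (\S+): (not responding|is alive again)$")


def count_nfs_flaps(dmesg_text: str) -> Counter:
    """Count 'not responding' events per server:path in a dmesg dump."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = NFS_RE.match(line.strip())
        if m and m.group(2) == "not responding":
            counts[m.group(1)] += 1
    return counts


sample = """nfs server jaguar.stor:/array/production: not responding
nfs server jaguar.stor:/array/production: is alive again
nfs server jaguar.stor:/array/archive: not responding
nfs server jaguar.stor:/array/archive: is alive again
nfs server jaguar.stor:/array/archive: not responding"""
print(count_nfs_flaps(sample))
```

Run against the full output of `dmesg` on each client, this gives a rough per-mount comparison of how often each export stops responding.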

And, finally, an 8.1 system:

ecrist at puma-2:~-> date
Wed Oct 13 07:39:27 CDT 2010
ecrist at puma-2:~-> dmesg
...
NLM: failed to contact remote rpcbind, stat = 5, port = 28416
NLM: failed to contact remote rpcbind, stat = 5, port = 28416
ecrist at puma-2:~-> uname -a
FreeBSD puma-2.claimlynx.com 8.1-RELEASE FreeBSD 8.1-RELEASE #2: Mon Aug  2 12:50:40 CDT 2010     root at jaguar-1.claimlynx.com:/usr/obj/usr/src/sys/GENERIC-CARP  amd64
ecrist at puma-2:~-> uptime
 7:39AM  up 70 days, 18:25, 3 users, load averages: 0.00, 0.00, 0.00
ecrist at puma-2:~->
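One way to tell "rpcbind is down on the server" apart from "packets are being lost on the way" is to send rpcbind a NULL call directly from an affected client. The sketch below hand-encodes an ONC RPC v2 CALL (program 100000, version 2, procedure 0, AUTH_NULL credential and verifier, per RFC 5531) and sends it over UDP to port 111. It is similar in spirit to pinging the service with `rpcinfo`, but the function names and the 5-second timeout are assumptions, not anything from the setup above. It only checks rpcbind reachability; it does not exercise rpc.statd or rpc.lockd.

```python
import os
import socket
import struct

RPCBIND_PROG = 100000  # portmapper/rpcbind program number
RPCBIND_VERS = 2       # portmapper protocol version 2
RPCBIND_PORT = 111


def build_null_call(xid: int) -> bytes:
    """Encode an ONC RPC CALL for procedure 0 (NULL) with AUTH_NULL cred/verf."""
    return struct.pack(
        ">10I",
        xid,
        0,             # msg_type = CALL
        2,             # RPC protocol version
        RPCBIND_PROG,
        RPCBIND_VERS,
        0,             # procedure 0 = NULL
        0, 0,          # credential: AUTH_NULL, zero-length body
        0, 0,          # verifier: AUTH_NULL, zero-length body
    )


def is_successful_reply(data: bytes, xid: int) -> bool:
    """True if data is a REPLY / MSG_ACCEPTED / SUCCESS matching xid."""
    if len(data) < 24:
        return False
    rxid, msg_type, reply_stat, _flavor, verf_len = struct.unpack(">5I", data[:20])
    if rxid != xid or msg_type != 1 or reply_stat != 0:
        return False
    offset = 20 + ((verf_len + 3) & ~3)  # verifier body is padded to 4 bytes
    if len(data) < offset + 4:
        return False
    (accept_stat,) = struct.unpack(">I", data[offset:offset + 4])
    return accept_stat == 0


def probe_rpcbind(host: str, timeout: float = 5.0) -> bool:
    """Send one NULL call to host:111/udp; True if a successful reply arrives in time."""
    xid = struct.unpack(">I", os.urandom(4))[0]
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_null_call(xid), (host, RPCBIND_PORT))
        try:
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return False  # comparable to RPC_TIMEDOUT in the NLM messages
    return is_successful_reply(data, xid)
```

For example, `probe_rpcbind("jaguar.stor")` run repeatedly from puma-2 would show whether rpcbind answers at all, and how often it times out. A single UDP datagram with no retry is deliberately strict, so occasional `False` results point at packet loss rather than proving rpcbind is down.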