nfs send errors 32 and 35 on RELENG_4

Tue Jan 13 07:49:36 PST 2004

Hi,

For a while I have been seeing errors like the ones below on a cluster of i386
FreeBSD RELENG_4 hosts which mount a volume from a NetApp F825 filer using
NFSv3 over a mixture of UDP and TCP, depending on whether the host is on the
same LAN as the filer or not:

Jan 13 14:02:02 mese /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: not responding
Jan 13 14:02:03 mese /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: is alive again

The messages are logged with alarming regularity, but don't seem to have any
bearing on the performance or availability of the volume. My full findings are
in my initial post to freebsd-net, which has been archived here:

http://www.freebsd.org/cgi/getmsg.cgi?fetch=178585+184466+/usr/local/www/db/text/2004/freebsd-net/20040111.freebsd-net

However, more recently, and especially today, I am seeing errors which *are*
affecting the availability of the mount point on one of the hosts in question:

Jan 13 14:09:37 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
Jan 13 14:09:42 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
Jan 13 14:09:47 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
Jan 13 14:09:52 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
Jan 13 14:09:53 mese /kernel: nfs send error 32 for server 192.168.1.1:/vol/vol1/claramail
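
To gauge how often each code appears, log lines like the ones above can be
tallied by error number. This is a minimal sketch that assumes the syslog
format shown above; the name count_send_errors is hypothetical.

```python
import re
from collections import Counter

# Matches kernel log lines of the form:
#   ... /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
SEND_ERROR = re.compile(r"nfs send error (\d+) for server (\S+)")

def count_send_errors(lines):
    """Return a Counter keyed by (server, error number)."""
    counts = Counter()
    for line in lines:
        match = SEND_ERROR.search(line)
        if match:
            counts[(match.group(2), int(match.group(1)))] += 1
    return counts

# The five log lines quoted above.
sample = [
    "Jan 13 14:09:37 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail",
    "Jan 13 14:09:42 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail",
    "Jan 13 14:09:47 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail",
    "Jan 13 14:09:52 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail",
    "Jan 13 14:09:53 mese /kernel: nfs send error 32 for server 192.168.1.1:/vol/vol1/claramail",
]

for (server, code), n in sorted(count_send_errors(sample).items()):
    print(f"{server}: error {code} seen {n} time(s)")
```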

We are running version 1.60.2.6 of nfs_socket.c, which is generating this
message. Looking at the CVS Web Repository, that seems to be the latest version
for RELENG_4.

A quick Google search suggests that error 32 is 'OK' in the sense that the TCP
connection should be reestablished and things can pick up where they left
off[1], but I can't find what causes error 35. In any case, 35 seems to be the
more frequent error.

The symptoms on the hosts when these errors occur are:

 - processes accessing files on the remote volume get stuck in disk wait,
   specifically their state is 'nfsrcv'.
 - even when all processes accessing the volume are killed, and lsof shows no
   open files on the volume, "umount /vol" claims the device is busy.
 - a "umount -f" hangs and the umount process can't be killed.
 - however, after a "umount -f", /vol is not listed in "mount" or "df".
 - similarly, on then trying to mount the volume, "mount" hangs and can't be
   killed, and the volume does not appear in "mount" or "df". (In fact, df
   hangs too, presumably because it's trying to work out available space etc.)
 - a tcpdump between client and server doesn't show any NFS traffic at all
   being emitted by the client, although IP connectivity to the server is
   maintained, and other hosts are able to still talk NFS to it happily.
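
The first symptom can be checked mechanically by filtering ps output on the
wait channel. This is a minimal sketch that assumes the text comes from
"ps -axo pid,state,wchan,command"; the function name stuck_in_nfsrcv is
hypothetical and the sample input is invented for illustration.

```python
def stuck_in_nfsrcv(ps_output):
    """Return (pid, command) pairs whose wait channel is 'nfsrcv'.

    ps_output is the text printed by: ps -axo pid,state,wchan,command
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)         # pid, state, wchan, rest of line
        if len(fields) == 4 and fields[2] == "nfsrcv":
            stuck.append((int(fields[0]), fields[3]))
    return stuck

# Invented sample input for illustration only, not captured from the hosts.
sample = """\
  PID STAT WCHAN  COMMAND
  312 Is   select /usr/sbin/inetd -wW
 4021 D    nfsrcv imapd
 4030 D    nfsrcv ls -l /vol
"""

print(stuck_in_nfsrcv(sample))
```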

I tried to reboot the host in question to restore service, but it stayed
multi-user. The host was in a remote data centre so in the end it had to be
power cycled. The host wasn't on console so I wasn't able to determine why it
stayed multi-user.

I'm at a loss as to how to further debug this. It occurs to me that determining
what error 35 is would be helpful. :) I've looked in a book that I have
available[2], but it lists neither error 32 nor 35. Is there an up-to-date list
of NFSv3 errors anywhere? 
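
On the question of what 35 means, one possibility is that these are not NFSv3
protocol errors at all but errno values handed back by the socket send call
and logged as-is. That reading is an assumption about nfs_socket.c, not
something verified here. If it holds, FreeBSD's <sys/errno.h> defines 32 as
EPIPE and 35 as EAGAIN (also spelled EWOULDBLOCK). A minimal lookup sketch,
with a hypothetical table name and helper:

```python
# errno values as defined in FreeBSD's <sys/errno.h>. They are hard-coded
# because Python's errno module reports the numbering of whatever OS runs
# the script, which may differ from FreeBSD.
FREEBSD_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    35: ("EAGAIN", "Resource temporarily unavailable"),
}

def describe(code):
    """Return a one-line description of an errno value from the table."""
    name, text = FREEBSD_ERRNO.get(code, ("unknown", "not in this table"))
    return f"{code}: {name} ({text})"

for code in (32, 35):
    print(describe(code))
```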

At this stage, any and all advice on where to look and what data I can usefully
retrieve that would help analyse this problem would be gratefully received.

Cheers,

Ollie

1: http://lists.freebsd.org/pipermail/freebsd-hackers/2003-July/001988.html
2: NFS Illustrated, Brent Callaghan, First Printing, ISBN 0-201-32570-5

-- 
Oliver Cook    Systems Administrator, Claranet UK
ollie at uk.clara.net               +44 20 7903 3065