Stale NFS file handles on 8.x amd64

Adam McDougall mcdouga9 at egr.msu.edu
Wed Dec 1 05:36:06 UTC 2010


On 11/30/10 08:33, Rick Macklem wrote:
>> I've been running dovecot 1.1 on FreeBSD 7.x for a while with a bare
>> minimum of NFS problems, but it got worse with 8.x. I have 2-4 servers
>> (usually just 2) accessing mail on a Netapp over NFSv3 via imapd.
>> Delivery is via procmail, which doesn't touch the dovecot metadata, and
>> webmail uses imapd. Client connections to imapd go to random servers
>> and I don't yet have solid means to keep certain users on certain
>> servers. I upgraded some of the servers to 8.x and dovecot 1.2 and ran
>> into Stale NFS file handle errors that corrupted the index/uidlist,
>> making inboxes appear empty when they were not. In some situations the
>> corrupt index had to be deleted manually. I first suspected dovecot 1.2,
>> since it was upgraded at the same time, but I downgraded to 1.1 and it's
>> doing the same thing. I don't really have a wealth of details to go on
>> yet, and I usually stay quiet until I do. Half the time it is difficult
>> to reproduce myself, so I've had to put it in production to get a feel
>> for progress. This only happens a dozen or so times per weekday, but I
>> feel the need to start taking bigger steps. I'll probably do what I can
>> to get IMAP back on a stable base (7.x?) and also try to debug 8.x on
>> the remaining servers. A binary search is within possibility if I can
>> reproduce the symptoms often enough, even if I have to put a test
>> server in production for a few hours.
>>
>> Any tips on where we could start looking, or alterations I could try
>> making such as sysctls to return to older behavior? It might be worth
>> noting that I've seen a considerable increase in traffic from my mail
>> servers since the 8.x upgrade timeframe, on the order of 5-10x as much
>> traffic to the NFS server. dovecot tries its hardest to flush out the
>> access cache when needed and it was working well enough since about
>> 1.0.16 (years ago). It seems like FreeBSD is what regressed in this
>> scenario. dovecot 2.x is going in a different direction from my
>> situation and I'm not ready to start testing that immediately if I can
>> avoid it as it will involve some restructuring.
>>
>> Thanks for any input. For now the following errors are about all I
>> have to go on:
>>
>> Nov 29 11:07:54 server1 dovecot: IMAP(user1): o_stream_send(/home/user1/Maildir/dovecot/private/control/.INBOX/dovecot-uidlist) failed: Stale NFS file handle
>> Nov 29 13:19:51 server1 dovecot: IMAP(user1): o_stream_send(/home/user1/Maildir/dovecot/private/control/.INBOX/dovecot-uidlist) failed: Stale NFS file handle
>> Nov 29 14:35:41 server1 dovecot: IMAP(user2): o_stream_send(/home/user2/Maildir/dovecot/private/control/.INBOX/dovecot-uidlist) failed: Stale NFS file handle
>> Nov 29 15:07:05 server1 dovecot: IMAP(user3): read(mail, uid=128990) failed: Stale NFS file handle
>>
>> Nov 29 11:57:22 server2 dovecot: IMAP(user4): open(/egr/mail/shared/vprgs/dovecot-acl-list) failed: Stale NFS file handle
>> Nov 29 14:04:22 server2 dovecot: IMAP(user5): o_stream_send(/home/user5/Maildir/dovecot/private/control/.INBOX/dovecot-uidlist) failed: Stale NFS file handle
>> Nov 29 14:27:21 server2 dovecot: IMAP(user6): o_stream_send(/home/user6/Maildir/dovecot/private/control/.INBOX/dovecot-uidlist) failed: Stale NFS file handle
>> Nov 29 15:44:38 server2 dovecot: IMAP(user7): open(/egr/mail/shared/decs/dovecot-acl-list) failed: Stale NFS file handle
>> Nov 29 19:04:54 server2 dovecot: IMAP(user8): o_stream_send(/home/user8/Maildir/dovecot/private/control/.INBOX/dovecot-uidlist) failed: Stale NFS file handle
>>
>> Nov 29 06:32:11 server3 dovecot: IMAP(user9): open(/egr/mail/shared/cmsc/dovecot-acl-list) failed: Stale NFS file handle
>> Nov 29 10:03:58 server3 dovecot: IMAP(user10): o_stream_send(/home/user10/Maildir/dovecot/private/control/.INBOX/dovecot-uidlist) failed: Stale NFS file handle
>>
> Others have made good suggestions. One more you could try is disabling negative
> name caching by setting the option "negnametimeo=0". Negative name caching was
> also added to FreeBSD7, but it is a fairly recent change, so your FreeBSD7 boxes
> may not have had it. I also think "dot-locking" and running without statd and
> lockd (you can mount with the "nolock" option) would be worth trying. And, of course,
> disabling attribute caching is mentioned on the web page others cited.
>
> Good luck with it, rick
> ps: Unfortunately the NFS protocol cannot support POSIX file system semantics, so
>      some apps can never run correctly on NFS mounted volumes. NFSv4 comes closer, but
>      it still can't provide full POSIX semantics.
>

I'll give negnametimeo=0 a try on one server starting tonight.  I'll be
busy tomorrow and don't want to risk making anything potentially worse
than it is yet.  I can't figure out how to disable the attr cache in
FreeBSD.  Neither suggestion seems to be valid, and years ago when I
looked into it I got the impression that you can't, but I'd love to be
proven wrong.  I'll try dotlock when I can.  Would disabling statd and
lockd be the same as using nolock on all mounts?  The vacation binary is
the only thing I can think of that might use lockd, and I'm not sure how
well it would cope without it, which is how I discovered I needed lockd
in the first place.  Also, if disabling lockd shows an improvement, could
it lead to further investigation or is it just a workaround?  Just trying
to understand the possibilities better.  I know ESTALE means the file
vanished, but for the files I had an error on, it is expected that
multiple systems are going to spontaneously replace the file.  Thanks.
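
To illustrate the ESTALE point: when a file is expected to be replaced by another host, an application can treat ESTALE as "reopen by path and try again" rather than as a hard failure. The sketch below is only an illustration of that idea, not dovecot's code. The function name, the retry count and the `opener` parameter are made up for the example, and whether a reopen really yields a fresh handle depends on the client's name and attribute caching.

```python
import errno


def read_with_estale_retry(path, retries=3, opener=open):
    """Read a whole file, reopening it by path if the handle goes stale.

    `opener` is injectable only so the retry logic can be exercised
    without a real NFS mount; it defaults to the built-in open().
    """
    last_error = None
    for _ in range(max(1, retries)):
        try:
            # Reopening by path asks the client to look the name up again,
            # so a file replaced on the server can be picked up.
            with opener(path, "rb") as f:
                return f.read()
        except OSError as e:
            if e.errno != errno.ESTALE:
                raise  # unrelated failures are not retried
            last_error = e
    # Every attempt hit ESTALE: report the last one to the caller.
    raise last_error
```

A caller would use it as `data = read_with_estale_retry(some_path)`. Retrying only hides the symptom for reads; it does nothing for a write that was already in flight on the stale handle.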


More information about the freebsd-stable mailing list