FreeBSD NFS client goes into infinite retry loop
Steve Polyack
korvus at comcast.net
Sat Mar 20 02:41:03 UTC 2010
On 3/19/2010 9:32 PM, Rick Macklem wrote:
>
> On Fri, 19 Mar 2010, Steve Polyack wrote:
>
>>
>> To anyone who is interested: I did some poking around with DTrace,
>> which led me to the nfsiod client code.
>> In src/sys/nfsclient/nfs_nfsiod.c:
>>     } else {
>>         if (bp->b_iocmd == BIO_READ)
>>             (void) nfs_doio(bp->b_vp, bp, bp->b_rcred, NULL);
>>         else
>>             (void) nfs_doio(bp->b_vp, bp, bp->b_wcred, NULL);
>>     }
>>
>
> If you look at nfs_doio(), it decides whether or not to mark the buffer
> invalid, based on the return value it gets. Some (EINTR, ETIMEDOUT, EIO)
> are not considered fatal, but the others are. (When the async I/O
> daemons call nfs_doio(), they are threads that couldn't care less if
> the underlying I/O op succeeded. The outcome of the I/O operation
> determines what nfs_doio() does with the buffer cache block.)
I was looking at this and noticed the above after my last post.
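For anyone following along, here is a rough sketch of the rule Rick describes, as I understand it. This is not the kernel code: classify_write_result(), write_error_is_transient() and the buf_outcome enum are names I made up for illustration, and the real nfs_doio() works on struct buf flags rather than returning an enum.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not a kernel function: mirrors the rule quoted
 * above, where EINTR, ETIMEDOUT and EIO are not considered fatal.
 */
bool
write_error_is_transient(int error)
{
	return (error == EINTR || error == ETIMEDOUT || error == EIO);
}

/* Hypothetical outcomes for a buffer after a write-back attempt. */
enum buf_outcome { BUF_DONE, BUF_REDIRTY, BUF_INVALIDATE };

enum buf_outcome
classify_write_result(int error)
{
	if (error == 0)
		return (BUF_DONE);		/* write succeeded */
	if (write_error_is_transient(error))
		return (BUF_REDIRTY);		/* keep the data, try again later */
	return (BUF_INVALIDATE);		/* fatal: stop retrying this buffer */
}
```

Under this rule an EIO reply keeps the buffer dirty and queued for another attempt, while something like ESTALE would end the retries.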
>>
>> The result is that my problematic repeatable circumstance begins
>> logging "nfssvc_iod: iod 0 nfs_doio returned errno: 5" (corresponding
>> to NFSERR_INVAL?) for each repetition of the failed write. The only
>> things triggering this are my failed writes. I can also see the
>> nfsiod0 process waking up each iteration.
>>
>
> Nope, errno 5 is EIO and that's where the problem is. I don't know why
> the server is returning EIO after the file has been deleted on the
> server (I assume you did that when running your little shell script?).
Yes, while running the simple shell script I simply deleted the file on
the NFS server itself.
>> Do we need some kind of "retry x times then abort" logic within
>> nfssvc_iod(), or does this belong in the subsequent functions, such
>> as nfs_doio()? I think it's best to avoid these sorts of infinite
>> loops which have the potential to take out the system or overload the
>> network due to dumb decisions made by unprivileged users.
>>
> Nope, people don't like data not getting written back to a server when
> it is slow or temporarily network partitioned. The only thing that should
> stop a client from retrying a write back to the server is a fatal error
> from the server that says "this won't ever succeed".
>
> I think we need to figure out if the server is really returning EIO
> (NFS3ERR_IO in wireshark) or if the server is sending NFS3ERR_STALE and
> the client is somehow munging that into EIO, causing the confusion.
This makes sense. According to wireshark, the server is indeed
transmitting "Status: NFS3ERR_IO (5)". Perhaps this should be STALE
instead; it sounds more correct than marking it a general IO error.
Also, the NFS server is serving its share off of a ZFS filesystem, if it
makes any difference. I suppose ZFS could be talking to the NFS server
threads with some mismatched language, but I doubt it.
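To illustrate why the status code matters here, below is a small simulation, not client code. The status values are the ones defined for NFSv3 in RFC 1813; status_is_retryable() and simulate_writeback() are hypothetical names, and the cap parameter exists only so the simulation terminates (as far as I can tell the real client has no such limit).

```c
#include <stdbool.h>

/* A few NFSv3 status values, as defined by RFC 1813. */
enum nfs3_status_subset {
	NFS3_OK = 0,
	NFS3ERR_IO = 5,
	NFS3ERR_STALE = 70
};

/*
 * Hypothetical: would the client retry after this reply? Only
 * NFS3ERR_IO is modelled as retryable, matching what I am seeing.
 */
bool
status_is_retryable(int status)
{
	return (status == NFS3ERR_IO);
}

/*
 * Simulate write-back against a server that always replies with
 * `status`. Returns the number of attempts made before giving up,
 * succeeding, or hitting `cap`.
 */
int
simulate_writeback(int status, int cap)
{
	int attempts = 0;

	while (attempts < cap) {
		attempts++;
		if (status == NFS3_OK || !status_is_retryable(status))
			break;	/* success or fatal error: stop */
	}
	return (attempts);
}
```

With a server that keeps answering NFS3ERR_IO the loop runs until the cap, which corresponds to the endless retries I am seeing; with NFS3ERR_STALE it stops after a single attempt.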
Thanks for the informative response,
Steve