FreeBSD NFS client/Linux NFS server issue

Fri Jan 22 20:37:54 UTC 2010

On Fri, 22 Jan 2010 14:37:48 -0500 (EST) Rick Macklem wrote:

>> --- nfs_bio.c.orig      2010-01-22 15:38:02.000000000 +0000
>> +++ nfs_bio.c   2010-01-22 15:39:58.000000000 +0000
>> @@ -1385,7 +1385,7 @@ again:
>>         */
>>        if (!gotiod) {
>>                iod = nfs_nfsiodnew();
>> -               if (iod != -1)
>> +               if ((iod != -1) && (nfs_iodwant[iod] == NULL))
>>                        gotiod = TRUE;
>>        }
>>
>
> Unfortunately, I don't think the above fixes the problem.
> If another thread that called nfs_asyncio() has "stolen" this "iod",
> it will have set nfs_iodwant[iod] to NULL (it is set non-NULL at #238)
> and it will remain NULL until the other thread is done with it.

I see. I missed this. Thanks.

>
> There should probably be some sort of three-way handshake between
> the code in nfs_asyncio() after calling nfs_nfsiodnew() and the
> code near the beginning of nfssvc_iod(), but I think the following
> somewhat cheesy fix might do the trick:
>
> 	if (!gotiod) {
> 		iod = nfs_nfsiodnew();
> 		if (iod != -1) {
> 			if (nfs_iodwant[iod] == NULL) {
> 				/*
> 				 * Either another thread has acquired this
> 				 * iod or I acquired the nfs_iod_mtx mutex
> 				 * before the new iod thread did in
> 				 * nfssvc_iod(). To be safe, go back and
> 				 * try again after allowing another thread
> 				 * to acquire the nfs_iod_mtx mutex.
> 				 */
> 				mtx_unlock(&nfs_iod_mtx);
> 				/*
> 				 * So long as mtx_lock() implements some
> 				 * sort of fairness, nfssvc_iod() should
> 				 * get nfs_iod_mtx here and set
> 				 * nfs_iodwant[iod] != NULL for the case
> 				 * where the iod has not been "stolen" by
> 				 * another thread for a different mount
> 				 * point.
> 				 */
> 				mtx_lock(&nfs_iod_mtx);
> 				goto again;
> 			}
> 			gotiod = TRUE;
> 		}
> 	}
>
> Does anyone else have a better solution?
> (Mikolaj, could you by any chance test this? You can test yours, but I
> think it breaks.)

Unfortunately, we observed this only on our production servers. A week ago we
made some configuration changes as a workaround: we reconfigured cron not to run
the scripts simultaneously, and added cron scripts that just periodically write a
line to a file on the NFS share (to "unlock" it if it is locked). We have not
observed the problem since then, and we would rather not experiment in
production. If I manage to produce a good test case in a test environment, I will
be able to test the patch, but I am not sure...

-- 
Mikolaj Golub