[RFC] Should vfs.nfsrv.async be implemented for new NFS server?

Sat Nov 19 15:34:20 UTC 2011

On Sat, 19 Nov 2011, Rick Macklem wrote:

> Bruce Evans wrote:
>> On Sat, 5 Nov 2011, Rick Macklem wrote:
>>
>> I was away on 5 Nov...
>>
>>> Josh Paetzel pointed out that vfs.nfsrv.async doesn't exist
>>> for the new NFS server.
>>
>> Excellent! It's existence for the old nfs server is a bug. The correct
>> way to enable async is to mount the client async. nfsclient only has
>> a tiny amount of support for this, but it seems to be suficient:
>>
>> % int
>> % ncl_writerpc(struct vnode *vp, struct uio *uiop, struct ucred *cred,
>> % int *iomode, int *must_commit, int called_from_strategy)
>> % {
>> % ...
>> % if (vp->v_mount->mnt_kern_flag & MNTK_ASYNC)
>> % *iomode = NFSWRITE_FILESYNC;
>> % if (error && NFS_ISV4(vp))
>> % error = nfscl_maperr(uiop->uio_td, error, (uid_t)0, (gid_t)0);
>> % return (error);
>> % }
>> ...
>>> ps: Here's the patch, just in case anyone is interested.
>>> --- fs/nfsserver/nfs_nfsdserv.c.sav 2011-11-05 10:29:38.000000000
>>> -0400
>>> +++ fs/nfsserver/nfs_nfsdserv.c 2011-11-05 10:37:40.000000000 -0400
>>> @@ -55,6 +55,11 @@ extern int nfs_rootfhset;
>>> extern int nfsrv_enable_crossmntpt;
>>> #endif /* !APPLEKEXT */
>>>
>>> +static int nfs_async = 0;
>>> +SYSCTL_DECL(_vfs_nfsd);
>>> +SYSCTL_INT(_vfs_nfsd, OID_AUTO, async, CTLFLAG_RW, &nfs_async, 0,
>>> + "Tell client that writes were synced even though they were not");
>>> +
>>> /*
>>>  * This list defines the GSS mechanisms supported.
>>>  * (Don't ask me how you get these strings from the RFC stuff like
>>> @@ -912,7 +917,13 @@ nfsrvd_write(struct nfsrv_descript *nd,
>>> 			goto out;
>>> 		NFSM_BUILD(tl, u_int32_t *, 4 * NFSX_UNSIGNED);
>>> 		*tl++ = txdr_unsigned(retlen);
>>> - if (stable == NFSWRITE_UNSTABLE)
>>> + /*
>>> + * If nfs_async is set, then pretend the write was FILESYNC.
>>> + * Warning: Doing this violates RFC1813 and runs a risk
>>> + * of data written by a client being lost when the server
>>> + * crashes/reboots.
>>> + */
>>> + if (stable == NFSWRITE_UNSTABLE && nfs_async == 0)
>>> 			*tl++ = txdr_unsigned(stable);
>>> 		else
>>> 			*tl++ = txdr_unsigned(NFSWRITE_FILESYNC);
>>
>> This is for the v3 and v4 cases, the same as in the old nfs server
>> where its
>> existence is clearly just a bug. This is missing support for the v2
>> case
>> where its existence is not so clearly just a bug. Here is the code for
>> the
>> v2 case in the old nfs server:
>>
> An NFSv2 write is always "FILESYNC". There is no reply argument to say
> otherwise. The concept of NFSWRITE_UNSTABLE was added for NFSv3.
>
>> % int
>> % nfsrv_write(struct nfsrv_descript *nfsd, struct nfssvc_sock *slp,
>> % struct mbuf **mrq)
>> % {
>> % ...
>> % if (v3) {
>> % tl = nfsm_dissect_nonblock(u_int32_t *, 5 * NFSX_UNSIGNED);
>> % off = fxdr_hyper(tl);
>> % tl += 3;
>> % stable = fxdr_unsigned(int, *tl++);
>> % } else {
>> % tl = nfsm_dissect_nonblock(u_int32_t *, 4 * NFSX_UNSIGNED);
>> % off = (off_t)fxdr_unsigned(u_int32_t, *++tl);
>> % tl += 2;
>> % if (nfs_async)
>> % stable = NFSV3WRITE_UNSTABLE;
>>
>> The v2 case can't tell what the client asked for, so it frobs `stable'
>> unconditionally here instead of later depending on the value of
>> `stable'.
>> But perhaps doing nothing here works OK, since the client does the
>> same frobbing (just as unconditionally even for v3-4).
>>
> Yep, as above. There is no reply argument and the NFSv2 client will
> assume FILESYNC. So, I believe doing nothing is fine.

No, this means that for async to work for v2, the server must do the
frobbing, and the sysctl is needed to enable this :-(.  I knew this,
but had forgotten the details and tried too hard to kill the sysctl.

>
>> % }
>> % ...
>> % if (v3) {
>> % ...
>> % /*
>> % * If nfs_async is set, then pretend the write was FILESYNC.
>> % */
>> % if (stable == NFSV3WRITE_UNSTABLE && !nfs_async)
>> % *tl++ = txdr_unsigned(stable);
>> % else
>> % *tl++ = txdr_unsigned(NFSV3WRITE_FILESYNC);
>>
>> Perhaps the hack in the client can be improved slightly by using the
>> same
>> logic as here -- only converting to FILESYNC if the server said
>> UNSTABLE.
>> I don't know if other states can actually happen.
>>
> There is one other state called NFSWRITE_METADATA (means the metadata
> has been sync'd but the data hasn't). The FreeBSD server never does this,
> but it seems that, if it is going to lie about it, it might as well
> lie about the METADATA case as well.

The new nfs server has no references to META-anything, but the old nfs
server has the following horribleness which includes lying about this:

nfs.h:
% /*
%  * The IO_METASYNC flag should be implemented for local filesystems.
%  * (Until then, it is nothin at all.)
%  */
% #ifndef IO_METASYNC
% #define IO_METASYNC	0
% #endif

nfsserv.c:
% 	    /*
% 	     * XXX
% 	     * The IO_METASYNC flag indicates that all metadata (and not just
% 	     * enough to ensure data integrity) mus be written to stable storage
% 	     * synchronously.
% 	     * (IO_METASYNC is not yet implemented in 4.4BSD-Lite.)
% 	     */
% 	    if (stable == NFSV3WRITE_UNSTABLE)
% 		ioflags = IO_NODELOCKED;
% 	    else if (stable == NFSV3WRITE_DATASYNC)
% 		ioflags = (IO_SYNC | IO_NODELOCKED);
% 	    else
% 		ioflags = (IO_METASYNC | IO_SYNC | IO_NODELOCKED);

Apparently, there are only 3 states; the 3rd one is METASYNC and is
handled by the else clause; but since IO_METASYNC doesn't exist in
4.4BSD-Lite (sic) and is hacked to 0, the else clause is just an
obfuscated spelling of the else if clause and there might as well be
only 2 states.

This used to be sort of correct, since METASYNC was the default for
ffs (sync metadata, async data, with no control over this short of
mounting sync to get globally sync data, and O_SYNC to get locally
sync data).  Now there are the following complications:
- async mounts should give async everything.  The nfs server has no
   understanding of these, so it will report FILESYNC although the
   underlying file system didn't even sync the metadata.  Even if
   it understood them, it has no way short of VOP_FSYNC() to force
   file systems to follow the client's requests for syncness.  (It
   does use VOP_FSYNC(), but I think not enough to work for every
   write.  Also, VOP_FSYNC() is usually buggy with async mounts and
   tends to sync all data but leave some metadata unsynced.)
- soft updates doesn't give sync metadata (only a good ordering of
   syncing)
- alternative file systems which have very buggy syncness (e.g.,
   msdosfs) or sophistication which probably includes different
   syncness (e.g., zfs).

Bruce