[RFC] Should vfs.nfsrv.async be implemented for new NFS server?

Sat Nov 19 14:50:06 UTC 2011

Bruce Evans wrote:
> On Sat, 5 Nov 2011, Rick Macklem wrote:
> 
> I was away on 5 Nov...
> 
> > Josh Paetzel pointed out that vfs.nfsrv.async doesn't exist
> > for the new NFS server.
> 
> Excellent! It's existence for the old nfs server is a bug. The correct
> way to enable async is to mount the client async. nfsclient only has
> a tiny amount of support for this, but it seems to be suficient:
> 
> % int
> % ncl_writerpc(struct vnode *vp, struct uio *uiop, struct ucred *cred,
> % int *iomode, int *must_commit, int called_from_strategy)
> % {
> % ...
> % if (vp->v_mount->mnt_kern_flag & MNTK_ASYNC)
> % *iomode = NFSWRITE_FILESYNC;
> % if (error && NFS_ISV4(vp))
> % error = nfscl_maperr(uiop->uio_td, error, (uid_t)0, (gid_t)0);
> % return (error);
> % }
> 
> (The old nfs server is similar.) As far as I understand this (not very
> far), this "works" by pretending that the server completed the sync
> even when the server said otherwise. This is far from being correct,
> but a correct version seems too hard since it would require servers
> (not just FreeBSD ones) to actually understand asyncness.
> 
> The sysctl shouldn't exist because it just causes the same pretense
> to happen on the server. This just applies the hack too globally
> -- it now affects all mount instances on all clients. However,
> there may be a problem for nfsv2. I used to think that v3 clients
> were much less hackish than the above, and implemented async mounts
> by not asking servers to do a full NFS[V3]WRITE_FILESYNC for async
> mounts. nfsv2 clients can't do that since the nvfs2 protocol is too
> simple to be able to request different types and/or levels of syncing.
> Now I think that the above hack works for v2 too, but is just a hack.
> 
> > I don't think I had spotted this before, but when I looked I
> > saw that, when vfs.nfsrv.async is set non-zero in the old server,
> > it returns FILESYNC (which means the write has been committed to
> > non-volatile storage) even when it hasn't actually done that.
> >
> > This can improve performance, but has some negative implications:
> > - If the server crashes before the write is committed to
> >  non-volatile storage, the file modification will be lost.
> >  (When a server replies UNSTABLE to a write, the client holds
> >   onto the data in its cache and does the write again if the
> >   server crashes/reboots before the client does a Commit RPC
> >   for the file. However, a reply of FILESYNC tells the client
> >   it can forget about the write, because it is done.)
> > - Because of the above, replying FILESYNC when the data is not
> >  yet committed to non-volatile (also referred to as stable)
> >  storage, this is a violation of RFC1813.
> >
> > I wouldn't want this to be the default, but am willing to
> > patch head based on the "backwards compatibility" argument.
> > My concern with these types of patches is that some people
> > will enable them without realizing the risk of data loss
> > that they introduce.
> >
> > So, how do others feel with respect to whether or not this
> > patch should be committed to head?
> >
> > Thanks in advance for any comments, rick
> > ps: Here's the patch, just in case anyone is interested.
> > --- fs/nfsserver/nfs_nfsdserv.c.sav 2011-11-05 10:29:38.000000000
> > -0400
> > +++ fs/nfsserver/nfs_nfsdserv.c 2011-11-05 10:37:40.000000000 -0400
> > @@ -55,6 +55,11 @@ extern int nfs_rootfhset;
> > extern int nfsrv_enable_crossmntpt;
> > #endif /* !APPLEKEXT */
> >
> > +static int nfs_async = 0;
> > +SYSCTL_DECL(_vfs_nfsd);
> > +SYSCTL_INT(_vfs_nfsd, OID_AUTO, async, CTLFLAG_RW, &nfs_async, 0,
> > + "Tell client that writes were synced even though they were not");
> > +
> > /*
> >  * This list defines the GSS mechanisms supported.
> >  * (Don't ask me how you get these strings from the RFC stuff like
> > @@ -912,7 +917,13 @@ nfsrvd_write(struct nfsrv_descript *nd,
> > 			goto out;
> > 		NFSM_BUILD(tl, u_int32_t *, 4 * NFSX_UNSIGNED);
> > 		*tl++ = txdr_unsigned(retlen);
> > - if (stable == NFSWRITE_UNSTABLE)
> > + /*
> > + * If nfs_async is set, then pretend the write was FILESYNC.
> > + * Warning: Doing this violates RFC1813 and runs a risk
> > + * of data written by a client being lost when the server
> > + * crashes/reboots.
> > + */
> > + if (stable == NFSWRITE_UNSTABLE && nfs_async == 0)
> > 			*tl++ = txdr_unsigned(stable);
> > 		else
> > 			*tl++ = txdr_unsigned(NFSWRITE_FILESYNC);
> 
> This is for the v3 and v4 cases, the same as in the old nfs server
> where its
> existence is clearly just a bug. This is missing support for the v2
> case
> where its existence is not so clearly just a bug. Here is the code for
> the
> v2 case in the old nfs server:
> 
An NFSv2 write is always "FILESYNC". There is no reply argument to say
otherwise. The concept of NFSWRITE_UNSTABLE was added for NFSv3.

> % int
> % nfsrv_write(struct nfsrv_descript *nfsd, struct nfssvc_sock *slp,
> % struct mbuf **mrq)
> % {
> % ...
> % if (v3) {
> % tl = nfsm_dissect_nonblock(u_int32_t *, 5 * NFSX_UNSIGNED);
> % off = fxdr_hyper(tl);
> % tl += 3;
> % stable = fxdr_unsigned(int, *tl++);
> % } else {
> % tl = nfsm_dissect_nonblock(u_int32_t *, 4 * NFSX_UNSIGNED);
> % off = (off_t)fxdr_unsigned(u_int32_t, *++tl);
> % tl += 2;
> % if (nfs_async)
> % stable = NFSV3WRITE_UNSTABLE;
> 
> The v2 case can't tell what the client asked for, so it frobs `stable'
> unconditionally here instead of later depending on the value of
> `stable'.
> But perhaps doing nothing here works OK, since the client does the
> same frobbing (just as unconditionally even for v3-4).
> 
Yep, as above. There is no reply argument and the NFSv2 client will
assume FILESYNC. So, I believe doing nothing is fine.

> % }
> % ...
> % if (v3) {
> % ...
> % /*
> % * If nfs_async is set, then pretend the write was FILESYNC.
> % */
> % if (stable == NFSV3WRITE_UNSTABLE && !nfs_async)
> % *tl++ = txdr_unsigned(stable);
> % else
> % *tl++ = txdr_unsigned(NFSV3WRITE_FILESYNC);
> 
> Perhaps the hack in the client can be improved slightly by using the
> same
> logic as here -- only converting to FILESYNC if the server said
> UNSTABLE.
> I don't know if other states can actually happen.
> 
There is one other state called NFSWRITE_METADATA (means the metadata
has been sync'd but the data hasn't). The FreeBSD server never does this,
but it seems that, if it is going to lie about it, it might as well
lie about the METADATA case as well.

rick