Fixes to allow write clustering of NFS writes from a FreeBSD NFS client

Bruce Evans brde at optusnet.com.au
Thu Aug 25 22:14:17 UTC 2011


On Thu, 25 Aug 2011, John Baldwin wrote:

> I was doing some analysis of compiles over NFS at work recently and noticed
> from 'iostat 1' on the NFS server that all my NFS writes were always 16k
> writes (meaning that writes were never being clustered).  I added some

Did you see the old patches for this by Bjorn Gronwall?  They went through
many iterations.  He was mainly interested in the !async case and I was
mainly interested in the async case...

> debugging sysctls to the NFS client and server code as well as the FFS write
> VOP to figure out the various kind of write requests that were being sent.  I
> found that during the NFS compile, the NFS client was sending a lot of
> FILESYNC writes even though nothing in the compile process uses fsync().
> Based on the debugging I added, I found that all of the FILESYNC writes were
> marked as such because the buffer in question did not have B_ASYNC set:
>
>
> 		if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == B_ASYNC)
> 		    iomode = NFSV3WRITE_UNSTABLE;
> 		else
> 		    iomode = NFSV3WRITE_FILESYNC;

The async-mounted case (that's async on the server; async on the client
bogusly succeeds and does nothing or less) should make the writes async in
all cases, and thus make the behaviour independent of the client's sync
requests (except hopefully fsync(2) on the client works -- it is mostly
unbroken in my ffs servers), and almost independent of the client's write
sizes (they should get grouped into large clusters after they have had
time to accumulate), and almost independent of the server's write clustering
at the nfs level (the vfs clustering level should do it).  However, there
was some problem with async too.

> I eventually tracked this down to the code in the NFS client that pushes out a
> previous dirty region via 'bwrite()' when a write would dirty a non-contiguous
> region in the buffer:
>
> 		if (bp->b_dirtyend > 0 &&
> 		    (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
> 			if (bwrite(bp) == EINTR) {
> 				error = EINTR;
> 				break;
> 			}
> 			goto again;
> 		}
>
> (These writes are triggered during the compile of a file by the assembler
> seeking back into the file it has already written out to apply various
> fixups.)

I mainly tested huge (~1GB) sequential writes, which is exactly the
opposite of this case.

> From this I concluded that the test above is flawed.  We should be using
> UNSTABLE writes for the writes above as the user has not requested them to
> be synchronous.  The issue (I believe) is that the NFS client is overloading
> the B_ASYNC flag.  The B_ASYNC flag means that the caller of bwrite()
> (or rather bawrite()) is not synchronously blocking to see if the request
> has completed.  Instead, it is a "fire and forget".  This is not the same
> thing as the IO_SYNC flag passed in ioflags during a write request which
> requests fsync()-like behavior.  To disambiguate the two I added a new
> B_SYNC flag and changed the NFS clients to set this for write requests
> with IO_SYNC set.  I then updated the condition above to instead check for
> B_SYNC being set rather than checking for B_ASYNC being clear.

Seems reasonable.

> That converted all the FILESYNC write RPCs from my builds into UNSTABLE
> write RPCs.  The patch for that is at
> http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch.
>
> However, even with this change I was still not getting clustered writes on
> the NFS server (all writes were still 16k).  After digging around in the
> code for a bit I found that ffs will only cluster writes if the passed in
> 'ioflags' to ffs_write() specify a sequential hint.  I then noticed that
> the NFS server has code to keep track of sequential I/O heuristics for
> reads, but not writes.  I took the code from the NFS server's read op

Bjorn's patches were mostly for this.  We also tried changing the
write gathering code and/or sysctls to control it.

> and moved it into a function to compute a sequential I/O heuristic that
> could be shared by both reads and writes.  I also updated the sequential
> heuristic code to advance the counter based on the number of 16k blocks
> in each write instead of just doing ++ to match what we do for local
> file writes in sequential_heuristic() in vfs_vnops.c.  Using this did
> give me some measure of NFS write clustering (though I can't peg my
> disks at MAXPHYS the way a dd to a file on a local filesystem can).  The

I got close to it.  The failure modes were mostly burstiness of i/o, where
the server buffer cache seemed to fill up so the client would stop sending
and stay stopped for too long (several seconds; enough to reduce the
throughput by 40-60%).

> patch for these changes is at
> http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch
>
> (This also fixes a bug in the new NFS server in that it wasn't actually
> clustering reads since it never updated nh->nh_nextr.)
>
> Combining the two changes together gave me about a 1% reduction in wall
> time for my builds:
>
> +------------------------------------------------------------------------------+
> |+                   +     ++    + +x++*x  xx+x    x                          x|
> |                 |___________A__|_M_______|_A____________|                    |
> +------------------------------------------------------------------------------+
>    N           Min           Max        Median           Avg        Stddev
> x  10       1869.62       1943.11       1881.89       1886.12     21.549724
> +  10       1809.71       1886.53       1869.26      1860.706     21.530664
> Difference at 95.0% confidence
>        -25.414 +/- 20.2391
>        -1.34742% +/- 1.07305%
>        (Student's t, pooled s = 21.5402)
>
> One caveat: I tested both of these patches on the old NFS client and server
> on 8.2-stable.  I then ported the changes to the new client and server and
> while I made sure they compiled, I have not tested the new client and server.

Here is the version of Bjorn's patches that I last used (in 8-current in
2008):

% Index: nfs_serv.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/nfsserver/nfs_serv.c,v
% retrieving revision 1.182
% diff -u -2 -r1.182 nfs_serv.c
% --- nfs_serv.c	28 May 2008 16:23:17 -0000	1.182
% +++ nfs_serv.c	1 Jun 2008 05:52:45 -0000
% @@ -107,14 +107,18 @@
%  #define MAX_COMMIT_COUNT	(1024 * 1024)
% 
% -#define NUM_HEURISTIC		1017
% +#define NUM_HEURISTIC		1031 /* Must be prime! */
% +#define MAX_REORDERED_RPC	4
% +#define HASH_MAXSTEP		0x3ff
%  #define NHUSE_INIT		64
%  #define NHUSE_INC		16
%  #define NHUSE_MAX		2048
% +#define NH_TAG(vp)		((uint32_t)((uintptr_t)vp / sizeof(struct vnode)))
% +CTASSERT(NUM_HEURISTIC > (HASH_MAXSTEP + 1));
% 
%  static struct nfsheur {
% -	struct vnode *nh_vp;	/* vp to match (unreferenced pointer) */
% -	off_t nh_nextr;		/* next offset for sequential detection */
% -	int nh_use;		/* use count for selection */
% -	int nh_seqcount;	/* heuristic */
% +	off_t nh_nextoff;	/* next offset for sequential detection */
% +	uint32_t nh_tag;	/* vp tag to match */
% +	uint16_t nh_use;	/* use count for selection */
% +	uint16_t nh_seqcount;	/* in units of logical blocks */
%  } nfsheur[NUM_HEURISTIC];
% 
% @@ -131,8 +135,14 @@
%  static int nfs_commit_blks;
%  static int nfs_commit_miss;
% +static int nfsrv_cluster_writes = 1;
% +static int nfsrv_cluster_reads = 1;
% +static int nfsrv_reordered_io;
%  SYSCTL_INT(_vfs_nfsrv, OID_AUTO, async, CTLFLAG_RW, &nfs_async, 0, "");
%  SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_blks, CTLFLAG_RW, &nfs_commit_blks, 0, "");
%  SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_miss, CTLFLAG_RW, &nfs_commit_miss, 0, "");
% 
% +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_writes, CTLFLAG_RW, &nfsrv_cluster_writes, 0, "");
% +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_reads, CTLFLAG_RW, &nfsrv_cluster_reads, 0, "");
% +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, reordered_io, CTLFLAG_RW, &nfsrv_reordered_io, 0, "");
%  struct nfsrvstats nfsrvstats;
%  SYSCTL_STRUCT(_vfs_nfsrv, NFS_NFSRVSTATS, nfsrvstats, CTLFLAG_RW,
% @@ -145,4 +155,73 @@
% 
%  /*
% + * Detect sequential access so that we are able to hint the underlying
% + * file system to use clustered I/O when appropriate.
% + */
% +static int
% +nfsrv_sequential_access(const struct uio *uio, const struct vnode *vp)
% +{
% +	struct nfsheur *nh;
% +	unsigned hi, step;
% +	int try = 8;
% +	int nblocks, lblocksize;
% +
% +	/*
% +	 * Locate best nfsheur[] candidate using double hashing.
% +	 */
% +
% +	hi =   NH_TAG(vp) % NUM_HEURISTIC;
% +	step = NH_TAG(vp) & HASH_MAXSTEP;
% +	step++;			/* Step must not be zero. */
% +	nh = &nfsheur[hi];
% +
% +	while (try--) {
% +		if (nfsheur[hi].nh_tag == NH_TAG(vp)) {
% +			nh = &nfsheur[hi];
% +			break;
% +		}
% +		if (nfsheur[hi].nh_use > 0)
% +			--nfsheur[hi].nh_use;
% +		hi = hi + step;
% +		if (hi >= NUM_HEURISTIC)
% +			hi -= NUM_HEURISTIC;
% +		if (nfsheur[hi].nh_use < nh->nh_use)
% +			nh = &nfsheur[hi];
% +	}
% +
% +	if (nh->nh_tag != NH_TAG(vp)) {	/* New entry. */
% +		nh->nh_tag = NH_TAG(vp);
% +		nh->nh_nextoff = uio->uio_offset;
% +		nh->nh_use = NHUSE_INIT;
% +		nh->nh_seqcount = 1; /* Initially assume sequential access. */
% +	} else {
% +		nh->nh_use += NHUSE_INC;
% +		if (nh->nh_use > NHUSE_MAX)
% +			nh->nh_use = NHUSE_MAX;
% +	}
% +
% +	/*
% +	 * Calculate heuristic
% +	 */
% +
% +	lblocksize = vp->v_mount->mnt_stat.f_iosize;
% +	nblocks = howmany(uio->uio_resid, lblocksize);
% +	if (uio->uio_offset == nh->nh_nextoff) {
% +		nh->nh_seqcount += nblocks;
% +		if (nh->nh_seqcount > IO_SEQMAX)
% +			nh->nh_seqcount = IO_SEQMAX;
% +	} else if (uio->uio_offset == 0) {
% +		/* Seek to beginning of file, ignored. */
% +	} else if (qabs(uio->uio_offset - nh->nh_nextoff) <=
% +		   MAX_REORDERED_RPC*imax(lblocksize, uio->uio_resid)) {
% +		nfsrv_reordered_io++; /* Probably reordered RPC, do nothing. */
% +	} else
% +		nh->nh_seqcount /= 2; /* Not sequential access. */
% +
% +	nh->nh_nextoff = uio->uio_offset + uio->uio_resid;
% +
% +	return (nh->nh_seqcount << IO_SEQSHIFT);
% +}
% +
% +/*
%   * Clear nameidata fields that are tested in nsfmout cleanup code prior
%   * to using first nfsm macro (that might jump to the cleanup code).
% @@ -785,5 +864,4 @@
%  	struct uio io, *uiop = &io;
%  	struct vattr va, *vap = &va;
% -	struct nfsheur *nh;
%  	off_t off;
%  	int ioflag = 0;
% @@ -857,59 +935,4 @@
%  		cnt = reqlen;
% 
% -	/*
% -	 * Calculate seqcount for heuristic
% -	 */
% -
% -	{
% -		int hi;
% -		int try = 32;
% -
% -		/*
% -		 * Locate best candidate
% -		 */
% -
% -		hi = ((int)(vm_offset_t)vp / sizeof(struct vnode)) % NUM_HEURISTIC;
% -		nh = &nfsheur[hi];
% -
% -		while (try--) {
% -			if (nfsheur[hi].nh_vp == vp) {
% -				nh = &nfsheur[hi];
% -				break;
% -			}
% -			if (nfsheur[hi].nh_use > 0)
% -				--nfsheur[hi].nh_use;
% -			hi = (hi + 1) % NUM_HEURISTIC;
% -			if (nfsheur[hi].nh_use < nh->nh_use)
% -				nh = &nfsheur[hi];
% -		}
% -
% -		if (nh->nh_vp != vp) {
% -			nh->nh_vp = vp;
% -			nh->nh_nextr = off;
% -			nh->nh_use = NHUSE_INIT;
% -			if (off == 0)
% -				nh->nh_seqcount = 4;
% -			else
% -				nh->nh_seqcount = 1;
% -		}
% -
% -		/*
% -		 * Calculate heuristic
% -		 */
% -
% -		if ((off == 0 && nh->nh_seqcount > 0) || off == nh->nh_nextr) {
% -			if (++nh->nh_seqcount > IO_SEQMAX)
% -				nh->nh_seqcount = IO_SEQMAX;
% -		} else if (nh->nh_seqcount > 1) {
% -			nh->nh_seqcount = 1;
% -		} else {
% -			nh->nh_seqcount = 0;
% -		}
% -		nh->nh_use += NHUSE_INC;
% -		if (nh->nh_use > NHUSE_MAX)
% -			nh->nh_use = NHUSE_MAX;
% -		ioflag |= nh->nh_seqcount << IO_SEQSHIFT;
% -        }
% -
%  	nfsm_reply(NFSX_POSTOPORFATTR(v3) + 3 * NFSX_UNSIGNED+nfsm_rndup(cnt));
%  	if (v3) {
% @@ -969,7 +992,8 @@
%  		uiop->uio_rw = UIO_READ;
%  		uiop->uio_segflg = UIO_SYSSPACE;
% +		if (nfsrv_cluster_reads)
% +			ioflag |= nfsrv_sequential_access(uiop, vp);
%  		error = VOP_READ(vp, uiop, IO_NODELOCKED | ioflag, cred);
%  		off = uiop->uio_offset;
% -		nh->nh_nextr = off;
%  		FREE((caddr_t)iv2, M_TEMP);
%  		if (error || (getret = VOP_GETATTR(vp, vap, cred, td))) {
% @@ -1177,4 +1201,6 @@
%  	    uiop->uio_td = NULL;
%  	    uiop->uio_offset = off;
% +	    if (nfsrv_cluster_writes)
% +		ioflags |= nfsrv_sequential_access(uiop, vp);
%  	    error = VOP_WRITE(vp, uiop, ioflags, cred);
%  	    /* XXXRW: unlocked write. */
% @@ -1225,4 +1251,5 @@
%  	vn_finished_write(mntp);
%  	VFS_UNLOCK_GIANT(vfslocked);
% +	bwillwrite();	    /* After VOP_WRITE to avoid reordering. */
%  	return(error);
%  }
% @@ -1492,4 +1519,6 @@
%  		    }
%  		    if (!error) {
% +	    		if (nfsrv_cluster_writes)
% +				ioflags |= nfsrv_sequential_access(uiop, vp);
%  			error = VOP_WRITE(vp, uiop, ioflags, cred);
%  			/* XXXRW: unlocked write. */
% @@ -1582,4 +1611,5 @@
%  		}
%  	splx(s);
% +	bwillwrite();	    /* After VOP_WRITE to avoid reordering. */
%  	return (0);
%  }
% @@ -3827,5 +3857,9 @@
%  	for_ret = VOP_GETATTR(vp, &bfor, cred, td);
% 
% -	if (cnt > MAX_COMMIT_COUNT) {
% +	/*
% +	 * If count is 0, a flush from offset to the end of file
% +	 * should be performed according to RFC 1813.
% +	 */
% +	if (cnt == 0 || cnt > MAX_COMMIT_COUNT) {
%  		/*
%  		 * Give up and do the whole thing
% @@ -3871,5 +3905,5 @@
%  		bo = &vp->v_bufobj;
%  		BO_LOCK(bo);
% -		while (cnt > 0) {
% +		while (!error && cnt > 0) {
%  			struct buf *bp;
% 
% @@ -3894,5 +3928,5 @@
%  					bremfree(bp);
%  					bp->b_flags &= ~B_ASYNC;
% -					bwrite(bp);
% +					error = bwrite(bp);
%  					++nfs_commit_miss;
%  				} else
% Index: nfs_syscalls.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/nfsserver/Attic/nfs_syscalls.c,v
% retrieving revision 1.119
% diff -u -2 -r1.119 nfs_syscalls.c
% --- nfs_syscalls.c	30 Jun 2008 20:43:06 -0000	1.119
% +++ nfs_syscalls.c	2 Jul 2008 07:12:57 -0000
% @@ -86,5 +86,4 @@
%  int		nfsd_waiting = 0;
%  int		nfsrv_numnfsd = 0;
% -static int	notstarted = 1;
% 
%  static int	nfs_privport = 0;
% @@ -448,7 +447,6 @@
%  			    procrastinate = nfsrvw_procrastinate;
%  			NFSD_UNLOCK();
% -			if (writes_todo || (!(nd->nd_flag & ND_NFSV3) &&
% -			    nd->nd_procnum == NFSPROC_WRITE &&
% -			    procrastinate > 0 && !notstarted))
% +			if (writes_todo || (nd->nd_procnum == NFSPROC_WRITE &&
% +			    procrastinate > 0))
%  			    error = nfsrv_writegather(&nd, slp,
%  				nfsd->nfsd_td, &mreq);

I have forgotten many details but can dig them out from large private
mails about this if you care.  It looks like most of the above is about
getting the seqcount right; probably similar to what you did.  -current
is still missing the error check for the bwrite() -- that's the only
bwrite() in the whole server.  However, the fix isn't good -- one obvious
bug is that a later successful bwrite() destroys the evidence of a
previous bwrite() error.

Bruce
