Fixes to allow write clustering of NFS writes from a FreeBSD NFS client

John Baldwin jhb at freebsd.org
Thu Aug 25 21:09:31 UTC 2011


On Thursday, August 25, 2011 3:24:15 pm Bruce Evans wrote:
> On Thu, 25 Aug 2011, John Baldwin wrote:
> 
> > I was doing some analysis of compiles over NFS at work recently and noticed
> > from 'iostat 1' on the NFS server that all my NFS writes were always 16k
> > writes (meaning that writes were never being clustered).  I added some
> 
> Did you see the old patches for this by Bjorn Gronwall?  They went through
> many iterations.  He was mainly interested in the !async case and I was
> mainly interested in the async case...

Ah, no I had not seen these, thanks.

> > and moved it into a function to compute a sequential I/O heuristic that
> > could be shared by both reads and writes.  I also updated the sequential
> > heuristic code to advance the counter based on the number of 16k blocks
> > in each write instead of just doing ++ to match what we do for local
> > file writes in sequential_heuristic() in vfs_vnops.c.  Using this did
> > give me some measure of NFS write clustering (though I can't peg my
> > disks at MAXPHYS the way a dd to a file on a local filesystem can).  The
> 
> I got close to it.  The failure modes were mostly burstiness of i/o, where
> the server buffer cache seemed to fill up so the client would stop sending
> and stay stopped for too long (several seconds; enough to reduce the
> throughput by 40-60%).

Hmm, I can get writes up to around 40-50k, but not 128k.  My test is to just
dd from /dev/zero to a file on the NFS client using a blocksize of 64k or so.
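
For reference, the test was roughly the following; /mnt/nfs and the file
name are just stand-ins for the actual NFS mount point and target file:

	# on the NFS client; writes about 1 GB sequentially
	dd if=/dev/zero of=/mnt/nfs/zeroes bs=64k count=16384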

> > patch for these changes is at
> > http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch
> >
> > (This also fixes a bug in the new NFS server in that it wasn't actually
> > clustering reads since it never updated nh->nh_nextr.)
> >
> > Combining the two changes together gave me about a 1% reduction in wall
> > time for my builds:
> >
> > +------------------------------------------------------------------------------+
> > |+                   +     ++    + +x++*x  xx+x    x                          x|
> > |                 |___________A__|_M_______|_A____________|                    |
> > +------------------------------------------------------------------------------+
> >    N           Min           Max        Median           Avg        Stddev
> > x  10       1869.62       1943.11       1881.89       1886.12     21.549724
> > +  10       1809.71       1886.53       1869.26      1860.706     21.530664
> > Difference at 95.0% confidence
> >        -25.414 +/- 20.2391
> >        -1.34742% +/- 1.07305%
> >        (Student's t, pooled s = 21.5402)
> >
> > One caveat: I tested both of these patches on the old NFS client and server
> > on 8.2-stable.  I then ported the changes to the new client and server and
> > while I made sure they compiled, I have not tested the new client and server.
> 
> Here is the version of Bjorn's patches that I last used (in 8-current in
> 2008):
> 
> % Index: nfs_serv.c
> % ===================================================================
> % RCS file: /home/ncvs/src/sys/nfsserver/nfs_serv.c,v
> % retrieving revision 1.182
> % diff -u -2 -r1.182 nfs_serv.c
> % --- nfs_serv.c	28 May 2008 16:23:17 -0000	1.182
> % +++ nfs_serv.c	1 Jun 2008 05:52:45 -0000
> % @@ -107,14 +107,18 @@
> %  #define MAX_COMMIT_COUNT	(1024 * 1024)
> % 
> % -#define NUM_HEURISTIC		1017
> % +#define NUM_HEURISTIC		1031 /* Must be prime! */
> % +#define MAX_REORDERED_RPC	4
> % +#define HASH_MAXSTEP		0x3ff
> %  #define NHUSE_INIT		64
> %  #define NHUSE_INC		16
> %  #define NHUSE_MAX		2048
> % +#define NH_TAG(vp)		((uint32_t)((uintptr_t)vp / sizeof(struct vnode)))
> % +CTASSERT(NUM_HEURISTIC > (HASH_MAXSTEP + 1));

Hmm, aside from 1017 not being prime (3 * 339), I'm not sure what the reasons
for the rest of these changes are.

> % 
> %  static struct nfsheur {
> % -	struct vnode *nh_vp;	/* vp to match (unreferenced pointer) */
> % -	off_t nh_nextr;		/* next offset for sequential detection */
> % -	int nh_use;		/* use count for selection */
> % -	int nh_seqcount;	/* heuristic */
> % +	off_t nh_nextoff;	/* next offset for sequential detection */
> % +	uint32_t nh_tag;	/* vp tag to match */
> % +	uint16_t nh_use;	/* use count for selection */
> % +	uint16_t nh_seqcount;	/* in units of logical blocks */
> %  } nfsheur[NUM_HEURISTIC];

Hmm, I'm not sure why this stores only a 32-bit tag and uses uint16_t
fields here.  The size difference is a few KB at best, and I'd rather store
the full vnode pointer to avoid oddities from hash collisions.
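
To make the collision concern concrete, here is a minimal side-by-side
fragment paraphrasing the old code and Bjorn's patch; nh, vp, and NH_TAG()
are as in the quoted code, and "match" is just an illustrative local:

	/*
	 * Old nfsheur match: compare the (unreferenced) vnode pointer
	 * itself, so two different vnodes never look like the same file.
	 */
	match = (nh->nh_vp == vp);

	/*
	 * Bjorn's match: compare a 32-bit tag derived from the pointer.
	 * On a 64-bit kernel the (uint32_t) truncation in NH_TAG() means
	 * two distinct vnodes could, in principle, share a tag and then
	 * silently mix their sequential-access state in one slot.
	 */
	match = (nh->nh_tag == NH_TAG(vp));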

> % @@ -131,8 +135,14 @@
> %  static int nfs_commit_blks;
> %  static int nfs_commit_miss;
> % +static int nfsrv_cluster_writes = 1;
> % +static int nfsrv_cluster_reads = 1;
> % +static int nfsrv_reordered_io;
> %  SYSCTL_INT(_vfs_nfsrv, OID_AUTO, async, CTLFLAG_RW, &nfs_async, 0, "");
> %  SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_blks, CTLFLAG_RW, &nfs_commit_blks, 0, "");
> %  SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_miss, CTLFLAG_RW, &nfs_commit_miss, 0, "");
> % 
> % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_writes, CTLFLAG_RW, &nfsrv_cluster_writes, 0, "");
> % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_reads, CTLFLAG_RW, &nfsrv_cluster_reads, 0, "");
> % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, reordered_io, CTLFLAG_RW, &nfsrv_reordered_io, 0, "");
> %  struct nfsrvstats nfsrvstats;
> %  SYSCTL_STRUCT(_vfs_nfsrv, NFS_NFSRVSTATS, nfsrvstats, CTLFLAG_RW,
> % @@ -145,4 +155,73 @@
> % 
> %  /*
> % + * Detect sequential access so that we are able to hint the underlying
> % + * file system to use clustered I/O when appropriate.
> % + */
> % +static int
> % +nfsrv_sequential_access(const struct uio *uio, const struct vnode *vp)
> % +{
> % +	struct nfsheur *nh;
> % +	unsigned hi, step;
> % +	int try = 8;
> % +	int nblocks, lblocksize;
> % +
> % +	/*
> % +	 * Locate best nfsheur[] candidate using double hashing.
> % +	 */
> % +
> % +	hi =   NH_TAG(vp) % NUM_HEURISTIC;
> % +	step = NH_TAG(vp) & HASH_MAXSTEP;
> % +	step++;			/* Step must not be zero. */
> % +	nh = &nfsheur[hi];

I can't speak to whether using a variable step makes an appreciable
difference.  I have not examined that in detail in my tests.

> % +
> % +	while (try--) {
> % +		if (nfsheur[hi].nh_tag == NH_TAG(vp)) {
> % +			nh = &nfsheur[hi];
> % +			break;
> % +		}
> % +		if (nfsheur[hi].nh_use > 0)
> % +			--nfsheur[hi].nh_use;
> % +		hi = hi + step;
> % +		if (hi >= NUM_HEURISTIC)
> % +			hi -= NUM_HEURISTIC;
> % +		if (nfsheur[hi].nh_use < nh->nh_use)
> % +			nh = &nfsheur[hi];
> % +	}
> % +
> % +	if (nh->nh_tag != NH_TAG(vp)) {	/* New entry. */
> % +		nh->nh_tag = NH_TAG(vp);
> % +		nh->nh_nextoff = uio->uio_offset;
> % +		nh->nh_use = NHUSE_INIT;
> % +		nh->nh_seqcount = 1; /* Initially assume sequential access. */
> % +	} else {
> % +		nh->nh_use += NHUSE_INC;
> % +		if (nh->nh_use > NHUSE_MAX)
> % +			nh->nh_use = NHUSE_MAX;
> % +	}
> % +
> % +	/*
> % +	 * Calculate heuristic
> % +	 */
> % +
> % +	lblocksize = vp->v_mount->mnt_stat.f_iosize;
> % +	nblocks = howmany(uio->uio_resid, lblocksize);

This is similar to what I pulled out of sequential_heuristic(), except
that it doesn't hardcode 16k.  There is a big comment above that 16k
constant saying it isn't really about the block size, though, so I'm not
sure which is more correct.  Either way, I imagine we'd want to use the
same strategy in both places (a rough sketch of a shared helper follows
the quoted comment below).  Comment from vfs_vnops.c:

		/*
		 * f_seqcount is in units of fixed-size blocks so that it
		 * depends mainly on the amount of sequential I/O and not
		 * much on the number of sequential I/O's.  The fixed size
		 * of 16384 is hard-coded here since it is (not quite) just
		 * a magic size that works well here.  This size is more
		 * closely related to the best I/O size for real disks than
		 * to any block size used by software.
		 */
		fp->f_seqcount += howmany(uio->uio_resid, 16384);
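
Purely as a sketch of the "same strategy in both places" idea, the block
size question could be pushed into one shared helper; the name below and
the penalty for non-sequential access are placeholders, not something
either patch actually does:

	/*
	 * Hypothetical shared helper: advance a seqcount in units of
	 * blksize-sized chunks for sequential I/O, capped at IO_SEQMAX.
	 * Local file I/O could pass the historical 16384; NFS could pass
	 * f_iosize (or also 16384, if the vfs_vnops.c comment is right and
	 * the constant isn't really a block size at all).
	 */
	static int
	seq_advance(int seqcount, off_t expected, struct uio *uio, int blksize)
	{

		if (uio->uio_offset == expected) {
			seqcount += howmany(uio->uio_resid, blksize);
			if (seqcount > IO_SEQMAX)
				seqcount = IO_SEQMAX;
		} else if (uio->uio_offset != 0)
			seqcount = 1;	/* placeholder penalty */
		return (seqcount);
	}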

> % +	if (uio->uio_offset == nh->nh_nextoff) {
> % +		nh->nh_seqcount += nblocks;
> % +		if (nh->nh_seqcount > IO_SEQMAX)
> % +			nh->nh_seqcount = IO_SEQMAX;
> % +	} else if (uio->uio_offset == 0) {
> % +		/* Seek to beginning of file, ignored. */
> % +	} else if (qabs(uio->uio_offset - nh->nh_nextoff) <=
> % +		   MAX_REORDERED_RPC*imax(lblocksize, uio->uio_resid)) {
> % +		nfsrv_reordered_io++; /* Probably reordered RPC, do nothing. */

Ah, this is a nice touch!  I had noticed reordered I/Os resetting my
clustered I/O count.  I should try this extra step.

> % +	} else
> % +		nh->nh_seqcount /= 2; /* Not sequential access. */

Hmm, this is a bit different as well.  sequential_heuristic() just drops
all clustering (seqcount = 1) here, so I had followed that.  I do wonder
whether this gentler decay would be good for "normal" I/O as well.  (Again,
I think "normal" I/O and NFS should generally use the same algorithm,
perhaps with the extra logic to handle reordered writes more gracefully
on the NFS side.)
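
To make that concrete, the gentler draw-down in sequential_heuristic()
might look something like this (just a sketch; "not_sequential" stands in
for whatever condition ends up meaning "this I/O was not sequential"):

	if (not_sequential) {
		fp->f_seqcount /= 2;		/* decay instead of resetting */
		if (fp->f_seqcount < 1)
			fp->f_seqcount = 1;	/* but keep the floor of 1 */
	}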

> % +
> % +	nh->nh_nextoff = uio->uio_offset + uio->uio_resid;

Interesting.  So this assumes the I/O never fails.

> % @@ -1225,4 +1251,5 @@
> %  	vn_finished_write(mntp);
> %  	VFS_UNLOCK_GIANT(vfslocked);
> % +	bwillwrite();	    /* After VOP_WRITE to avoid reordering. */
> %  	return(error);
> %  }

Hmm, I assume this is meant to keep writes from overloading the NFS
server's buffer cache?

> % @@ -1492,4 +1519,6 @@
> %  		    }
> %  		    if (!error) {
> % +	    		if (nfsrv_cluster_writes)
> % +				ioflags |= nfsrv_sequential_access(uiop, vp);
> %  			error = VOP_WRITE(vp, uiop, ioflags, cred);
> %  			/* XXXRW: unlocked write. */
> % @@ -1582,4 +1611,5 @@
> %  		}
> %  	splx(s);
> % +	bwillwrite();	    /* After VOP_WRITE to avoid reordering. */
> %  	return (0);
> %  }

Ah, this code is no longer present in 8 (but is in 7).

> % @@ -3827,5 +3857,9 @@
> %  	for_ret = VOP_GETATTR(vp, &bfor, cred, td);
> % 
> % -	if (cnt > MAX_COMMIT_COUNT) {
> % +	/*
> % +	 * If count is 0, a flush from offset to the end of file
> % +	 * should be performed according to RFC 1813.
> % +	 */
> % +	if (cnt == 0 || cnt > MAX_COMMIT_COUNT) {
> %  		/*
> %  		 * Give up and do the whole thing

This appears to be a separate, standalone fix.

> % @@ -3871,5 +3905,5 @@
> %  		bo = &vp->v_bufobj;
> %  		BO_LOCK(bo);
> % -		while (cnt > 0) {
> % +		while (!error && cnt > 0) {
> %  			struct buf *bp;
> % 
> % @@ -3894,5 +3928,5 @@
> %  					bremfree(bp);
> %  					bp->b_flags &= ~B_ASYNC;
> % -					bwrite(bp);
> % +					error = bwrite(bp);
> %  					++nfs_commit_miss;

I think you can just do a 'break' here without having to modify the while
condition.  This also seems to be an unrelated bugfix.
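
Roughly this, ignoring whatever buffer object lock state would have to be
restored on the way out of the loop (untested):

	error = bwrite(bp);
	++nfs_commit_miss;
	if (error != 0)
		break;	/* Stop flushing on the first failure. */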

> %  				} else
> % Index: nfs_syscalls.c
> % ===================================================================
> % RCS file: /home/ncvs/src/sys/nfsserver/Attic/nfs_syscalls.c,v
> % retrieving revision 1.119
> % diff -u -2 -r1.119 nfs_syscalls.c
> % --- nfs_syscalls.c	30 Jun 2008 20:43:06 -0000	1.119
> % +++ nfs_syscalls.c	2 Jul 2008 07:12:57 -0000
> % @@ -86,5 +86,4 @@
> %  int		nfsd_waiting = 0;
> %  int		nfsrv_numnfsd = 0;
> % -static int	notstarted = 1;
> % 
> %  static int	nfs_privport = 0;
> % @@ -448,7 +447,6 @@
> %  			    procrastinate = nfsrvw_procrastinate;
> %  			NFSD_UNLOCK();
> % -			if (writes_todo || (!(nd->nd_flag & ND_NFSV3) &&
> % -			    nd->nd_procnum == NFSPROC_WRITE &&
> % -			    procrastinate > 0 && !notstarted))
> % +			if (writes_todo || (nd->nd_procnum == NFSPROC_WRITE &&
> % +			    procrastinate > 0))
> %  			    error = nfsrv_writegather(&nd, slp,
> %  				nfsd->nfsd_td, &mreq);

This no longer seems to be present in 8.

> I have forgotten many details but can dig them out from large private
> mails about this if you care.  It looks like most of the above is about
> getting the seqcount write; probably similar to what you did.  -current
> is still missing the error check for the bwrite() -- that's the only
> bwrite() in the whole server.  However, the fix isn't good -- one obvious
> bug is that a later successful bwrite() destroys the evidence of a
> previous bwrite() error.

One thing I had done was to use a separate set of heuristics for reading vs
writing.  However, that is possibly dubious (and we don't do it for local
I/O), so I can easily drop that feature if desired.

-- 
John Baldwin

