Fixes to allow write clustering of NFS writes from a FreeBSD NFS client

Thu Aug 25 20:45:50 UTC 2011

Hi John,

   This is an interesting fix. If I can, I'll patch a few systems and
give it a try.

   I don't know if this would help for timing comparisons, but years
ago we used to run build work directly against our NFS storage. In
general, we moved away from that to a two stage approach:

cc foo.c -o /tmp/foo.o  # where /tmp is a memory filesystem
cp /tmp/foo.o /nfs/mounted/target/area/foo.o

   This provided a very large performance boost. It's worth noting
that different compilers require different levels of arm-wrestling to
convince them to use the file specified with -o correctly (and directly).

   With a simple .mk file change you could probably get an up-to-date
comparison of the current system vs. your patch vs. sequential I/O only.

   I'll let you know what I find and if we see any regressions.

Thanks,
John

----- John Baldwin's Original Message -----
> I was doing some analysis of compiles over NFS at work recently and noticed
> from 'iostat 1' on the NFS server that all my NFS writes were always 16k
> writes (meaning that writes were never being clustered).  I added some
> debugging sysctls to the NFS client and server code as well as the FFS write
> VOP to figure out the various kinds of write requests that were being sent.  I
> found that during the NFS compile, the NFS client was sending a lot of
> FILESYNC writes even though nothing in the compile process uses fsync().
> Based on the debugging I added, I found that all of the FILESYNC writes were
> marked as such because the buffer in question did not have B_ASYNC set:
> 
> 
> 		if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == B_ASYNC)
> 		    iomode = NFSV3WRITE_UNSTABLE;
> 		else
> 		    iomode = NFSV3WRITE_FILESYNC;
> 
> I eventually tracked this down to the code in the NFS client that pushes out a
> previous dirty region via 'bwrite()' when a write would dirty a non-contiguous
> region in the buffer:
> 
> 		if (bp->b_dirtyend > 0 &&
> 		    (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
> 			if (bwrite(bp) == EINTR) {
> 				error = EINTR;
> 				break;
> 			}
> 			goto again;
> 		}
> 
> (These writes are triggered during the compile of a file by the assembler
> seeking back into the file it has already written out to apply various
> fixups.)
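To make the flush condition above concrete, here is a minimal stand-alone sketch of the same test outside the kernel. The struct dirty_range type and the must_flush_first() name are hypothetical; only the comparison itself is taken from the quoted code.

```c
#include <stdbool.h>

/* Hypothetical mirror of the two buffer fields used by the quoted test. */
struct dirty_range {
	int dirtyoff;	/* first dirty byte within the buffer */
	int dirtyend;	/* end of the dirty region; 0 means nothing is dirty */
};

/*
 * Returns true when a write of n bytes at buffer offset on neither
 * overlaps nor touches the existing dirty region, i.e. the case in
 * which the quoted code pushes the old region out with bwrite().
 */
static bool
must_flush_first(const struct dirty_range *r, int on, int n)
{
	return (r->dirtyend > 0 &&
	    (on > r->dirtyend || (on + n) < r->dirtyoff));
}
```

With a dirty region covering bytes 4096 to 8192, an append at offset 8192 merges cleanly, while an 8-byte fixup at offset 100 (the assembler case) forces the flush.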
> 
> From this I concluded that the test above is flawed.  We should be using
> UNSTABLE writes for the writes above as the user has not requested them to
> be synchronous.  The issue (I believe) is that the NFS client is overloading
> the B_ASYNC flag.  The B_ASYNC flag means that the caller of bwrite()
> (or rather bawrite()) is not synchronously blocking to see if the request
> has completed.  Instead, it is a "fire and forget".  This is not the same
> thing as the IO_SYNC flag passed in ioflags during a write request which
> requests fsync()-like behavior.  To disambiguate the two I added a new
> B_SYNC flag and changed the NFS clients to set this for write requests
> with IO_SYNC set.  I then updated the condition above to instead check for
> B_SYNC being set rather than checking for B_ASYNC being clear.
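A rough sketch of the before/after behaviour described above, as a user-space model. The flag values are made-up stand-ins rather than the kernel's definitions, and iomode_new() is only one plausible reading of "check for B_SYNC being set"; the actual condition is whatever the linked patch contains.

```c
/* Hypothetical stand-in values; the real flags are defined in the kernel headers. */
enum {
	B_ASYNC      = 0x01,
	B_NEEDCOMMIT = 0x02,
	B_NOCACHE    = 0x04,
	B_CLUSTER    = 0x08,
	B_SYNC       = 0x10	/* the new flag proposed in the message */
};

enum iomode { WRITE_UNSTABLE, WRITE_FILESYNC };

/* Original test: only a plain B_ASYNC buffer is written UNSTABLE. */
static enum iomode
iomode_old(unsigned flags)
{
	if ((flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == B_ASYNC)
		return (WRITE_UNSTABLE);
	return (WRITE_FILESYNC);
}

/*
 * Assumed revised test: FILESYNC when B_SYNC (or one of the other
 * three flags) is set, regardless of whether B_ASYNC is set.
 */
static enum iomode
iomode_new(unsigned flags)
{
	if (flags & (B_SYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER))
		return (WRITE_FILESYNC);
	return (WRITE_UNSTABLE);
}
```

A buffer pushed out via bwrite() with no flags set is FILESYNC under the old test and UNSTABLE under the assumed new one; a buffer carrying B_SYNC stays FILESYNC.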
> 
> That converted all the FILESYNC write RPCs from my builds into UNSTABLE
> write RPCs.  The patch for that is at
> http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch.
> 
> However, even with this change I was still not getting clustered writes on
> the NFS server (all writes were still 16k).  After digging around in the
> code for a bit I found that ffs will only cluster writes if the passed in
> 'ioflags' to ffs_write() specify a sequential hint.  I then noticed that
> the NFS server has code to keep track of sequential I/O heuristics for
> reads, but not writes.  I took the code from the NFS server's read op
> and moved it into a function to compute a sequential I/O heuristic that
> could be shared by both reads and writes.  I also updated the sequential
> heuristic code to advance the counter based on the number of 16k blocks
> in each write instead of just doing ++ to match what we do for local
> file writes in sequential_heuristic() in vfs_vnops.c.  Using this did
> give me some measure of NFS write clustering (though I can't peg my
> disks at MAXPHYS the way a dd to a file on a local filesystem can).  The
> patch for these changes is at
> http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch
> 
> (This also fixes a bug in the new NFS server in that it wasn't actually
> clustering reads since it never updated nh->nh_nextr.)
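The shared heuristic described above might look roughly like the following user-space sketch. It is not the code from the patch: struct seq_state, seq_update(), and the SEQ_MAX cap are hypothetical names chosen for illustration. It shows the two points the message makes: the counter advances by the number of 16k blocks in the request, and the expected next offset has to be updated on every request or nothing ever looks sequential.

```c
#include <stdint.h>

#define SEQ_BLOCK	16384	/* 16k unit used to advance the counter */
#define SEQ_MAX		127	/* hypothetical cap on the counter */

/* Hypothetical per-file state kept by the server. */
struct seq_state {
	uint64_t next_off;	/* offset a sequential request would start at */
	int	 seqcount;	/* current sequentiality estimate */
};

/* Update the estimate for a request of len bytes at offset off. */
static int
seq_update(struct seq_state *s, uint64_t off, uint64_t len)
{
	if ((off == 0 && s->seqcount > 0) || off == s->next_off) {
		/* Sequential: advance by the number of 16k blocks. */
		s->seqcount += (int)((len + SEQ_BLOCK - 1) / SEQ_BLOCK);
		if (s->seqcount > SEQ_MAX)
			s->seqcount = SEQ_MAX;
	} else {
		/* Not sequential: draw the estimate down quickly. */
		s->seqcount = (s->seqcount > 1) ? 1 : 0;
	}
	/* Always record where the next sequential request would begin. */
	s->next_off = off + len;
	return (s->seqcount);
}
```

Starting from a zeroed state, a 32k write at offset 0 yields a count of 2, a following 16k write at offset 32768 yields 3, and a small write back at offset 100 drops the count to 1.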
> 
> Combining the two changes together gave me about a 1% reduction in wall
> time for my builds:
> 
> +------------------------------------------------------------------------------+
> |+                   +     ++    + +x++*x  xx+x    x                          x|
> |                 |___________A__|_M_______|_A____________|                    |
> +------------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x  10       1869.62       1943.11       1881.89       1886.12     21.549724
> +  10       1809.71       1886.53       1869.26      1860.706     21.530664
> Difference at 95.0% confidence
>         -25.414 +/- 20.2391
>         -1.34742% +/- 1.07305%
>         (Student's t, pooled s = 21.5402)
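For anyone wanting to check the interval above by hand, the figures can be reproduced from the table, assuming the usual pooled two-sample t interval with 18 degrees of freedom and a critical value of about 2.101:

```latex
\begin{align*}
s_p &= \sqrt{\frac{21.549724^2 + 21.530664^2}{2}} \approx 21.5402 \\
\Delta &= 1860.706 - 1886.12 = -25.414 \\
\text{half-width} &= t_{0.975,\,18}\; s_p \sqrt{\tfrac{1}{10} + \tfrac{1}{10}}
  \approx 2.101 \times 21.5402 \times 0.44721 \approx 20.239 \\
\frac{\Delta}{\bar{x}} &= \frac{-25.414}{1886.12} \approx -1.347\%
\end{align*}
```

The interval does not include zero, so the improvement is significant at the 95% level, though the lower bound is only about 5 seconds.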
> 
> One caveat: I tested both of these patches on the old NFS client and server
> on 8.2-stable.  I then ported the changes to the new client and server and
> while I made sure they compiled, I have not tested the new client and server.
> 
> -- 
> John Baldwin