Fixes to allow write clustering of NFS writes from a FreeBSD NFS client

John Baldwin jhb at freebsd.org
Thu Aug 25 17:47:47 UTC 2011


I was doing some analysis of compiles over NFS at work recently and noticed
from 'iostat 1' on the NFS server that the NFS writes were always 16k
(meaning that writes were never being clustered).  I added some debugging
sysctls to the NFS client and server code as well as the FFS write VOP to
figure out the various kinds of write requests that were being sent.  I
found that during a compile over NFS, the NFS client was sending a lot of
FILESYNC writes even though nothing in the compile process uses fsync().
Based on the debugging I added, I found that all of the FILESYNC writes were
marked as such because the buffer in question did not have B_ASYNC set:


		if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == B_ASYNC)
		    iomode = NFSV3WRITE_UNSTABLE;
		else
		    iomode = NFSV3WRITE_FILESYNC;
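
(To give an idea of the debugging sysctls mentioned above: simple counters of
each write type, exported read-only, are enough for this.  The snippet below
is only a sketch with made-up names, not the actual debugging patch:)

	/*
	 * Sketch only (names made up): per-type counters exported
	 * read-only under vfs.nfs.
	 */
	static int nfs_filesync_writes;
	static int nfs_unstable_writes;
	SYSCTL_INT(_vfs_nfs, OID_AUTO, filesync_writes, CTLFLAG_RD,
	    &nfs_filesync_writes, 0, "FILESYNC write RPCs sent");
	SYSCTL_INT(_vfs_nfs, OID_AUTO, unstable_writes, CTLFLAG_RD,
	    &nfs_unstable_writes, 0, "UNSTABLE write RPCs sent");

	/* ...and bumped right after iomode is chosen: */
	if (iomode == NFSV3WRITE_FILESYNC)
		nfs_filesync_writes++;
	else
		nfs_unstable_writes++;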

I eventually tracked this down to the code in the NFS client that pushes out a
previous dirty region via 'bwrite()' when a write would dirty a non-contiguous
region in the buffer:

		if (bp->b_dirtyend > 0 &&
		    (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
			if (bwrite(bp) == EINTR) {
				error = EINTR;
				break;
			}
			goto again;
		}

(During a compile, these writes are triggered by the assembler seeking back
into the output file it has already written in order to apply various
fixups.)

From this I concluded that the test above is flawed.  We should be using
UNSTABLE writes for the writes above as the user has not requested them to
be synchronous.  The issue (I believe) is that the NFS client is overloading
the B_ASYNC flag.  The B_ASYNC flag means that the caller of bwrite()
(or rather bawrite()) is not synchronously blocking to see if the request
has completed.  Instead, it is "fire and forget".  This is not the same
thing as the IO_SYNC flag passed in ioflags during a write request, which
requests fsync()-like behavior.  To disambiguate the two I added a new
B_SYNC flag and changed the NFS clients to set it for write requests
with IO_SYNC set.  I then updated the condition above to check for
B_SYNC being set rather than for B_ASYNC being clear.
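
In sketch form, the new check looks something like the following
(illustrative only; see the patch below for the actual change):

		/*
		 * Sketch only: use FILESYNC when the caller explicitly
		 * asked for synchronous behavior (IO_SYNC -> B_SYNC) or
		 * one of the other special cases applies; otherwise use
		 * UNSTABLE and let the commit happen later.
		 */
		if ((bp->b_flags & (B_SYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == 0)
		    iomode = NFSV3WRITE_UNSTABLE;
		else
		    iomode = NFSV3WRITE_FILESYNC;

with B_SYNC being set in the client's write path roughly as:

		if (ioflag & IO_SYNC)
			bp->b_flags |= B_SYNC;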

That converted all the FILESYNC write RPCs from my builds into UNSTABLE
write RPCs.  The patch for that is at
http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch.

However, even with this change I was still not getting clustered writes on
the NFS server (all writes were still 16k).  After digging around in the
code for a bit I found that ffs will only cluster writes if the 'ioflags'
passed to ffs_write() specify a sequential hint.  I then noticed that
the NFS server has code to keep track of a sequential I/O heuristic for
reads, but not for writes.  I took the code from the NFS server's read op
and moved it into a function that computes a sequential I/O heuristic
which can be shared by both reads and writes.  I also updated the
heuristic to advance the counter based on the number of 16k blocks in
each write instead of just incrementing it, to match what we do for
local file writes in sequential_heuristic() in vfs_vnops.c.  Using this did
give me some measure of NFS write clustering (though I can't peg my
disks at MAXPHYS the way a dd to a file on a local filesystem can).  The
patch for these changes is at
http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch.

(This also fixes a bug in the new NFS server: it wasn't actually clustering
reads since it never updated nh->nh_nextr.)
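
In rough form, the shared helper looks something like this (a sketch only;
the field and function names do not exactly match the patch):

	/*
	 * Illustrative sketch of the shared sequential I/O heuristic.
	 * 'nh' tracks the next expected offset and a sequential count;
	 * the return value is the hint fed into the ioflags passed to
	 * VOP_READ()/VOP_WRITE(), the same way sequential_heuristic()
	 * in vfs_vnops.c does it for local files.
	 */
	static int
	nfsrv_seqheuristic(struct nfsheur *nh, off_t offset, int count)
	{
		if (nh->nh_nextoff == offset) {
			/* Sequential: advance by the number of 16k blocks. */
			nh->nh_seqcount += howmany(count, 16384);
			if (nh->nh_seqcount > IO_SEQMAX)
				nh->nh_seqcount = IO_SEQMAX;
		} else {
			/* Not sequential: decay the hint. */
			nh->nh_seqcount /= 2;
		}
		nh->nh_nextoff = offset + count;
		return (nh->nh_seqcount << IO_SEQSHIFT);
	}

The read and write ops then just OR the returned hint into the ioflags they
pass down to VOP_READ() / VOP_WRITE().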

Combining the two changes gave me about a 1% reduction in wall
time for my builds:

+------------------------------------------------------------------------------+
|+                   +     ++    + +x++*x  xx+x    x                          x|
|                 |___________A__|_M_______|_A____________|                    |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10       1869.62       1943.11       1881.89       1886.12     21.549724
+  10       1809.71       1886.53       1869.26      1860.706     21.530664
Difference at 95.0% confidence
        -25.414 +/- 20.2391
        -1.34742% +/- 1.07305%
        (Student's t, pooled s = 21.5402)

One caveat: I tested both of these patches with the old NFS client and server
on 8.2-stable.  I then ported the changes to the new client and server; while
I made sure they compile, I have not tested them.

-- 
John Baldwin

