Fixes to allow write clustering of NFS writes from a FreeBSD NFS client

Fri Aug 26 12:19:57 UTC 2011

John Baldwin wrote:
> I was doing some analysis of compiles over NFS at work recently and
> noticed
> from 'iostat 1' on the NFS server that all my NFS writes were always
> 16k
> writes (meaning that writes were never being clustered). I added some
> debugging sysctls to the NFS client and server code as well as the FFS
> write
> VOP to figure out the various kind of write requests that were being
> sent. I
> found that during the NFS compile, the NFS client was sending a lot of
> FILESYNC writes even though nothing in the compile process uses
> fsync().
> Based on the debugging I added, I found that all of the FILESYNC
> writes were
> marked as such because the buffer in question did not have B_ASYNC
> set:
> 
> 
> if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER))
> == B_ASYNC)
> iomode = NFSV3WRITE_UNSTABLE;
> else
> iomode = NFSV3WRITE_FILESYNC;
> 
> I eventually tracked this down to the code in the NFS client that
> pushes out a
> previous dirty region via 'bwrite()' when a write would dirty a
> non-contiguous
> region in the buffer:
> 
> if (bp->b_dirtyend > 0 &&
> (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
> if (bwrite(bp) == EINTR) {
> error = EINTR;
> break;
> }
> goto again;
> }
> 
Btw, the code was correct to use FILESYNC for this case.
Why? Well, if the b_dirtyoff, b_dirtyend are used by the "bottom half"
for the write/commit RPCs, the client won't know to re-write/commit
the range specified by b_dirtyoff/b_dirtyend after the range changes.
(ie. If the server crashes/reboots between the UNSTABLE write and the
 commit, the change will get lost.)

However, if you calculate the off, len arguments for the Commit RPC
to cover the entire block and not just b_dirtyoff->b_dirtyend, then
doing the write UNSTABLE should be fine. (Having the range larger than
the what was written should be ok. In fact the FreeBSD server ignores
the arguments and commits the entire file via VOP_FSYNC().)

There is this comment before the bwrite(), which I don't quite understand:
		/*
		 * If the new write will leave a contiguous dirty
		 * area, just update the b_dirtyoff and b_dirtyend,
		 * otherwise force a write rpc of the old dirty area.
		 *
		 * While it is possible to merge discontiguous writes due to
		 * our having a B_CACHE buffer ( and thus valid read data
		 * for the hole), we don't because it could lead to
		 * significant cache coherency problems with multiple clients,
		 * especially if locking is implemented later on.
		 *
I'm not sure what the author was referring to here, but the only case I can
think of is:
    If two clients are concurrently writing to the same file and have
    non-overlapping byte ranges within the same block write locked, they
    will assume that the other client won't write data into the block they
    have locked. For this case, writing the specific range of b_dirtyoff->b_dirtyend
    instead of the entire block. For this case, the following does seem to
    apply. (Btw, for this to work correctly, the client must write any dirty
    regions back to the server + re-read the block from the server whenever
    any byte within the block is write locked or it won't see the changes done
    by the other client. I'm not sure if the FreeBSD client does this?)
		 * as an optimization we could theoretically maintain
		 * a linked list of discontinuous areas, but we would still
		 * have to commit them separately so there isn't much
		 * advantage to it except perhaps a bit of asynchronization.
		 */
    I think keeping a list of discontinuous regions would be useful. It would
    allow the bwrite() below this comment to be deleted and, if the client is
    lucky, the discontinuous regions would get filled in by another write and
    become an entire block before a write-back was needed.

I think the correct first step is to finish off what you've already done by
changing the range calculation for the Commit RPC so that it uses the entire
block and not just b_dirtyoff->b_dirtyend. That way it is safe to do this write
UNSTABLE.

Then, I think it is worth doing a list of dirty byte ranges and hope that they
fill in to a full block, which is the only time the block can be clustered.
(Actually a byte range that starts at the beginning of the block can be clustered
 with previous blocks and a byte range that goes to the end of the block can
 be clustered with a subsequent block.)
And, as I meant to say before (but I'm not sure if it was clear), I don't think
the clustering will work until the "start the write immediately" is changed
to "mark the block dirty and write it sometime soon".
> (These writes are triggered during the compile of a file by the
> assembler
> seeking back into the file it has already written out to apply various
> fixups.)
> 
> From this I concluded that the test above is flawed. We should be
> using
> UNSTABLE writes for the writes above as the user has not requested
> them to
> be synchronous. The issue (I believe) is that the NFS client is
> overloading
> the B_ASYNC flag. The B_ASYNC flag means that the caller of bwrite()
> (or rather bawrite()) is not synchronously blocking to see if the
> request
> has completed. Instead, it is a "fire and forget". This is not the
> same
> thing as the IO_SYNC flag passed in ioflags during a write request
> which
> requests fsync()-like behavior.
It happens that, for this particular case, you need to do both a
bwrite() that waits for completion and FILESYNC. (Maybe that was why
the author had the code the way it is?)

However, as noted above, the change to UNSTABLE only requires that the
range calculation for commit cover the entire block instead of
b_dirtyoff->b_dirtyend.
> To disambiguate the two I added a new
> B_SYNC flag and changed the NFS clients to set this for write requests
> with IO_SYNC set. I then updated the condition above to instead check
> for
> B_SYNC being set rather than checking for B_ASYNC being clear.
> 
Yes, sounds good to me. (Overloaded flags are almost never a great
plan:-)

> That converted all the FILESYNC write RPCs from my builds into
> UNSTABLE
> write RPCs. The patch for that is at
> http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch.
> 
> However, even with this change I was still not getting clustered
> writes on
> the NFS server (all writes were still 16k). After digging around in
> the
> code for a bit I found that ffs will only cluster writes if the passed
> in
> 'ioflags' to ffs_write() specify a sequential hint. I then noticed
> that
> the NFS server has code to keep track of sequential I/O heuristics for
> reads, but not writes. I took the code from the NFS server's read op
> and moved it into a function to compute a sequential I/O heuristic
> that
> could be shared by both reads and writes. I also updated the
> sequential
> heuristic code to advance the counter based on the number of 16k
> blocks
> in each write instead of just doing ++ to match what we do for local
> file writes in sequential_heuristic() in vfs_vnops.c. Using this did
> give me some measure of NFS write clustering (though I can't peg my
> disks at MAXPHYS the way a dd to a file on a local filesystem can).
> The
> patch for these changes is at
> http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch
> 
> (This also fixes a bug in the new NFS server in that it wasn't
> actually
> clustering reads since it never updated nh->nh_nextr.)
> 
> Combining the two changes together gave me about a 1% reduction in
> wall
> time for my builds:
> 
> +------------------------------------------------------------------------------+
> |+ + ++ + +x++*x xx+x x x|
> |                 |___________A__|_M_______|_A____________| |
> +------------------------------------------------------------------------------+
> N Min Max Median Avg Stddev
> x 10 1869.62 1943.11 1881.89 1886.12 21.549724
> + 10 1809.71 1886.53 1869.26 1860.706 21.530664
> Difference at 95.0% confidence
> -25.414 +/- 20.2391
> -1.34742% +/- 1.07305%
> (Student's t, pooled s = 21.5402)
> 
> One caveat: I tested both of these patches on the old NFS client and
> server
> on 8.2-stable. I then ported the changes to the new client and server
> and
> while I made sure they compiled, I have not tested the new client and
> server.
> 
> --
> John Baldwin