Fixes to allow write clustering of NFS writes from a FreeBSD NFS client

Rick Macklem rmacklem at uoguelph.ca
Fri Aug 26 01:26:34 UTC 2011


John Baldwin wrote:
> I was doing some analysis of compiles over NFS at work recently and noticed
> from 'iostat 1' on the NFS server that all my NFS writes were always 16k
> writes (meaning that writes were never being clustered). I added some
> debugging sysctls to the NFS client and server code as well as the FFS write
> VOP to figure out the various kinds of write requests that were being sent.
> I found that during the NFS compile, the NFS client was sending a lot of
> FILESYNC writes even though nothing in the compile process uses fsync().
> Based on the debugging I added, I found that all of the FILESYNC writes were
> marked as such because the buffer in question did not have B_ASYNC set:
> 
> 
>         if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER))
>             == B_ASYNC)
>                 iomode = NFSV3WRITE_UNSTABLE;
>         else
>                 iomode = NFSV3WRITE_FILESYNC;
> 
> I eventually tracked this down to the code in the NFS client that pushes
> out a previous dirty region via 'bwrite()' when a write would dirty a
> non-contiguous region in the buffer:
> 
>         if (bp->b_dirtyend > 0 &&
>             (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
>                 if (bwrite(bp) == EINTR) {
>                         error = EINTR;
>                         break;
>                 }
>                 goto again;
>         }
> 
Well, I'm not sure this bwrite() is required in the current buffer cache
implementation. I'm far from understanding the FreeBSD NFS buffer cache code
(I just cloned it for the new NFS client), but it seems to read the block in
before writing it out, unless the write covers a full block.
If that is the case, doing multiple non-contiguous writes to the buffer
should be fine, I think?

Way back in the late 1980s I implemented NFS buffer cache code that worked
this way for writing (a rough sketch of the logic follows the list):
- dirty a region of the buffer
- if another write that was contiguous with this region occurred, grow
  the dirty region
- else if a non-contiguous write occurred, write the dirty region out
  synchronously, and then write the new dirty region
- else if a read that was within the dirty region occurred, copy the
  data out of the buffer
- else if a read that wasn't within the dirty region occurred,
  - write the dirty region out
  - read in the entire block
  - copy the data from the block
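
To make that concrete, here is a minimal sketch in C of the old model's
write path. The structure and the helpers (write_region_sync(),
copy_into_buffer()) are made-up names for illustration, not the code that
actually existed back then:

        /* Simplified stand-in for struct buf; only the dirty region matters. */
        struct obuf {
                int     dirtyoff;       /* start of dirty region, -1 if clean */
                int     dirtyend;       /* end of dirty region */
        };

        /* Hypothetical helpers, not real kernel interfaces. */
        static void     write_region_sync(struct obuf *bp, int off, int end);
        static void     copy_into_buffer(struct obuf *bp, int on, int n);

        /*
         * Old-model write of n bytes at offset "on" within the block: grow
         * the dirty region if the new write touches it, otherwise push the
         * existing dirty region out synchronously before starting a new one.
         */
        static void
        old_model_write(struct obuf *bp, int on, int n)
        {
                if (bp->dirtyoff >= 0 &&
                    (on > bp->dirtyend || on + n < bp->dirtyoff)) {
                        /* Non-contiguous: write the current dirty region now. */
                        write_region_sync(bp, bp->dirtyoff, bp->dirtyend);
                        bp->dirtyoff = -1;
                }
                copy_into_buffer(bp, on, n);
                if (bp->dirtyoff < 0) {
                        bp->dirtyoff = on;
                        bp->dirtyend = on + n;
                } else {
                        /* Contiguous or overlapping: just grow the region. */
                        if (on < bp->dirtyoff)
                                bp->dirtyoff = on;
                        if (on + n > bp->dirtyend)
                                bp->dirtyend = on + n;
                }
        }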

Somewhere along the way, I believe this was converted to the Sun style
(which is more like a local file system) where:
- if a write is of a partial block
  - read the block in
  - modify the region of the block
  - mark the block dirty and start an asynchronous write of the block

With this model, all the data in the block is up to date, so it doesn't
really matter how many non-contiguous regions you modify, as long as you
write the entire block after the modifications.
(For simplicity, I haven't worried about EOF. Either model has to keep
 track of the file's size, "np->n_size", and deal with where EOF is.)
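
In code terms, the Sun-style model boils down to something like the sketch
below (again with made-up helper names, not the actual client code):

        /* Simplified block; "valid" means all of its data is up to date. */
        struct nblock {
                int     valid;
                int     dirty;
        };

        /* Hypothetical helpers for the sketch. */
        static void     read_block_from_server(struct nblock *blk);
        static void     copy_into_block(struct nblock *blk, int on, int n);
        static void     start_write_of_block(struct nblock *blk);

        /*
         * Sun-style partial-block write: make the whole block valid first,
         * then modify the region and write the entire block back, so it
         * never matters how many non-contiguous regions were modified.
         */
        static void
        sun_style_write(struct nblock *blk, int on, int n, int blocksize)
        {
                if (n < blocksize && !blk->valid) {
                        read_block_from_server(blk);    /* read-before-write */
                        blk->valid = 1;
                }
                copy_into_block(blk, on, n);
                blk->dirty = 1;
                start_write_of_block(blk);      /* asynchronous or delayed */
        }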

I have a "hunch" that the above code snippet is left over from my old
model. (Which worked quite well for the old Portable C compiler and linker
of the 4BSD VAX era.) I don't think any write is necessary at that place
any more. (I also don't think it should need to keep track of b_dirtyoff
and b_dirtyend.)

Another thing that would be nice is to allow reads from the block while it
is being written back. (Writes either need to be blocked or handled so that
the block is still known to be dirty and needs to be written again, but reading
the data during the write back should be fine.) I don't know, but I don't
think that can happen now, given the way the buffers are locked during
writing.

> (These writes are triggered during the compile of a file by the assembler
> seeking back into the file it has already written out to apply various
> fixups.)
> 
> From this I concluded that the test above is flawed. We should be using
> UNSTABLE writes for the writes above as the user has not requested them to
> be synchronous.
It was synchronous (in the sense that it would wait until the write RPC
was completed) in the old model, since the new dirty region replaced the
old one. (As noted above, I don't think a bwrite() of any kind is needed
there now.)

> The issue (I believe) is that the NFS client is overloading the B_ASYNC
> flag. The B_ASYNC flag means that the caller of bwrite() (or rather
> bawrite()) is not synchronously blocking to see if the request has
> completed. Instead, it is a "fire and forget". This is not the same thing
> as the IO_SYNC flag passed in ioflags during a write request which
> requests fsync()-like behavior. To disambiguate the two I added a new
> B_SYNC flag and changed the NFS clients to set this for write requests
> with IO_SYNC set. I then updated the condition above to instead check for
> B_SYNC being set rather than checking for B_ASYNC being clear.
> 
Yes, in the bad old days B_ASYNC meant start the write RPC but don't
wait for it to complete. I think what you have said above is correct,
in that setting FILESYNC should be based on the IO_SYNC flag only.
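
If I read your description correctly, the test ends up looking something
like this (just my reading of the patch description, with B_SYNC being the
new flag you mention; the actual patch may differ in detail):

        /*
         * FILESYNC is now driven by the new B_SYNC flag (set when the
         * write had IO_SYNC), not by B_ASYNC happening to be clear.
         */
        if ((bp->b_flags & (B_SYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == 0)
                iomode = NFSV3WRITE_UNSTABLE;
        else
                iomode = NFSV3WRITE_FILESYNC;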

> That converted all the FILESYNC write RPCs from my builds into UNSTABLE
> write RPCs. The patch for that is at
> http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch.
> 
> However, even with this change I was still not getting clustered writes on
> the NFS server (all writes were still 16k). After digging around in the
> code for a bit I found that ffs will only cluster writes if the passed in
> 'ioflags' to ffs_write() specify a sequential hint. I then noticed that
> the NFS server has code to keep track of sequential I/O heuristics for
> reads, but not writes. I took the code from the NFS server's read op and
> moved it into a function to compute a sequential I/O heuristic that could
> be shared by both reads and writes. I also updated the sequential
> heuristic code to advance the counter based on the number of 16k blocks in
> each write instead of just doing ++ to match what we do for local file
> writes in sequential_heuristic() in vfs_vnops.c. Using this did give me
> some measure of NFS write clustering (though I can't peg my disks at
> MAXPHYS the way a dd to a file on a local filesystem can). The patch for
> these changes is at
> http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch
> 
The above says you understand this stuff and I don't. However, I will note
that the asynchronous case, which starts the write RPC right away, makes
clustering difficult and limits what you can do. (I think it was done in the
bad old days to avoid flooding the buffer cache and then having things push
writes back to get buffers. These days the buffer cache can be much bigger,
and it's easy to create kernel threads to do write-backs at appropriate
times.) As such, I'd lean away from asynchronous (as in "start the write
now") and towards delayed writes.

If the writes are delayed via bdwrite(), then I think it is much easier
to find contiguous dirty buffers to do as one write RPC. However, if
you just do bdwrite()s, there tend to be big bursts of write RPCs when
the syncer does its thing, unless kernel threads are working through the
cache doing write-backs.

Since there are nfsiod threads, maybe these could scan for contiguous
dirty buffers and start big write RPCs for them? If there were some time
limit set for how long a buffer sits dirty before it gets a write started
for it, that would avoid a burst caused by the syncer (a rough sketch of
such a scan follows). Also, if you are lucky w.r.t. doing delayed writes
for temporary files, the file gets deleted before the write-back.
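
Very roughly, the scan I have in mind would look something like the sketch
below. All of the helpers (first_dirty_buf(), next_dirty_buf(),
buf_dirty_age(), start_big_write_rpc()) are made-up names to show the idea,
not existing kernel interfaces, and locking is ignored:

        static void
        nfsiod_flush_contiguous(struct vnode *vp, int max_age)
        {
                struct buf *bp, *first, *last;

                first = last = NULL;
                for (bp = first_dirty_buf(vp); bp != NULL;
                    bp = next_dirty_buf(bp)) {
                        if (buf_dirty_age(bp) < max_age)
                                continue;       /* not dirty long enough yet */
                        if (first == NULL) {
                                first = last = bp;
                        } else if (bp->b_lblkno == last->b_lblkno + 1) {
                                last = bp;      /* extend the contiguous run */
                        } else {
                                /* Run ended; push it as one large write RPC. */
                                start_big_write_rpc(vp, first, last);
                                first = last = bp;
                        }
                }
                if (first != NULL)
                        start_big_write_rpc(vp, first, last);
        }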

> (This also fixes a bug in the new NFS server in that it wasn't actually
> clustering reads since it never updated nh->nh_nextr.)
> 
Thanks, I have no idea what this is...

> Combining the two changes together gave me about a 1% reduction in wall
> time for my builds:
> [ministat distribution plot elided]
> 
>     N        Min        Max     Median        Avg     Stddev
> x  10    1869.62    1943.11    1881.89    1886.12  21.549724
> +  10    1809.71    1886.53    1869.26   1860.706  21.530664
> Difference at 95.0% confidence
>         -25.414 +/- 20.2391
>         -1.34742% +/- 1.07305%
>         (Student's t, pooled s = 21.5402)
> 
> One caveat: I tested both of these patches on the old NFS client and
> server on 8.2-stable. I then ported the changes to the new client and
> server and while I made sure they compiled, I have not tested the new
> client and server.
> 
> --
> John Baldwin
Good stuff, John. Sounds like you're having fun with it. I would like to
see it clustering using delayed writes and then doing the biggest write
RPCs the server will allow. (Solaris 10 allows 1Mbyte write RPCs. At this
point, FreeBSD is limited to MAXBSIZE, which is only 64K but hopefully
can be increased?)

rick


