Fixes to allow write clustering of NFS writes from a FreeBSD NFS client

Fri Aug 26 17:45:52 UTC 2011

On Thu, 25 Aug 2011, John Baldwin wrote:

> On Thursday, August 25, 2011 3:24:15 pm Bruce Evans wrote:
>> On Thu, 25 Aug 2011, John Baldwin wrote:
>>
>>> I was doing some analysis of compiles over NFS at work recently and noticed
>>> from 'iostat 1' on the NFS server that all my NFS writes were always 16k
>>> writes (meaning that writes were never being clustered).  I added some
>>
>> Did you see the old patches for this by Bjorn Gronwall?  They went through
>> many iterations.  He was mainly interested in the !async case and I was
>> mainly interested in the async case...
>
> Ah, no I had not seen these, thanks.

I looked at your patches after writing the above.  They look very similar,
but the details are intricate.  Unfortunately I forget most of the details.

I reran some simple benchmarks (just iozone on a very old (~5.2) nfs
client with various mount options, with netstat and systat to watch the
resulting i/o on the server) on 3 different servers (~5.2 with Bjorn's
patches, 8-current-2008 with Bjorn's patches, and -current-2011-March).
The old client has many throughput problems, but strangely most of
them are fixed by changing the server.

>>> and moved it into a function to compute a sequential I/O heuristic that
>>> could be shared by both reads and writes.  I also updated the sequential
>>> heuristic code to advance the counter based on the number of 16k blocks
>>> in each write instead of just doing ++ to match what we do for local
>>> file writes in sequential_heuristic() in vfs_vnops.c.  Using this did
>>> give me some measure of NFS write clustering (though I can't peg my
>>> disks at MAXPHYS the way a dd to a file on a local filesystem can).  The
>>
>> I got close to it.  The failure modes were mostly burstiness of i/o, where
>> the server buffer cache seemed to fill up so the client would stop sending
>> and stay stopped for too long (several seconds; enough to reduce the
>> throughput by 40-60%).
>
> Hmm, I can get writes up to around 40-50k, but not 128k.  My test is to just
> dd from /dev/zero to a file on the NFS client using a blocksize of 64k or so.

I get mostly over 60K with old ata drivers that have a limit of 64K and
mostly over 128K with not so old ata drivers that have a limit of 128K.
This is almost independent of the nfs client and server versions and
mount options.

I mostly tested async mounts, and mostly with an i/o size of just 512 for
iozone (old-iozone 1024 512).  It actually helps a little to have a
minimal i/o size at the syscall level (to minimize latency at other
levels; depends on CPU keeping up and kernel reblocking to better sizes).

Throughputs with client defaults (-U,-r8192(?),-w8192(?), async,noatime)
in 1e9 bytes were approximately:

                  write  read
     local disk:    48    53
     5.2 server:    46    39     some bug usually makes the read direction slow
     8   server:    46    39
     cur server:    32    50+(?) writes 2/3 as fast due to not having patches
                                 but reads fixed (may also require tcp)

Async on the server makes little difference.  Contrary to what I said before,
async on the client makes a big difference (it controls FILESYNC in a
critical place).  Now with noasync on the client:

     5.2 server:    15
     8   server:    similar
     cur server:    similar (worse, but not nearly 3/2 slower IIRC)

There are just too many sync writes without async.  But this is apparently
mostly due to the default udp r/w sizes being too small, since tcp does
much better, I think only due to its larger r/w sizes (I mostly don't
use it because it has worse latency and more bugs in old nfs clients).

Now with noasync,-T[-r32768(?),-w32768(?)] on the client:

     5.2 server:    34    37
     8   server:    40+   (?)
     cur server:    not tested

The improvement is much larger for 8-server than for 5.2-server.  That
might be due to better tcp support, but I fear it is because 8-server
is missing my fixes for ffs_update().  (The server file system was
always ffs mounted async.  Long ago, I got dyson to make fsync sort
of work even when the file system is mounted async.  VOP_FSYNC() writes
data but not directory entries or inodes, except in my version it
writes inodes.  But actually writing the inode for every nfs FILESYNC
probably doubles the number of i/o's.  This is ameliorated as usual
by a large i/o size at all levels, and by the disk lying about actually
writing the data so that doubling the number of writes doesn't give a
full 2 times slowdown (I use old low end ATA disks with write caching
enabled).)

Now with async,-T[-r32768(?),-w32768(?)] on the client:

     5.2 server:    37    40     example of tcp not working well with 5.2
     8   server:    not carefully tested (similar to -U)
     cur server:    not carefully tested (similar to -U)

In other tests, toggling tcp/udp and changing the block sizes makes
hard to explain but not very important differences.  It only magically
fixes the case of an async client.  My LAN uses a cheap switch but works
almost perfectly for nfs over udp.

I now remember that Bjorn was most interested in improving clustering
for the noasync case.  Clustering should happen almost automatically
for the async case.  Then lots of async writes should accumulate on
the server and be written by a large cluster write.  Any clustering
at the nfs level would just get in the way.  For the noasync case,
FILESYNC will get in the way whenever it happens and it happens a
lot, so I'm not sure how the server has much opportunity for clustering.

>>> patch for these changes is at
>>> http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch
>>>
>>> (This also fixes a bug in the new NFS server in that it wasn't actually
>>> clustering reads since it never updated nh->nh_nextr.)

I'm still looking for the bug that makes reads slower.  It doesn't seem
to be clustering.

>> Here is the version of Bjorn's patches that I last used (in 8-current in
>> 2008):
>>
>> % Index: nfs_serv.c
>> % ===================================================================
>> % RCS file: /home/ncvs/src/sys/nfsserver/nfs_serv.c,v
>> % retrieving revision 1.182
>> % diff -u -2 -r1.182 nfs_serv.c
>> % --- nfs_serv.c	28 May 2008 16:23:17 -0000	1.182
>> % +++ nfs_serv.c	1 Jun 2008 05:52:45 -0000
>> ...
>> % +	/*
>> % +	 * Locate best nfsheur[] candidate using double hashing.
>> % +	 */
>> % +
>> % +	hi =   NH_TAG(vp) % NUM_HEURISTIC;
>> % +	step = NH_TAG(vp) & HASH_MAXSTEP;
>> % +	step++;			/* Step must not be zero. */
>> % +	nh = &nfsheur[hi];
>
> I can't speak to whether using a variable step makes an appreciable
> difference.  I have not examined that in detail in my tests.

Generally, only small differences can be made by tuning hash methods.
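
For reference, a minimal sketch of what a double-hashed probe sequence
looks like.  The quoted hunk only shows the start slot and the step, so
the way later probes are derived here is my assumption, and NUM_SLOTS,
MAX_STEP and probe_slot() are made-up names and sizes, not the patch's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sizes only, not the values used by the patch. */
#define	NUM_SLOTS	7u	/* prime, so any step visits every slot */
#define	MAX_STEP	3u	/* mask for the step; step is in 1..4 */

/*
 * Return the slot examined on the given probe (0 = first try) for a tag.
 * Both the start slot and the step are derived from the tag, so two tags
 * that collide on the first slot usually diverge on later probes.
 */
static size_t
probe_slot(uintptr_t tag, unsigned probe)
{
	size_t hi = tag % NUM_SLOTS;
	size_t step = (tag & MAX_STEP) + 1;	/* step must not be zero */

	return ((hi + (size_t)probe * step) % NUM_SLOTS);
}
```

With these sizes, tags 8 and 15 both start at slot 1 but then go to
slots 2 and 5 respectively, which is all that the variable step buys.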

>> % +	/*
>> % +	 * Calculate heuristic
>> % +	 */
>> % +
>> % +	lblocksize = vp->v_mount->mnt_stat.f_iosize;
>> % +	nblocks = howmany(uio->uio_resid, lblocksize);
>
> This is similar to what I pulled out of sequential_heuristic() except
> that it doesn't hardcode 16k.  There is a big comment above the 16k
> that says it isn't about the blocksize though, so I'm not sure which is
> most correct.  I imagine we'd want to use the same strategy in both places
> though.  Comment from vfs_vnops.c:
>
> 		/*
> 		 * f_seqcount is in units of fixed-size blocks so that it
> 		 * depends mainly on the amount of sequential I/O and not
> 		 * much on the number of sequential I/O's.  The fixed size
> 		 * of 16384 is hard-coded here since it is (not quite) just
> 		 * a magic size that works well here.  This size is more
> 		 * closely related to the best I/O size for real disks than
> 		 * to any block size used by software.
> 		 */
> 		fp->f_seqcount += howmany(uio->uio_resid, 16384);

Probably this doesn't matter.  The above code in vfs_vnops.c is mostly
by me.  I think it is newer than the code in nfs_serv.c (strictly older,
but nfs_serv.c has not caught up with it).  I played a bit more with this
in nfs_serv.c, to see if this should be different in nfs.  In my local
version, lblocksize can be set by a sysctl.  But I only used this sysctl
for testing, and don't remember it making any interesting differences.
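
To make the comparison concrete, here is a small userland sketch of the
block counting.  HOWMANY is the same rounding-up division as howmany() in
<sys/param.h>; seq_blocks() and SEQ_BLOCKSIZE are names invented for this
example:

```c
#include <assert.h>
#include <stddef.h>

/* Same rounding-up division as howmany() in <sys/param.h>. */
#define	HOWMANY(x, y)	(((x) + ((y) - 1)) / (y))

#define	SEQ_BLOCKSIZE	16384	/* fixed size hard-coded in vfs_vnops.c */

/* Number of heuristic blocks credited for one I/O of resid bytes. */
static size_t
seq_blocks(size_t resid, size_t blocksize)
{
	assert(blocksize > 0);
	return (HOWMANY(resid, blocksize));
}
```

A 64K write is credited 4 blocks with the fixed 16384 and 2 blocks with
a 32K f_iosize, while anything up to one block (including a 512-byte
write) is credited 1 either way.  So the two choices only differ by a
constant factor in how fast the count ramps up for large writes.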

>> % +	if (uio->uio_offset == nh->nh_nextoff) {
>> % +		nh->nh_seqcount += nblocks;
>> % +		if (nh->nh_seqcount > IO_SEQMAX)
>> % +			nh->nh_seqcount = IO_SEQMAX;
>> % +	} else if (uio->uio_offset == 0) {
>> % +		/* Seek to beginning of file, ignored. */
>> % +	} else if (qabs(uio->uio_offset - nh->nh_nextoff) <=
>> % +		   MAX_REORDERED_RPC*imax(lblocksize, uio->uio_resid)) {
>> % +		nfsrv_reordered_io++; /* Probably reordered RPC, do nothing. */
>
> Ah, this is a nice touch!  I had noticed reordered I/O's resetting my
> clustered I/O count.  I should try this extra step.

Stats after a few GB of i/o:

% vfs.nfsrv.commit_blks: 138037
% vfs.nfsrv.commit_miss: 2844
% vfs.nfsrv.reordered_io: 5170
% vfs.nfsrv.realign_test: 492003
% vfs.nfsrv.realign_count: 0

There were only a few reorderings.  In old testing, I seemed to get best
results by turning the number of nfsd's down to 1.  I don't use this in
production.  I turn the number of nfsiod's down to 4 in production.

>> % +	} else
>> % +		nh->nh_seqcount /= 2; /* Not sequential access. */
>
> Hmm, this is a bit different as well.  sequential_heuristic() just
> drops all clustering (seqcount = 1) here so I had followed that.  I do
> wonder if this change would be good for "normal" I/O as well?  (Again,
> I think it would do well to have "normal" I/O and NFS generally use
> the same algorithm, but perhaps with the extra logic to handle reordered
> writes more gracefully for NFS.)

I don't know much about this.

>> % +
>> % +	nh->nh_nextoff = uio->uio_offset + uio->uio_resid;
>
> Interesting.  So this assumes the I/O never fails.

Not too good.  Some places like ffs_write() back out of failing i/o's,
but I think they reduce uio_offset before the corresponding code for
the non-nfs heuristic in vn_read/write() advances f_nextoff.
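
Putting the quoted pieces together, the whole update can be sketched in
userland as below.  This is only my reading of the hunks above, not the
patch itself: struct heur, heur_update() and the two constants are
invented for the example, and the values of SEQ_MAX and MAX_REORDERED
are assumptions standing in for IO_SEQMAX and MAX_REORDERED_RPC.

```c
#include <assert.h>
#include <stdint.h>

#define	SEQ_MAX		127	/* stand-in for IO_SEQMAX; value assumed */
#define	MAX_REORDERED	4	/* stand-in for MAX_REORDERED_RPC; value assumed */

struct heur {
	int64_t	nextoff;	/* where the next sequential I/O would start */
	int	seqcount;	/* sequentiality estimate, in blocks */
	int	reordered;	/* I/Os classified as reordered RPCs */
};

/* Update the heuristic for one I/O of resid bytes at offset. */
static void
heur_update(struct heur *h, int64_t offset, int64_t resid, int64_t blocksize)
{
	int64_t dist, nblocks, window;

	assert(blocksize > 0 && resid >= 0);
	nblocks = (resid + blocksize - 1) / blocksize;
	if (nblocks > SEQ_MAX)
		nblocks = SEQ_MAX;	/* also keeps the int addition safe */
	dist = offset > h->nextoff ? offset - h->nextoff : h->nextoff - offset;
	window = MAX_REORDERED * (resid > blocksize ? resid : blocksize);

	if (offset == h->nextoff) {
		h->seqcount += (int)nblocks;
		if (h->seqcount > SEQ_MAX)
			h->seqcount = SEQ_MAX;
	} else if (offset == 0) {
		/* Seek to beginning of file, ignored. */
	} else if (dist <= window) {
		h->reordered++;		/* probably a reordered RPC */
	} else {
		h->seqcount /= 2;	/* not sequential access */
	}

	/* Advanced unconditionally, i.e. assuming the I/O succeeds. */
	h->nextoff = offset + resid;
}
```

Note that nextoff is moved even for an I/O classified as reordered, so
after one out-of-order RPC the next few in-order ones also land in the
"reordered" branch until the offsets line up again.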

>> % @@ -1225,4 +1251,5 @@
>> %  	vn_finished_write(mntp);
>> %  	VFS_UNLOCK_GIANT(vfslocked);
>> % +	bwillwrite();	    /* After VOP_WRITE to avoid reordering. */
>> %  	return(error);
>> %  }
>
> Hmm, this seems to be related to avoiding overloading the NFS server's
> buffer cache?

Just to avoid spurious reordering I think.

Is this all still Giant locked?  Giant might either reduce or increase
interference between nfsd's, depending on the timing.

>> ...
>> % Index: nfs_syscalls.c
>> % ===================================================================
>> % RCS file: /home/ncvs/src/sys/nfsserver/Attic/nfs_syscalls.c,v
>> % retrieving revision 1.119
>> % diff -u -2 -r1.119 nfs_syscalls.c
>> % --- nfs_syscalls.c	30 Jun 2008 20:43:06 -0000	1.119
>> % +++ nfs_syscalls.c	2 Jul 2008 07:12:57 -0000
>> % @@ -86,5 +86,4 @@
>> %  int		nfsd_waiting = 0;
>> %  int		nfsrv_numnfsd = 0;
>> % -static int	notstarted = 1;
>> %
>> %  static int	nfs_privport = 0;
>> % @@ -448,7 +447,6 @@
>> %  			    procrastinate = nfsrvw_procrastinate;
>> %  			NFSD_UNLOCK();
>> % -			if (writes_todo || (!(nd->nd_flag & ND_NFSV3) &&
>> % -			    nd->nd_procnum == NFSPROC_WRITE &&
>> % -			    procrastinate > 0 && !notstarted))
>> % +			if (writes_todo || (nd->nd_procnum == NFSPROC_WRITE &&
>> % +			    procrastinate > 0))
>> %  			    error = nfsrv_writegather(&nd, slp,
>> %  				nfsd->nfsd_td, &mreq);
>
> This no longer seems to be present in 8.

nfs_syscalls.c seems to have been replaced by nfs_srvkrpc.c.  All history
has been lost (obscured), but the code is quite different so a repo-copy
wouldn't have worked much better.  This created lots of garbage if not
bugs:
- the nfsrv.gatherdelay and nfsrv.gatherdelay_v3 sysctls are now in
   nfs_srvkrpc.c.  They were already hard to associate with any effects,
   since their variable names don't match their sysctl names.  The
   variables are named nfsrv_procrastinate and nfsrv_procrastinate_v3.
- the *procrastinate* global variables are still declared in nfs.h and
   initialized to defaults in nfs_serv.c, but are no longer really used.
- the local variable `procrastinate' and the above code to use it no
   longer exist.
- the macro for the default for the non-v3 sysctl, NFS_GATHERDELAY, is
   still defined in nfs.h, but is only used in the dead initialization.
- the new nfs server doesn't have any gatherdelay or procrastinate
   symbols.

Bjorn said that gatherdelay_v3 didn't work, and tried to fix it.  The
above is the final result that I have.  I now remember trying this.
Bjorn hoped that a nonzero gatherdelay would reduce reordering, but
in practice it just reduces performance by waiting too long.  Its
default of 10 msec may have worked with 1 Mbps ethernet, but can't
possibly scale to 1 Gbps.  ISTR that the value had to be very small,
perhaps 100 usec, for the delay not to be too large, but when it is
that small it has problems having any effects except to waste CPU
in a different way than delaying.
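
The scaling problem is just arithmetic.  A rough sketch, using raw line
rate and ignoring all protocol overhead (bytes_during_delay() is a name
made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound on the bytes a link can deliver while the server sits in
 * a gather delay: rate in bits/sec, delay in microseconds.
 */
static uint64_t
bytes_during_delay(uint64_t link_bps, uint64_t delay_usec)
{
	return (link_bps / 8 * delay_usec / 1000000);
}
```

At 1 Mbps a 10 msec delay covers at most 1250 bytes, which is less than
one 8K write.  At 1 Gbps the same delay covers up to 1250000 bytes, so
the server waits while more than a hundred 8K writes could have arrived.
A 100 usec delay at 1 Gbps covers 12500 bytes.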

> One thing I had done was to use a separate set of heuristics for reading vs
> writing.  However, that is possibly dubious (and we don't do it for local
> I/O), so I can easily drop that feature if desired.

I think it is unlikely to make much difference.  The heuristic always
has to cover a very wide range of access patterns.

Bruce