silly write caching in nfs3

Bruce Evans brde at
Sat Feb 27 04:21:16 UTC 2016

On Fri, 26 Feb 2016, Bruce Evans wrote:

> nfs3 is slower than in old versions of FreeBSD.  I debugged one of the
> reasons today.
> ...
> oldnfs was fixed many years ago to use timestamps with nanoseconds
> resolution, but it doesn't suffer from the discarding in nfs_open()
> in the !NMODIFIED case, which is reached by either fsync() before close
> or commit on close.  I think this is because it updates n_mtime to
> the server's new timestamp in nfs_writerpc().  This seems to be wrong,
> since the file might have been written to by other clients and then
> the change would not be noticed until much later if ever (setting the
> timestamp prevents seeing it change when it is checked later, but you
> might be able to see another metadata change).
> newnfs has quite different code for nfs_writerpc().  Most of it was
> moved to another function in another file.  I understand this even
> less, but it doesn't seem to fetch the server's new timestamp or
> update n_mtime in the v3 case.

This quick fix seems to give the same behaviour as in oldnfs.  It also
fixes some bugs in comments in nfs_fsync() (where I tried to pass a
non-null cred, but none is available; the ARGSUSED bug is in many
other functions):

X Index: nfs_clvnops.c
X ===================================================================
X --- nfs_clvnops.c	(revision 296089)
X +++ nfs_clvnops.c	(working copy)
X @@ -1425,6 +1425,23 @@
X  	}
X  	if (DOINGASYNC(vp))
X  		*iomode = NFSWRITE_FILESYNC;
X +	if (error == 0 && NFS_ISV3(vp)) {
X +		/*
X +		 * Break seeing concurrent changes by other clients,
X +		 * since without this the next nfs_open() would
X +		 * invalidate our write buffers.  This is worse than
X +		 * useless unless the write is committed on close or
X +		 * fsynced, since otherwise NMODIFIED remains set so
X +		 * the next nfs_open() will still invalidate the write
X +		 * buffers.  Unfortunately, this cannot be placed in
X +		 * ncl_flush() where NMODIFIED is cleared since
X +		 * credentials are unavailable there for at least
X +		 * calls by nfs_fsync().
X +		 */
X +		mtx_lock(&(VTONFS(vp))->n_mtx);
X +		VTONFS(vp)->n_mtime = nfsva.na_mtime;
X +		mtx_unlock(&(VTONFS(vp))->n_mtx);
X +	}
X  	if (error && NFS_ISV4(vp))
X  		error = nfscl_maperr(uiop->uio_td, error, (uid_t)0, (gid_t)0);
X  	return (error);
X @@ -2613,9 +2630,8 @@
X  }
X  /*
X - * fsync vnode op. Just call ncl_flush() with commit == 1.
X + * fsync vnode op.
X   */
X  static int
X  nfs_fsync(struct vop_fsync_args *ap)
X  {
X @@ -2622,8 +2638,12 @@
X  	if (ap->a_vp->v_type != VREG) {
X  		/*
X +		 * XXX: this comment is misformatted (after fixing its
X +		 * internal errors) and misplaced.
X +		 *
X  		 * For NFS, metadata is changed synchronously on the server,
X -		 * so there is nothing to flush. Also, ncl_flush() clears
X +		 * so the only thing to flush is data for regular files.
X +		 * Also, ncl_flush() clears
X  		 * the NMODIFIED flag and that shouldn't be done here for
X  		 * directories.
X  		 */

> There are many other reasons why nfs is slower than in old versions.
> One is that writes are more often done out of order.  This tends to
> give a slowness factor of about 2 unless the server can fix up the
> order.  I use an old server which can do the fixup for old clients but
> not for newer clients starting in about FreeBSD-9 (or 7?).  I suspect
> that this is just because Giant locking in old clients gave accidental
> serialization.  Multiple nfsiod's and/or nfsd's are clearly needed
> for performance if you have multiple NICs serving multiple mounts.
> Other cases are less clear.  For the iozone benchmark, there is only
> 1 stream and multiple nfsiod's pessimize it into multiple streams that
> give buffers which arrive out of order on the server if the multiple
> nfsiod's are actually active.  I use the following configuration to
> ameliorate this, but the slowness factor is still often about 2 for
> iozone:
> - limit nfsd's to 4
> - limit nfsiod's to 4
> - limit nfs i/o sizes to 8K.  The server fs block size is 16K, and
>  using a smaller block size usually helps by giving some delayed
>  writes which can be clustered better.  (The non-nfs parts of the
>  server could be smarter and do this intentionally.  The out-of-order
>  buffers look like random writes to the server.)  16K i/o sizes
>  otherwise work OK, but 32K i/o sizes are much slower for unknown
>  reasons.

Size 16K seems to work better now.
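Concretely, that tuning might look something like the following on the
client and server.  The sysctl and flag names here are assumptions and
vary by FreeBSD version; treat this as a sketch, not exact commands:

```shell
# Hedged sketch of the tuning described above; knob names vary by version.
sysctl vfs.nfs.iodmax=4                  # cap the client nfsiod threads
# server side, in /etc/rc.conf:  nfs_server_flags="-u -t -n 4"
mount -t nfs -o nfsv3,rsize=8192,wsize=8192 server:/export /mnt
```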

I also use:

- turn off most interrupt moderation.  This reduces (ping) latency from
   ~125 usec to ~75 usec for em on PCIe (after already turning off interrupt
   moderation on the server to reduce it from 150-200 usec).  75 usec
   is still a lot, though it is about 3 times lower than the default
   misconfiguration.  Downgrading to older lem on PCI/33 reduces it to
   52 usec.  Downgrading to DEVICE_POLLING reduces it to about 40 usec.
   The downgrades are upgrades :-(.  Not using a switch reduces it by
   about another 20 usec.

   Low latency is important for small i/o's.  I was surprised that it also
   helps a lot for large i/o's.  Apparently it changes the timing enough
   to reduce the out-of-order buffers significantly.
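For reference, turning off em(4) interrupt moderation might look like
this.  The knob names are assumptions; they have moved around between
driver versions, so check what your em(4) actually exposes:

```shell
# Hedged sketch; the itr/moderation knobs differ across em(4) versions.
sysctl dev.em.0.itr=0       # disable interrupt throttling, if this sysctl exists
# some versions use a loader tunable instead, e.g. in /boot/loader.conf
# (name assumed):  hw.em.max_interrupt_rate=0
```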

The default misconfiguration with 20 nfsiod's is worse than I expected
(on an 8 core system).  For (old) "iozone auto" which starts with a file
size of 1MB, the write speed is about 2MB/sec with 20 nfsiod's and
22 MB/sec with 1 nfsiod.  2-4 nfsiod's work best.  They give 30-40MB/sec
for most file sizes.  Apparently, with 20 nfsiod's the write of 1MB is
split up into almost twenty pieces of 50K each (6 or 7 8K buffers each),
and the final order is perhaps even worse than random.  I think it is
basically sequential with about <number of nfsiods> seeks for all file
sizes between 1MB and many MB.
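The splitting arithmetic above can be checked directly; this quick
sketch uses only the numbers quoted in this message:

```python
# Check the claim: a 1MB write split across 20 nfsiod's gives pieces of
# about 50K each, i.e. 6 or 7 8K buffers per nfsiod.
file_size = 1024 * 1024          # 1MB iozone file
nfsiods = 20
bufsize = 8 * 1024               # 8K nfs i/o size

piece = file_size // nfsiods     # 52428 bytes, about 50K per nfsiod
buffers = file_size // bufsize   # 128 8K buffers in total
per_iod = buffers / nfsiods      # 6.4, so "6 or 7" buffers each

print(piece, buffers, per_iod)
```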

I also use:

- no PREEMPTION and no IPI_PREEMPTION on SMP systems.  This limits
   context switching.
- no SCHED_ULE.  HZ = 100.  This also limits context switching.

With more or fairer context switching, all nfsiods are more likely to run,
causing more damage.
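In a custom kernel config those options might be set roughly as follows
(config(8) syntax; the config name is hypothetical and the exact option
spellings should be checked against your source tree):

```
include   GENERIC
ident     NFSTEST            # hypothetical config name

nooptions PREEMPTION         # no kernel preemption
nooptions IPI_PREEMPTION     # only relevant if it was enabled
nooptions SCHED_ULE
options   SCHED_4BSD         # the traditional scheduler
options   HZ=100
```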

More detailed results for iozone 1 65536 with nfsiodmax=64 and oldnfs and
mostly best known other tuning:

- first run write speed 2MB/S (probably still using 20)
   (all rates use disk marketing MB)
- second run 9MB/S
- after repeated runs, 250MB/S
- the speed kept mostly dropping, and reached 21K/S
- server stats for next run at 29K/S: 139 blocks tested and order of
   24 fixed (the server has an early version of what is in -current,
   with more debugging)

with nfsiodmax=20:
- most runs 2-2.2MB/S; one at 750K/S
- server stats for a run at 2.2MB/S: 135 blocks tested and 86 fixed

with nfsiodmax=4:
- 5.8-6.5MB/S
- server stats for a run at 6.0MB/S: 135 blocks tested and 0 fixed

with nfsiodmax=2:
- 4.8-5.2MB/S
- server stats for a run at 5.1MB/S: 138 blocks tested and 0 fixed

with nfsiodmax=1:
- 3.4MB/S
- server stats: 138 blocks tested and 0 fixed

For iozone 512 65536:

with nfsiodmax=1:
- 34.7MB/S
- server stats: 65543 blocks tested and 0 fixed

with nfsiodmax=2:
- 45.9MB/S (this is close to the drive's speed and faster than direct on
   the server; it is faster because the clustering accidentally works well)
- server stats: 65550 blocks tested and 578 fixed

with nfsiodmax=4:
- 45.6MB/S
- server stats: 65550 blocks tested and 2067 fixed

with nfsiodmax=20:
- 21.4MB/S
- server stats: 65576 blocks tested and 12057 fixed
   (it is easy to see how 7 nfsiods could give 1/7 = 14% of blocks
   out of order.  The server is fixing up almost 20%, but that is
   not enough)
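The fractions quoted above can be checked with plain arithmetic from the
numbers in this message:

```python
# Naive estimate vs. observed out-of-order fraction with nfsiodmax=20.
est = 1 / 7              # "7 nfsiods could give 1/7 = 14% of blocks out of order"
fixed = 12057 / 65576    # server fixups observed: "almost 20%"
print(round(est, 3), round(fixed, 3))
```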

with nfsiodmax=64 (caused server to not respond):
- test aborted at 500+MB
- server stats: about 10000 blocks fixed

with nfsiodmax=64 again:
- 9.6MB/S
- server stats: 65598 blocks tested and 14034 fixed

The nfsiod's get scheduled almost equally.

