silly write caching in nfs3

Fri Feb 26 07:07:03 UTC 2016

nfs3 is slower than in old versions of FreeBSD.  I debugged one of the
reasons today.

Writes have apparently always done silly caching.  A typical case is
iozone writing a 512MB file where the file fits in the buffer
cache/VMIO.  The write is cached perfectly.  But then when nfs_open()
reopens the file, it calls vinvalbuf() to discard all of the cached
data.  Thus nfs write caching usually discards useful older data to
make space for newer data that will never be used.  (The exception is
when the file is opened r/w and read back using the same fd, and is
not accessed for a setattr or advlock operation.  These call
vinvalbuf() too, if NMODIFIED.)  The discarding may be delayed for a
long time.  Then
keeping the useless data causes even more older data to be discarded.
Discarding it on close would at least prevent further loss.  It would
have to be committed on close before discarding it of course.
Committing it on close does some good things even without discarding
there, and in oldnfs it gives a bug that prevents discarding in open --
see below.
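
The write/close/reopen/read pattern described above can be reproduced
from userland with something like the following.  This is only a
minimal sketch: write_then_reread() is a made-up helper, not anything
from iozone, and the path is expected to point into an nfs mount.  The
do_fsync argument selects whether fsync() is called before close.  The
function only reports how many bytes were read back; to see whether
the reread came from the cache, time it or compare the client read RPC
counts from nfsstat before and after.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write `total` bytes to `path`, optionally fsync(), close, then
 * reopen the file and read it all back.  Returns the number of bytes
 * read back, or -1 on any error.  Hypothetical helper for experiments.
 */
long long
write_then_reread(const char *path, size_t total, int do_fsync)
{
	const size_t chunk = 65536;
	char *buf;
	size_t done, want;
	ssize_t n;
	long long nread;
	int fd;

	buf = malloc(chunk);
	if (buf == NULL)
		return (-1);
	memset(buf, 'x', chunk);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd == -1) {
		free(buf);
		return (-1);
	}
	for (done = 0; done < total; done += (size_t)n) {
		want = total - done < chunk ? total - done : chunk;
		n = write(fd, buf, want);
		if (n <= 0) {
			close(fd);
			free(buf);
			return (-1);
		}
	}
	if (do_fsync && fsync(fd) == -1) {
		close(fd);
		free(buf);
		return (-1);
	}
	if (close(fd) == -1) {
		free(buf);
		return (-1);
	}

	/* Reopen.  On an nfs mount this is where nfs_open() runs. */
	fd = open(path, O_RDONLY);
	if (fd == -1) {
		free(buf);
		return (-1);
	}
	nread = 0;
	while ((n = read(fd, buf, chunk)) > 0)
		nread += n;
	close(fd);
	free(buf);
	return (n == -1 ? -1 : nread);
}
```

Running it once with do_fsync set to 0 and once with it set to 1 on
the same mount gives the two cases to compare.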

nfs_open() does the discarding for different reasons in the NMODIFIED
and !NMODIFIED cases.  In the NMODIFIED case, it discards unconditionally.
This case can be avoided by fsync() before close or setting the sysctl
to commit in close.  iozone does the fsync().  This helps in oldnfs but
not in newnfs.  With it, iozone on newnfs now behaves like it did on oldnfs
10-20 years ago.  Something (perhaps just the timestamp bugs discussed
later) "fixed" the discarding on oldnfs 5-10 years ago.
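
The two cases can be written down as a toy userland model.  This is
not the kernel code: struct model_node and both functions are invented
for illustration, with `modified' standing in for NMODIFIED and
`cached_mtime' for n_mtime.  The model also assumes that the flush
which precedes the discard clears the flag, and that the cached
timestamp is refreshed on every open.

```c
#include <stdbool.h>
#include <time.h>

/* Toy stand-in for the nfs node; the field names are illustrative. */
struct model_node {
	bool modified;			/* plays the role of NMODIFIED */
	struct timespec cached_mtime;	/* plays the role of n_mtime */
};

/*
 * Model of the decision made at open time.  server_mtime is what a
 * getattr would return.  Returns true if cached data is discarded.
 */
bool
model_open_discards(struct model_node *np, struct timespec server_mtime)
{
	bool discard;

	if (np->modified) {
		/* NMODIFIED case: discard unconditionally. */
		discard = true;
		np->modified = false;	/* assumed: flush clears the flag */
	} else {
		/* !NMODIFIED case: discard only if the mtime changed. */
		discard =
		    np->cached_mtime.tv_sec != server_mtime.tv_sec ||
		    np->cached_mtime.tv_nsec != server_mtime.tv_nsec;
	}
	np->cached_mtime = server_mtime;
	return (discard);
}

/* Model of fsync() before close, or of commit-on-close. */
void
model_commit(struct model_node *np)
{
	np->modified = false;
}
```

In this model, committing before the reopen only moves the open into
the !NMODIFIED branch.  Whether the data then survives depends
entirely on whether cached_mtime already matches the server's
timestamp.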

I think not committing in close is supposed to be an optimization, but
it is actually a pessimization for my kernel build tests (with object
files on nfs, which I normally avoid).  Builds certainly have to reopen
files after writing them, to link them and perhaps to install them.
This causes the discarding.  My kernel build tests also do a lot of
utimes() calls, which cause the discarding before commit-on-close can
remove the above cause for it by clearing NMODIFIED.  Enabling
commit-on-close gives a small optimisation with oldnfs by avoiding all
of the discarding except for utimes().  It reduces read RPCs by about
25% without increasing write RPCs or real time.  It decreases real time
by a few percent.

The other reason for discarding is because the timestamps changed -- you
just wrote the file, so the timestamps should have changed.  Different
bugs in comparing the timestamps gave different misbehaviours.

In old versions of FreeBSD and/or nfs, the timestamps had seconds
granularity, so many changes were missed.  This explains mysterious
behaviours by iozone 10-20 years ago: the write caching is seen to
work perfectly for most small total sizes, since all the writes take
less than 1 second so the timestamps usually don't change (but sometimes
the writes lie across a seconds boundary so the timestamps do change).
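
The effect of the comparison granularity is easy to show in isolation.
The two helpers below are hypothetical, not the kernel's comparison
code.  The first compares only tv_sec, as an implementation with
seconds granularity effectively does, and the second compares the full
timespec.

```c
#include <stdbool.h>
#include <time.h>

/* Seconds granularity: changes within the same second are missed. */
bool
mtime_changed_sec(const struct timespec *cached,
    const struct timespec *server)
{
	return (cached->tv_sec != server->tv_sec);
}

/* Nanoseconds resolution: any difference counts as a change. */
bool
mtime_changed_nsec(const struct timespec *cached,
    const struct timespec *server)
{
	return (cached->tv_sec != server->tv_sec ||
	    cached->tv_nsec != server->tv_nsec);
}
```

With a cached time of 1000.1 seconds and a server time of 1000.9
seconds, the first reports no change and the second reports a change.
If the writes straddle a seconds boundary, both report a change.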

oldnfs was fixed many years ago to use timestamps with nanoseconds
resolution, but it doesn't suffer from the discarding in nfs_open()
in the !NMODIFIED case which is reached by either fsync() before close
or commit on close.  I think this is because it updates n_mtime to
the server's new timestamp in nfs_writerpc().  This seems to be wrong,
since the file might have been written to by other clients and then
the change would not be noticed until much later if ever (setting the
timestamp prevents seeing it change when it is checked later, but you
might be able to see another metadata change).
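
The tradeoff can be sketched with another toy model.  Everything here
is invented for illustration (struct cached_attrs, on_write_reply()
and open_sees_change() are not kernel names), and it only models my
guess above about what nfs_writerpc() does.  The `adopt' argument
selects whether the cached timestamp is overwritten with the one the
server returns after our own write.

```c
#include <stdbool.h>
#include <time.h>

/* Toy cache state; mtime plays the role of n_mtime. */
struct cached_attrs {
	struct timespec mtime;
};

/* Called with the file's mtime as returned in a write reply. */
void
on_write_reply(struct cached_attrs *ca, struct timespec reply_mtime,
    bool adopt)
{
	if (adopt)
		ca->mtime = reply_mtime;
}

/* The check at the next open: does the server's mtime look new? */
bool
open_sees_change(const struct cached_attrs *ca,
    struct timespec server_mtime)
{
	return (ca->mtime.tv_sec != server_mtime.tv_sec ||
	    ca->mtime.tv_nsec != server_mtime.tv_nsec);
}
```

Suppose the cache holds time T0, another client writes at T1, and then
we write at T2.  With adopt set, the cache holds T2 at the next open,
so no change is seen: our data is kept, but the other client's write
at T1 goes unnoticed.  Without it, T0 is compared with T2, a change is
seen, and everything is discarded, including what we just wrote.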

newnfs has quite different code for nfs_writerpc().  Most of it was
moved to another function in another file.  I understand this even
less, but it doesn't seem to fetch the server's new timestamp or
update n_mtime in the v3 case.

There are many other reasons why nfs is slower than in old versions.
One is that writes are more often done out of order.  This tends to
give a slowness factor of about 2 unless the server can fix up the
order.  I use an old server which can do the fixup for old clients but
not for newer clients starting in about FreeBSD-9 (or 7?).  I suspect
that this is just because Giant locking in old clients gave accidental
serialization.  Multiple nfsiod's and/or nfsd's are clearly needed
for performance if you have multiple NICs serving multiple mounts.
Other cases are less clear.  For the iozone benchmark, there is only
1 stream and multiple nfsiod's pessimize it into multiple streams that
give buffers which arrive out of order on the server if the multiple
nfsiod's are actually active.  I use the following configuration to
ameliorate this, but the slowness factor is still often about 2 for
iozone:
- limit nfsd's to 4
- limit nfsiod's to 4
- limit nfs i/o sizes to 8K.  The server fs block size is 16K, and
   using a smaller block size usually helps by giving some delayed
   writes which can be clustered better.  (The non-nfs parts of the
   server could be smarter and do this intentionally.  The out-of-order
   buffers look like random writes to the server.)  16K i/o sizes
   otherwise work OK, but 32K i/o sizes are much slower for unknown
   reasons.

Bruce