silly write caching in nfs3

Rick Macklem rmacklem at
Sat Feb 27 04:00:50 UTC 2016

Bruce Evans wrote:
> nfs3 is slower than in old versions of FreeBSD.  I debugged one of the
> reasons today.
> Writes have apparently always done silly caching.  Typical behaviour
> is for iozone writing a 512MB file where the file fits in the buffer
> cache/VMIO.  The write is cached perfectly.  But then when nfs_open()
> reopens the file, it calls vinvalbuf() to discard all of the cached
> data.  Thus nfs write caching usually discards useful older data to
> make space for newer data that will never be used (unless the
> file is opened r/w and read using the same fd (and is not accessed
> for a setattr or advlock operation -- these call vinvalbuf() too, if
> NMODIFIED)).  The discarding may be delayed for a long time.  Then
> keeping the useless data causes even more older data to be discarded.
> Discarding it on close would at least prevent further loss.  It would
> have to be committed on close before discarding it of course.
> Committing it on close does some good things even without discarding
> there, and in oldnfs it gives a bug that prevents discarding in open --
> see below.
> nfs_open() does the discarding for different reasons in the NMODIFIED
> and !NMODIFIED cases.  In the NMODIFIED case, it discards unconditionally.
> This case can be avoided by fsync() before close or setting the sysctl
> to commit in close.  iozone does the fsync().  This helps in oldnfs but
> not in newnfs.  With it, iozone on newnfs now behaves like it did on oldnfs
> 10-20 years ago.  Something (perhaps just the timestamp bugs discussed
> later) "fixed" the discarding on oldnfs 5-10 years ago.
> I think not committing in close is supposed to be an optimization, but
> it is actually a pessimization for my kernel build tests (with object
> files on nfs, which I normally avoid).  Builds certainly have to reopen
> files after writing them, to link them and perhaps to install them.
> This causes the discarding.  My kernel build tests also do a lot of
> utimes() calls which cause the discarding before commit-on-close can
> avoid the above cause for it by clearing NMODIFIED.  Enabling
> commit-on-close gives a small optimisation with oldnfs by avoiding all
> of the discarding except for utimes().  It reduces read RPCs by about
> 25% without increasing write RPCs or real time.  It decreases real time
> by a few percent.
> The other reason for discarding is because the timestamps changed -- you
> just wrote them, so the timestamps should have changed.  Different bugs
> in comparing the timestamps gave different misbehaviours.
You could easily test to see if second-resolution timestamps make a
difference by redefining the NFS_TIMESPEC_COMPARE() macro
{ in sys/fs/nfsclient/nfsnode.h } so that it only compares the
tv_sec field and not the tv_nsec field.
--> Then the client would only think the mtime has changed when tv_sec
changes.


> In old versions of FreeBSD and/or nfs, the timestamps had seconds
> granularity, so many changes were missed.  This explains mysterious
> behaviours by iozone 10-20 years ago: the write caching is seen to
> work perfectly for most small total sizes, since all the writes take
> less than 1 second so the timestamps usually don't change (but sometimes
> the writes lie across a seconds boundary so the timestamps do change).
> oldnfs was fixed many years ago to use timestamps with nanoseconds
> resolution, but it doesn't suffer from the discarding in nfs_open()
> in the !NMODIFIED case which is reached by either fsync() before close
> or commit on close.  I think this is because it updates n_mtime to
> the server's new timestamp in nfs_writerpc().  This seems to be wrong,
> since the file might have been written to by other clients and then
> the change would not be noticed until much later if ever (setting the
> timestamp prevents seeing it change when it is checked later, but you
> might be able to see another metadata change).
> newnfs has quite different code for nfs_writerpc().  Most of it was
> moved to another function in another file.  I understand this even
> less, but it doesn't seem to fetch the server's new timestamp or
> update n_mtime in the v3 case.
> There are many other reasons why nfs is slower than in old versions.
> One is that writes are more often done out of order.  This tends to
> give a slowness factor of about 2 unless the server can fix up the
> order.  I use an old server which can do the fixup for old clients but
> not for newer clients starting in about FreeBSD-9 (or 7?).  I suspect
> that this is just because Giant locking in old clients gave accidental
> serialization.  Multiple nfsiod's and/or nfsd's are clearly needed
> for performance if you have multiple NICs serving multiple mounts.
> Other cases are less clear.  For the iozone benchmark, there is only
> 1 stream and multiple nfsiod's pessimize it into multiple streams that
> give buffers which arrive out of order on the server if the multiple
> nfsiod's are actually active.  I use the following configuration to
> ameliorate this, but the slowness factor is still often about 2 for
> iozone:
> - limit nfsd's to 4
> - limit nfsiod's to 4
> - limit nfs i/o sizes to 8K.  The server fs block size is 16K, and
>    using a smaller block size usually helps by giving some delayed
>    writes which can be clustered better.  (The non-nfs parts of the
>    server could be smarter and do this intentionally.  The out-of-order
>    buffers look like random writes to the server.)  16K i/o sizes
>    otherwise work OK, but 32K i/o sizes are much slower for unknown
>    reasons.
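Bruce's three knobs could look roughly like the fragment below on a stock
FreeBSD client and server.  This is a hedged sketch: the nfsd(8) flag
syntax and the vfs.nfs.iodmax sysctl name should be double-checked against
your release, and server:/export is a placeholder mount target.

```sh
# server /etc/rc.conf: cap the server at 4 nfsd threads (see nfsd(8))
nfs_server_flags="-u -t -n 4"

# client: cap nfsiod threads at 4 (sysctl name may vary by release)
sysctl vfs.nfs.iodmax=4

# client: mount with 8K i/o sizes so the server sees smaller writes
# that it can cluster better
mount -t nfs -o nfsv3,rsize=8192,wsize=8192 server:/export /mnt
```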
> Bruce
