silly write caching in nfs3

Sat Feb 27 03:51:52 UTC 2016

Bruce Evans wrote:
> nfs3 is slower than in old versions of FreeBSD.  I debugged one of the
> reasons today.
> 
> Writes have apparently always done silly caching.  Typical behaviour
> is for iozone writing a 512MB file where the file fits in the buffer
> cache/VMIO.  The write is cached perfectly.  But then when nfs_open()
> reopens the file, it calls vinvalbuf() to discard all of the cached
> data.  Thus nfs write caching usually discards useful older data to
> make space for newer data that will never be used (unless the
> file is opened r/w and read using the same fd (and is not accessed
> for a setattr or advlock operation -- these call vinvalbuf() too, if
> NMODIFIED)).  The discarding may be delayed for a long time.  Then
> keeping the useless data causes even more older data to be discarded.
> Discarding it on close would at least prevent further loss.  It would
> have to be committed on close before discarding it of course.
> Committing it on close does some good things even without discarding
> there, and in oldnfs it gives a bug that prevents discarding in open --
> see below.
> 
> nfs_open() does the discarding for different reasons in the NMODIFIED
> and !NMODIFIED cases.  In the NMODIFIED case, it discards unconditionally.
> This case can be avoided by fsync() before close or setting the sysctl
> to commit in close.  iozone does the fsync().  This helps in oldnfs but
> not in newnfs.  With it, iozone on newnfs now behaves like it did on oldnfs
> 10-20 years ago.  Something (perhaps just the timestamp bugs discussed
> later) "fixed" the discarding on oldnfs 5-10 years ago.
> 
> I think not committing in close is supposed to be an optimization, but
> it is actually a pessimization for my kernel build tests (with object
> files on nfs, which I normally avoid).  Builds certainly have to reopen
> files after writing them, to link them and perhaps to install them.
> This causes the discarding.  My kernel build tests also do a lot of
> utimes() calls which cause the discarding before commit-on-close can
> avoid the above cause for it by clearing NMODIFIED.  Enabling
> commit-on-close gives a small optimisation with oldnfs by avoiding all
> of the discarding except for utimes().  It reduces read RPCs by about
> 25% without increasing write RPCs or real time.  It decreases real time
> by a few percent.
> 
Well, the new NFS client code was cloned from the old one (about FreeBSD7).
I did this so that the new client wouldn't exhibit different caching
behaviour than the old one (avoiding any POLA violation).
If you look in stable/10/sys/nfsclient/nfs_vnops.c and stable/10/sys/fs/nfsclient/nfs_clvnops.c
at the nfs_open() and nfs_close() functions, the algorithm appears to be
identical for NFSv3. (The new one has a bunch of NFSv4 gunk, but if you
scratch out that stuff and ignore function name differences (nfs_flush() vs
ncl_flush()), I think you'll find them the same. I couldn't spot any
differences at a glance.)
--> see r214513 in head/sys/fs/nfsclient/nfs_clvnops.c for example
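
The open-time behaviour Bruce describes above can be modelled roughly as
follows. This is only a toy sketch in Python, not the kernel code: the
NfsNode class and nfs_open_model() are hypothetical names, the timestamps
are made-up integers, and the flush that precedes the real invalidation is
left out.

```python
from dataclasses import dataclass


@dataclass
class NfsNode:
    modified: bool      # stands in for the NMODIFIED flag
    n_mtime: int        # mtime the client remembers (arbitrary units)
    cached_bytes: int   # data currently held in the buffer cache


def nfs_open_model(node: NfsNode, server_mtime: int) -> bool:
    """Return True if the cached data was discarded on open."""
    if node.modified:
        discard = True                           # NMODIFIED: unconditional
    else:
        discard = server_mtime != node.n_mtime   # !NMODIFIED: timestamp check
    if discard:
        node.cached_bytes = 0                    # models the vinvalbuf() call
        node.modified = False
        node.n_mtime = server_mtime
    return discard


MB = 1024 * 1024

# Case 1: write 512MB, close without committing, then reopen.
dirty = NfsNode(modified=True, n_mtime=100, cached_bytes=512 * MB)
case1 = nfs_open_model(dirty, server_mtime=200)

# Case 2: fsync() before close cleared the flag, but n_mtime is stale.
stale = NfsNode(modified=False, n_mtime=100, cached_bytes=512 * MB)
case2 = nfs_open_model(stale, server_mtime=200)

# Case 3: as case 2, but n_mtime was refreshed from the write reply.
fresh = NfsNode(modified=False, n_mtime=200, cached_bytes=512 * MB)
case3 = nfs_open_model(fresh, server_mtime=200)

print(case1, case2, case3)
```

In this model only the third case keeps the cached data, which matches the
description of why fsync() before close helps only when n_mtime also tracks
the server's new timestamp.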

> The other reason for discarding is because the timestamps changed -- you
> just wrote them, so the timestamps should have changed.  Different bugs
> in comparing the timestamps gave different misbehaviours.
> 
> In old versions of FreeBSD and/or nfs, the timestamps had seconds
> granularity, so many changes were missed.  This explains mysterious
> behaviours by iozone 10-20 years ago: the write caching is seen to
> work perfectly for most small total sizes, since all the writes take
> less than 1 second so the timestamps usually don't change (but sometimes
> the writes lie across a seconds boundary so the timestamps do change).
> 
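
The effect of timestamp granularity described above can be shown with a
small sketch. The mtime_changed() helper and the sample times are
hypothetical, chosen only to illustrate the comparison; this is not how the
kernel stores or compares timestamps.

```python
NSEC_PER_SEC = 1_000_000_000


def mtime_changed(old_ns: int, new_ns: int, granularity_ns: int) -> bool:
    """Compare two timestamps after truncating both to the given granularity."""
    return old_ns // granularity_ns != new_ns // granularity_ns


# Two writes within the same second: 10.2s and 10.7s.
t1, t2 = 10_200_000_000, 10_700_000_000
# A later write that crosses a seconds boundary: 11.1s.
t3 = 11_100_000_000

same_second_coarse = mtime_changed(t1, t2, NSEC_PER_SEC)   # change is missed
cross_second_coarse = mtime_changed(t2, t3, NSEC_PER_SEC)  # change is seen
same_second_fine = mtime_changed(t1, t2, 1)                # change is seen
```

With seconds granularity the first pair compares equal, so a short run of
writes leaves the cache intact; with nanoseconds resolution every write
looks like a change.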
> oldnfs was fixed many years ago to use timestamps with nanoseconds
> resolution, but it doesn't suffer from the discarding in nfs_open()
> in the !NMODIFIED case which is reached by either fsync() before close
> or commit on close.  I think this is because it updates n_mtime to
> the server's new timestamp in nfs_writerpc().  This seems to be wrong,
> since the file might have been written to by other clients and then
> the change would not be noticed until much later if ever (setting the
> timestamp prevents seeing it change when it is checked later, but you
> might be able to see another metadata change).
> 
> newnfs has quite different code for nfs_writerpc().  Most of it was
> moved to another function in another file.  I understand this even
> less, but it doesn't seem to fetch the server's new timestamp or
> update n_mtime in the v3 case.
> 
I'm pretty sure it does capture the new attributes (including mtime) in
the reply. The function is called something like nfscl_loadattrcache().

In general, close-to-open consistency isn't needed for most mounts.
(The only case where it matters is when multiple clients are concurrently
 updating files.)
- There are a couple of options that might help performance when doing
  software builds on an NFS mount:
  nocto (I remember you don't like the name)
    - Actually, I can't remember why the code would still do the cache
      invalidation in nfs_open() when this is set. I wonder if the code
      in nfs_open() should maybe avoid invalidating the buffer cache
      when this is set? (I need to think about this.)
  noncontigwr - This one allows the writes to happen for byte aligned
      chunks when they are non-contiguous without pushing the individual
      writes to the server. (Again, this shouldn't cause problems unless
      multiple clients are writing to the file concurrently.)
Both of these are worth trying for mounts where software builds are being
done.

> There are many other reasons why nfs is slower than in old versions.
> One is that writes are more often done out of order.  This tends to
> give a slowness factor of about 2 unless the server can fix up the
> order.  I use an old server which can do the fixup for old clients but
> not for newer clients starting in about FreeBSD-9 (or 7?).
I actually thought this was mainly caused by the krpc that was introduced
in FreeBSD7 (for both old and new NFS), separating the RPC from NFS.
There are 2 layers in the krpc (sys/rpc/clnt_rc.c and sys/rpc/clnt_vc.c)
that each use acquisition of a mutex to allow an RPC message to be sent.
(Whichever thread happens to acquire the mutex first, sends first.)

I had a couple of patches that tried to keep the RPC messages more ordered.
(They would not have guaranteed exact ordering.) They seemed to help for
the limited testing I could do, but since I wasn't seeing a lot of
"out of order" reads/writes on my single core hardware, I couldn't verify
how well these patches worked. mav@ was working on this at the time, but
didn't get these patches tested either, from what I recall.
--> Unfortunately, I seem to have lost these patches or I would have
    attached them so you could try them. Ouch.
(I've cc'd mav@. Maybe he'll have them lying about. I think one was
 related to the nfsiod and the other for either sys/rpc/clnt_rc.c or
 sys/rpc/clnt_vc.c.)

The patches were all client side. Maybe I'll try and recreate them.

>  I suspect
> that this is just because Giant locking in old clients gave accidental
> serialization.  Multiple nfsiod's and/or nfsd's are clearly needed
> for performance if you have multiple NICs serving multiple mounts.
Shared vnode locks are also a factor, at least for reads.
(Before shared vnode locks, the vnode lock essentially serialized all reads.)

As you note, a single threaded benchmark test is quite different than a lot
of clients with a lot of threads doing I/O on a lot of files concurrently.

The bandwidth * delay product of your network interconnect is also a factor.
The larger this is, the more bits you need to be in transit to "fill the data
pipe". You can increase the # of bits in transit by either using larger rsize/wsize
or more read-ahead/write-behind.
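
As a rough worked example of the bandwidth * delay arithmetic (a sketch
only; the 1 Gbit/s link and 1 ms round-trip time are assumed figures, not
measurements from this thread, and the function names are hypothetical):

```python
def bytes_in_flight(bandwidth_bits_per_sec: int, rtt_usec: int) -> int:
    """Bandwidth * delay product, converted from bits to bytes."""
    bits = bandwidth_bits_per_sec * rtt_usec // 1_000_000
    return bits // 8


def rpcs_to_fill_pipe(bandwidth_bits_per_sec: int, rtt_usec: int,
                      io_size_bytes: int) -> int:
    """Number of outstanding I/O RPCs of io_size_bytes needed to cover the product."""
    needed = bytes_in_flight(bandwidth_bits_per_sec, rtt_usec)
    return -(-needed // io_size_bytes)   # ceiling division


GBIT = 1_000_000_000   # assumed link speed, bits per second
RTT_USEC = 1000        # assumed round-trip time, 1 ms

pipe_bytes = bytes_in_flight(GBIT, RTT_USEC)
rpcs_8k = rpcs_to_fill_pipe(GBIT, RTT_USEC, 8 * 1024)
rpcs_64k = rpcs_to_fill_pipe(GBIT, RTT_USEC, 64 * 1024)
```

Under these assumptions the pipe holds 125000 bytes, so about 16
outstanding 8K transfers are needed to fill it, versus 2 at 64K. Smaller
rsize/wsize therefore needs correspondingly more read-ahead/write-behind.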

It would be nice to figure out why your case is performing better on the
old nfs client (and/or server).

If you have a fairly recent FreeBSD10 system, you could try doing mounts
with new vs old client (and no other changes) and see what differences
occur. (that would isolate new vs old from recent "old" and "really old")

Good luck with it, rick

> Other cases are less clear.  For the iozone benchmark, there is only
> 1 stream and multiple nfsiod's pessimize it into multiple streams that
> give buffers which arrive out of order on the server if the multiple
> nfsiod's are actually active.  I use the following configuration to
> ameliorate this, but the slowness factor is still often about 2 for
> iozone:
> - limit nfsd's to 4
> - limit nfsiod's to 4
> - limit nfs i/o sizes to 8K.  The server fs block size is 16K, and
>    using a smaller block size usually helps by giving some delayed
>    writes which can be clustered better.  (The non-nfs parts of the
>    server could be smarter and do this intentionally.  The out-of-order
>    buffers look like random writes to the server.)  16K i/o sizes
>    otherwise work OK, but 32K i/o sizes are much slower for unknown
>    reasons.
> 
> Bruce
> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
>