silly write caching in nfs3

Rick Macklem rmacklem at
Sun Feb 28 01:25:52 UTC 2016

Bruce Evans wrote:
> On Fri, 26 Feb 2016, Rick Macklem wrote:
> > Bruce Evans wrote:
> >> nfs3 is slower than in old versions of FreeBSD.  I debugged one of the
> >> reasons today.
> >>
> >> Writes have apparently always done silly caching.  Typical behaviour
> >> is for iozone writing a 512MB file where the file fits in the buffer
> >> cache/VMIO.  The write is cached perfectly.  But then when nfs_open()
> >> reopens the file, it calls vinvalbuf() to discard all of the cached
> >> data.  Thus nfs write caching usually discards useful older data to
> >> ...
> >> I think not committing in close is supposed to be an optimization, but
> >> it is actually a pessimization for my kernel build tests (with object
> >> files on nfs, which I normally avoid).  Builds certainly have to reopen
> >> files after writing them, to link them and perhaps to install them.
> >> This causes the discarding.  My kernel build tests also do a lot of
> >> utimes() calls which cause the discarding before commit-on-close can
> >> avoid the above cause for it by clearing NMODIFIED.  Enabling
> >> commit-on-close gives a small optimisation with oldnfs by avoiding all
> >> of the discarding except for utimes().  It reduces read RPCs by about
> >> 25% without increasing write RPCs or real time.  It decreases real time
> >> by a few percent.
> >>
> > Well, the new NFS client code was cloned from the old one (about FreeBSD7).
> > I did this so that the new client wouldn't exhibit different caching
> > behaviour than the old one (avoiding any POLA).
> > If you look in stable/10/sys/nfsclient/nfs_vnops.c and
> > stable/10/sys/fs/nfsclient/nfs_clvnops.c
> > at the nfs_open() and nfs_close() functions, the algorithm appears to be
> > identical for NFSv3. (The new one has a bunch of NFSv4 gunk, but if you
> > scratch out that stuff and ignore function name differences (nfs_flush() vs
> > ncl_flush()), I think you'll find them the same. I couldn't spot any
> > differences at a glance.)
> > --> see r214513 in head/sys/fs/nfsclient/nfs_clvnops.c for example
> I blamed newnfs before :-), but when I looked at newnfs more closely I
> found that it was almost the same lexically in the most interesting
> places (but unfortunately has lexical differences from s/nfs/ncl/,
> but doesn't have enough of these differences for debugging --
> debugging is broken by having 2 static functions named nfs_foo() for
> many values of foo).  But newnfs seems to have always been missing this
> critical code:
> X   1541    rgrimes int
> X  83651      peter nfs_writerpc(struct vnode *vp, struct uio *uiop, struct ucred *cred,
> X 158739     mohans 	     int *iomode, int *must_commit)
> X   1541    rgrimes {
> X   9336        dfr 		if (v3) {
> X   9336        dfr 			wccflag = NFSV3_WCCCHK;
> X ...
> X 158739     mohans 		}
> X 158739     mohans 		if (wccflag) {
> X 158739     mohans 			mtx_lock(&(VTONFS(vp))->n_mtx);
> X 158739     mohans 			VTONFS(vp)->n_mtime = VTONFS(vp)->n_vattr.va_mtime;
>                     			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> X 158739     mohans 			mtx_unlock(&(VTONFS(vp))->n_mtx);
> X 158739     mohans 		}
Well, this code does exist in the new client. The function is nfsrpc_writerpc()
found in sys/fs/nfsclient/nfs_clrpcops.c. It calls nfscl_wcc_data() which does
the same test as nfsm_wcc_data_xx() did in the old client.
Then NFSWRITERPC_SETTIME() set the time if wccflag was set.

However...NFSWRITERPC_SETTIME() is broken and sets n_mtime from the cached
value in the nfs vnode (same as the old code). Unfortunately the cached value
in the nfs vnode hasn't been updated at that point. (The RPC layer functions
don't update the attribute cache.)

The attached patch fixes this. Please test with the attached patch.

Good catch. Thanks for reporting it.
> This was in 4.4BSD-Lite1 under a slightly (?) different condition.
> BTW, how do you use svn to see the history of removed files?  nfs_vnops.c
> has been removed in -current.  I can find it in other branches, but is
> hard to find there even if you know where it is.  This is no better than
> in cvs, where to find its full history I have to know where it is in 3
> different repositories that I have online and more that I should have.
> >> The other reason for discarding is because the timestamps changed -- you
> >> just wrote them, so the timestamps should have changed.  Different bugs
> >> in comparing the timestamps gave different misbehaviours.
> >>
> >> In old versions of FreeBSD and/or nfs, the timestamps had seconds
> >> granularity, so many changes were missed.  This explains mysterious
> >> behaviours by iozone 10-20 years ago: the write caching is seen to
> >> work perfectly for most small total sizes, since all the writes take
> >> less than 1 second so the timestamps usually don't change (but sometimes
> >> the writes lie across a seconds boundary so the timestamps do change).
> >>
> >> oldnfs was fixed many years ago to use timestamps with nanoseconds
> >> resolution, but it doesn't suffer from the discarding in nfs_open()
> >> in the !NMODIFIED case which is reached by either fsync() before close
> >> or commit on close.  I think this is because it updates n_mtime to
> >> the server's new timestamp in nfs_writerpc().  This seems to be wrong,
> >> since the file might have been written to by other clients and then
> >> the change would not be noticed until much later if ever (setting the
> >> timestamp prevents seeing it change when it is checked later, but you
> >> might be able to see another metadata change).
> >>
> >> newnfs has quite different code for nfs_writerpc().  Most of it was
> >> moved to another function in another file.  I understand this even
> >> less, but it doesn't seem to fetch the server's new timestamp or
> >> update n_mtime in the v3 case.
> >>
> > I'm pretty sure it does capture the new attributes (including mtime) in
> > the reply. The function is called something like nfscl_loadattrcache().
> Debugging shows that it loads the new attributes but doesn't clobber
> n_mtime with them.  For a write test that takes 20 seconds, n_mtime sticks
> at its original value and the server time advances with each write by 20
> seconds total (the server time only advances every second if the server
> timestamp precision is only 1 second).
Yep. The attached patch fixes this.

> > In general, close-to-open consistency isn't needed for most mounts.
> > (The only case where it matters is when multiple clients are concurrently
> > updating files.)
> > - There are a couple of options that might help performance when doing
> >  software builds on an NFS mount:
> >  nocto (I remember you don't like the name)
> I actually do like it except for its negative logic.  To turn it back on,
> you would need to use nonocto, but IIRC the negative logic for that is
> still broken (missing), so there is no way to turn it back on.
> >    - Actually, I can't remember why the code would still do the cache
> >      invalidation in nfs_open() when this is set. I wonder if the code
> >      in nfs_open() should maybe avoid invalidating the buffer cache
> >      when this is set? (I need to think about this.)
> I think it is technically correct for something to do the invalidation
> if NMODIFIED is still set in nfs_open().  nocto shouldn't and doesn't
> affect that.  nocto is checked only in nfs_lookup() and only affects
> nfs_open() indirectly: its effect is that when nocto is not set,
> nfs_lookup() clears n_attrstamp which causes nfs_lookup() to do more,
> but hopefully still not cache invalidation.  Cache invalidation is
> also done after a timeout and nocto doesn't affect that either.
> I still leave nocto off except for testing.  I want to optimise the
> cto case, and my reference benchmarks are with cto.
> >  noncontigwr - This one allows the writes to happen for byte aligned
> >      chunks when they are non-contiguous without pushing the individual
> >      writes to the server. (Again, this shouldn't cause problems unless
> >      multiple clients are writing to the file concurrently.)
> > Both of these are worth trying for mounts where software builds are being
> > done.
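For reference, both options can be given at mount time; a minimal example
(server:/export and /mnt are placeholders):

```shell
# Mount an NFSv3 filesystem without close-to-open consistency and with
# non-contiguous write merging, per the discussion above (run as root).
mount -t nfs -o nfsv3,nocto,noncontigwr server:/export /mnt
```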
> I tried this to see if it would fix the unordered writes.  I didn't
> expect it to do much because I usually only have a single active
> client and a single active writer per file.  It didn't make much
> difference.
> With nfsiods misordering writes, this option might give another source
> of silly writes.  After it merges writes to give perfect contiguity,
> you send them to multiple nfsiods which might give perfect discontiguity
> (worse than random) :-).
> >> There are many other reasons why nfs is slower than in old versions.
> >> One is that writes are more often done out of order.  This tends to
> >> give a slowness factor of about 2 unless the server can fix up the
> >> order.  I use an old server which can do the fixup for old clients but
> >> not for newer clients starting in about FreeBSD-9 (or 7?).
> > I actually thought this was mainly caused by the krpc that was introduced
> > in FreeBSD7 (for both old and new NFS), separating the RPC from NFS.
> > There are 2 layers in the krpc (sys/rpc/clnt_rc.c and sys/rpc/clnt_vc.c)
> > that each use acquisition of a mutex to allow an RPC message to be sent.
> > (Whichever thread happens to acquire the mutex first, sends first.)
> I don't like the new krpc since it is larger and harder to debug
> (especially for me since I don't understand the old krpc either :-),
> but it is in FreeBSD-7 and in my main reference kernel r181717, and
> these don't have so many unordered blocks for at least writing.
> > I had a couple of patches that tried to keep the RPC messages more ordered.
> > (They would not have guaranteed exact ordering.) They seemed to help for
> > the limited testing I could do, but since I wasn't seeing a lot of
> > "out of order" reads/writes on my single core hardware, I couldn't verify
> I usually use single core hardware too, but saw some problems with 2 cores,
> and now with 8 cores the problems seem to be fundamental.
> > how well these patches worked. mav@ was working on this at the time, but
> > didn't get these patches tested either, from what I recall.
> > --> Unfortunately, I seem to have lost these patches or I would have
> >    attached them so you could try them. Ouch.
> > (I've cc'd mav at . Maybe he'll have them lying about. I think one was
> > related to the nfsiod and the other for either sys/rpc/clnt_rc.c or
> > sys/rpc/clnt_vc.c.)
> >
> > The patches were all client side. Maybe I'll try and recreate them.
> It seems to require lots of communication between separate nfsiods to
> even preserve an order that has carefully been set up for them.  If
> you have this then it is unclear why it can't be done more simply using
> a single nfsiod thread (per NIC or ifq).  Only 1 thread should talk to
> the NIC/ifq, since you lose control if you put other threads in between.
> If the NIC/ifq uses multiple threads then maintaining the order is its
> problem.
Yes. The patches I had didn't guarantee complete ordering in part because the
one was for the nfsiods and the other for the RPC request in the krpc code.
I will try and redo them one of these days.

I will mention that the NFS RFCs have no requirement nor recommendation for
ordering. The RPCs are assumed to be atomic entities. Having said that, I
do believe that FreeBSD servers will perform better when the reads/writes are
in sequentially increasing byte order, so this is worth working on.

> >>  I suspect
> >> that this is just because Giant locking in old clients gave accidental
> >> serialization.  Multiple nfsiod's and/or nfsd's are clearly needed
> >> for performance if you have multiple NICs serving multiple mounts.
> > Shared vnode locks are also a factor, at least for reads.
> > (Before shared vnode locks, the vnode lock essentially serialized all
> > reads.)
> >
> > As you note, a single threaded benchmark test is quite different than a lot
> > of clients with a lot of threads doing I/O on a lot of files concurrently.
> It is also an important case for me.  I mainly want creating of object files
> to be fast, and the cache invalidation and unordered blocks seem to be
> relatively even larger in this case.  A typical compiler operation (if
> tmp or obj files are on nfs, which they shouldn't be but sometimes are) is:
>      cc -pipe avoids creating intermediate file for preprocessor
>      cc -c creates intermediate .S file for some compilers (not clang)
>         .S file is written to cache
>         reopening .S file to actually use it invalidates cache
>         (workaround:
>         (1) enable commit on close to clear NMODIFIED, and
>         (2) use 1-second timestamp resolution on server to break detection
>  	   of the change, provided the file can be created and reopened
>  	   without crossing a seconds boundary), or
>         (3) use oldnfs
>      cc -c creates intermediate .o file for all compilers
>         similar considerations
>      link step uses .o files and invalidates their cache in most cases
>         (workaround: as above, except the whole compile usually takes
>         more than 1 second, so the timestamp resolution hack doesn't work)
>      install step
>         similar considerations -- the linked file was the intermediate file
>         for this step and reopening invalidates its cache.
> > The bandwidth * delay product of your network interconnect is also a
> > factor.
> > The larger this is, the more bits you need to be in transit to "fill the
> > data
> > pipe". You can increase the # of bits in transit by either using larger
> > rsize/wsize
> > or more read-ahead/write-behind.
> I already have latency tuned to about 5-10 times smaller than on FreeBSD
> cluster machines, with the result that most operations are even more than
> 5-10 times faster, due to smaller operations and most operations not having
> special support for keeping pipes full (if that is possible at all).
> > It would be nice to figure out why your case is performing better on the
> > old nfs client (and/or server).
> >
> > If you have a fairly recent FreeBSD10 system, you could try doing mounts
> > with new vs old client (and no other changes) and see what differences
> > occur. (that would isolate new vs old from recent "old" and "really old")
> Er, that is what I already did to isolate this problem.  I have oldnfs and
> newnfs in about 50 test kernels and finally isolated this problem in an
> up to date FreeBSD-10.
> Bruce
Thanks for finding the "n_mtime isn't getting updated by writes" bug.

The attached patch should fix it, rick
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nfswriterpc.patch
Type: text/x-patch
Size: 1116 bytes
Desc: not available
URL: <>

More information about the freebsd-fs mailing list