silly write caching in nfs3

Sat Feb 27 06:14:40 UTC 2016

On Fri, 26 Feb 2016, Rick Macklem wrote:

> Bruce Evans wrote:
>> nfs3 is slower than in old versions of FreeBSD.  I debugged one of the
>> reasons today.
>>
>> Writes have apparently always done silly caching.  Typical behaviour
>> is for iozone writing a 512MB file where the file fits in the buffer
>> cache/VMIO.  The write is cached perfectly.  But then when nfs_open()
>> reopens the file, it calls vinvalbuf() to discard all of the cached
>> data.  Thus nfs write caching usually discards useful older data to
>> ...
>> I think not committing in close is supposed to be an optimization, but
>> it is actually a pessimization for my kernel build tests (with object
>> files on nfs, which I normally avoid).  Builds certainly have to reopen
>> files after writing them, to link them and perhaps to install them.
>> This causes the discarding.  My kernel build tests also do a lot of
>> utimes() calls which cause the discarding before commit-on-close can
>> avoid the above cause for it by clearing NMODIFIED.  Enabling
>> commit-on-close gives a small optimisation with oldnfs by avoiding all
>> of the discarding except for utimes().  It reduces read RPCs by about
>> 25% without increasing write RPCs or real time.  It decreases real time
>> by a few percent.
>>
> Well, the new NFS client code was cloned from the old one (about FreeBSD7).
> I did this so that the new client wouldn't exhibit different caching
> behaviour than the old one (avoiding any POLA violation).
> If you look in stable/10/sys/nfsclient/nfs_vnops.c and stable/10/sys/fs/nfsclient/nfs_clvnops.c
> at the nfs_open() and nfs_close() functions, the algorithm appears to be
> identical for NFSv3. (The new one has a bunch of NFSv4 gunk, but if you
> scratch out that stuff and ignore function name differences (nfs_flush() vs
> ncl_flush()), I think you'll find them the same. I couldn't spot any
> differences at a glance.)
> --> see r214513 in head/sys/fs/nfsclient/nfs_clvnops.c for example

I blamed newnfs before :-), but when I looked at newnfs more closely I
found that it was almost the same lexically in the most interesting
places (but unfortunately has lexical differences from s/nfs/ncl/,
yet doesn't have enough of these differences for debugging --
debugging is broken by having 2 static functions named nfs_foo() for
many values of foo).  But newnfs seems to have always been missing this
critical code:

X   1541    rgrimes int
X  83651      peter nfs_writerpc(struct vnode *vp, struct uio *uiop, struct ucred *cred,
X 158739     mohans 	     int *iomode, int *must_commit)
X   1541    rgrimes {
X   9336        dfr 		if (v3) {
X   9336        dfr 			wccflag = NFSV3_WCCCHK;
X ...
X 158739     mohans 		}
X 158739     mohans 		if (wccflag) {
X 158739     mohans 			mtx_lock(&(VTONFS(vp))->n_mtx);
X 158739     mohans 			VTONFS(vp)->n_mtime = VTONFS(vp)->n_vattr.va_mtime;
                    			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
X 158739     mohans 			mtx_unlock(&(VTONFS(vp))->n_mtx);
X 158739     mohans 		}

This was in 4.4BSD-Lite1 under a slightly (?) different condition.
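
A minimal sketch of why that one assignment matters, written as a userland toy model and not as the kernel code.  The struct and the toy_*() functions are hypothetical names invented for illustration; the only things taken from the excerpt above are the idea that a write reply delivers the server's new mtime and that the old client copies it into n_mtime.  The reopen check is assumed to be a plain comparison of n_mtime against the freshly loaded va_mtime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Toy stand-in for per-file client state; NOT the kernel's struct nfsnode. */
struct toy_node {
    struct timespec n_mtime;   /* mtime the cached data is believed to match */
    struct timespec va_mtime;  /* mtime from the most recently loaded attributes */
    bool cache_valid;          /* true while cached file data is retained */
};

static bool
ts_equal(const struct timespec *a, const struct timespec *b)
{
    return (a->tv_sec == b->tv_sec && a->tv_nsec == b->tv_nsec);
}

/*
 * Model of a write reply: the server's new mtime is always loaded into the
 * attribute cache; n_mtime is only updated when update_n_mtime is true
 * (the behaviour of the excerpted old-client code).
 */
void
toy_write_reply(struct toy_node *np, struct timespec server_mtime,
    bool update_n_mtime)
{
    assert(np != NULL);
    np->va_mtime = server_mtime;
    if (update_n_mtime)
        np->n_mtime = np->va_mtime;
}

/*
 * Model of the reopen check (assumed form): if the remembered mtime differs
 * from the current attributes, discard the cache.  Returns true on discard.
 */
bool
toy_open(struct toy_node *np)
{
    assert(np != NULL);
    if (!ts_equal(&np->n_mtime, &np->va_mtime)) {
        np->cache_valid = false;
        np->n_mtime = np->va_mtime;
        return (true);
    }
    return (false);
}
```

With update_n_mtime true the reopen sees matching timestamps and keeps the cache; with it false the client's own write looks like a foreign change and the cache is thrown away.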

BTW, how do you use svn to see the history of removed files?  nfs_vnops.c
has been removed in -current.  I can find it in other branches, but it is
hard to find there even if you know where it is.  This is no better than
in cvs, where to find its full history I have to know where it is in 3
different repositories that I have online and more that I should have.

>> The other reason for discarding is because the timestamps changed -- you
>> just wrote them, so the timestamps should have changed.  Different bugs
>> in comparing the timestamps gave different misbehaviours.
>>
>> In old versions of FreeBSD and/or nfs, the timestamps had seconds
>> granularity, so many changes were missed.  This explains mysterious
>> behaviours by iozone 10-20 years ago: the write caching is seen to
>> work perfectly for most small total sizes, since all the writes take
>> less than 1 second so the timestamps usually don't change (but sometimes
>> the writes lie across a seconds boundary so the timestamps do change).
>>
>> oldnfs was fixed many years ago to use timestamps with nanoseconds
>> resolution, but it doesn't suffer from the discarding in nfs_open()
>> in the !NMODIFIED case which is reached by either fsync() before close
>> or commit on close.  I think this is because it updates n_mtime to
>> the server's new timestamp in nfs_writerpc().  This seems to be wrong,
>> since the file might have been written to by other clients and then
>> the change would not be noticed until much later if ever (setting the
>> timestamp prevents seeing it change when it is checked later, but you
>> might be able to see another metadata change).
>>
>> newnfs has quite different code for nfs_writerpc().  Most of it was
>> moved to another function in another file.  I understand this even
>> less, but it doesn't seem to fetch the server's new timestamp or
>> update n_mtime in the v3 case.
>>
> I'm pretty sure it does capture the new attributes (including mtime) in
> the reply. The function is called something like nfscl_loadattrcache().

Debugging shows that it loads the new attributes but doesn't clobber
n_mtime with them.  For a write test that takes 20 seconds, n_mtime sticks
at its original value and the server time advances with each write by 20
seconds total (the server time only advances every second if the server
timestamp precision is only 1 second).

> In general, close-to-open consistency isn't needed for most mounts.
> (The only case where it matters is when multiple clients are concurrently
> updating files.)
> - There are a couple of options that might help performance when doing
>  software builds on an NFS mount:
>  nocto (I remember you don't like the name)

I actually do like it except for its negative logic.  To turn it back on,
you would need to use nonocto, but IIRC the negative logic for that is
still broken (missing), so there is no way to turn it back on.

>    - Actually, I can't remember why the code would still do the cache
>      invalidation in nfs_open() when this is set. I wonder if the code
>      in nfs_open() should maybe avoid invalidating the buffer cache
>      when this is set? (I need to think about this.)

I think it is technically correct for something to do the invalidation
if NMODIFIED is still set in nfs_open().  nocto shouldn't and doesn't
affect that.  nocto is checked only in nfs_lookup() and only affects
nfs_open() indirectly: its effect is that when nocto is not set,
nfs_lookup() clears n_attrstamp which causes nfs_open() to do more,
but hopefully still not cache invalidation.  Cache invalidation is
also done after a timeout and nocto doesn't affect that either.
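
A rough sketch of that indirect effect, again as a toy model with hypothetical names (struct toy_attrs and the toy_*() functions are not kernel interfaces).  It assumes that a zero n_attrstamp means "cached attributes are stale" and that the open path then fetches fresh attributes; the attribute timeout and the NMODIFIED path are deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy attribute-cache state; NOT the kernel's struct nfsnode. */
struct toy_attrs {
    long n_attrstamp;   /* time the attributes were cached; 0 means stale */
    int getattr_rpcs;   /* number of attribute fetches issued so far */
};

/* Lookup for an open: with cto (nocto == false) the attributes are marked stale. */
void
toy_lookup(struct toy_attrs *np, bool nocto)
{
    assert(np != NULL);
    if (!nocto)
        np->n_attrstamp = 0;
}

/* Open: refetch attributes only if they were marked stale. */
void
toy_open_getattr(struct toy_attrs *np, long now)
{
    assert(np != NULL && now != 0);
    if (np->n_attrstamp == 0) {
        np->getattr_rpcs++;
        np->n_attrstamp = now;
    }
}
```

In this model nocto only saves the attribute fetch on open; it does nothing about the mtime comparison that follows once the attributes are loaded.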

I still leave nocto off except for testing.  I want to optimise the
cto case, and my reference benchmarks are with cto.

>  noncontigwr - This one allows the writes to happen for byte aligned
>      chunks when they are non-contiguous without pushing the individual
>      writes to the server. (Again, this shouldn't cause problems unless
>      multiple clients are writing to the file concurrently.)
> Both of these are worth trying for mounts where software builds are being
> done.

I tried this to see if it would fix the unordered writes.  I didn't
expect it to do much because I usually only have a single active
client and a single active writer per file.  It didn't make much
difference.

With nfsiods misordering writes, this option might give another source
of silly writes.  After it merges writes to give perfect contiguity,
you send them to multiple nfsiods which might give perfect discontiguity
(worse than random) :-).
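
To make "perfect discontiguity" concrete, here is a small deterministic sketch.  It assumes a worst-case schedule that I made up for illustration: blocks are dealt round-robin to the workers, and each worker then sends its whole share before the next one runs.  Real nfsiod scheduling is not this regular; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill out[] with the order in which block numbers 0..nblocks-1 reach the
 * wire under the assumed schedule.  out[] must hold nblocks entries.
 * Returns the number of entries written.
 */
size_t
toy_dispatch_order(size_t nblocks, size_t nworkers, size_t *out)
{
    size_t n = 0;

    assert(out != NULL && nworkers > 0);
    for (size_t w = 0; w < nworkers; w++)
        for (size_t b = w; b < nblocks; b += nworkers)
            out[n++] = b;
    return (n);
}

/* Count adjacent pairs where the second block directly follows the first. */
size_t
toy_contiguous_pairs(const size_t *order, size_t n)
{
    size_t count = 0;

    assert(order != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        if (order[i] == order[i - 1] + 1)
            count++;
    return (count);
}
```

For 8 blocks, one worker gives 7 contiguous pairs out of 7, while 4 workers under this schedule give the order 0 4 1 5 2 6 3 7 with no contiguous pair at all.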

>> There are many other reasons why nfs is slower than in old versions.
>> One is that writes are more often done out of order.  This tends to
>> give a slowness factor of about 2 unless the server can fix up the
>> order.  I use an old server which can do the fixup for old clients but
>> not for newer clients starting in about FreeBSD-9 (or 7?).
> I actually thought this was mainly caused by the krpc that was introduced
> in FreeBSD7 (for both old and new NFS), separating the RPC from NFS.
> There are 2 layers in the krpc (sys/rpc/clnt_rc.c and sys/rpc/clnt_vc.c)
> that each use acquisition of a mutex to allow an RPC message to be sent.
> (Whichever thread happens to acquire the mutex first, sends first.)

I don't like the new krpc since it is larger and harder to debug
(especially for me since I don't understand the old krpc either :-),
but it is in FreeBSD-7 and in my main reference kernel r181717, and
these don't have so many unordered blocks for at least writing.

> I had a couple of patches that tried to keep the RPC messages more ordered.
> (They would not have guaranteed exact ordering.) They seemed to help for
> the limited testing I could do, but since I wasn't seeing a lot of
> "out of order" reads/writes on my single core hardware, I couldn't verify

I usually use single core hardware too, but saw some problems with 2 cores,
and now with 8 cores the problems seem to be fundamental.

> how well these patches worked. mav@ was working on this at the time, but
> didn't get these patches tested either, from what I recall.
> --> Unfortunately, I seem to have lost these patches or I would have
>    attached them so you could try them. Ouch.
> (I've cc'd mav@. Maybe he'll have them lying about. I think one was
> related to the nfsiod and the other for either sys/rpc/clnt_rc.c or
> sys/rpc/clnt_vc.c.)
>
> The patches were all client side. Maybe I'll try and recreate them.

It seems to require lots of communication between separate nfsiods to
even preserve an order that has carefully been set up for them.  If
you have this then it is unclear why it can't be done more simply using
a single nfsiod thread (per NIC or ifq).  Only 1 thread should talk to
the NIC/ifq, since you lose control if you put other threads in between.
If the NIC/ifq uses multiple threads then maintaining the order is its
problem.

>>  I suspect
>> that this is just because Giant locking in old clients gave accidental
>> serialization.  Multiple nfsiod's and/or nfsd's are clearly needed
>> for performance if you have multiple NICs serving multiple mounts.
> Shared vnode locks are also a factor, at least for reads.
> (Before shared vnode locks, the vnode lock essentially serialized all reads.)
>
> As you note, a single threaded benchmark test is quite different than a lot
> of clients with a lot of threads doing I/O on a lot of files concurrently.

It is also an important case for me.  I mainly want creating of object files
to be fast, and the cache invalidation and unordered blocks seem to be
relatively even larger in this case.  A typical compiler operation (if
tmp or obj files are on nfs, which they shouldn't be but sometimes are) is:

     cc -pipe avoids creating intermediate file for preprocessor
     cc -c creates intermediate .S file for some compilers (not clang)
        .S file is written to cache
        reopening .S file to actually use it invalidates cache
        (workaround:
        (1) enable commit on close to clear NMODIFIED, and
        (2) use 1-second timestamp resolution on server to break detection
 	   of the change, provided the file can be created and reopened
 	   without crossing a seconds boundary), or
        (3) use oldnfs)
     cc -c creates intermediate .o file for all compilers
        similar considerations
     link step uses .o files and invalidates their cache in most cases
        (workaround: as above, except the whole compile usually takes
        more than 1 second, so the timestamp resolution hack doesn't work)
     install step
        similar considerations -- the linked file was the intermediate file
        for this step and reopening invalidates its cache.
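
The 1-second timestamp hack in the workarounds above can be sketched as follows.  This is a toy model with hypothetical function names: it assumes a server with 1-second resolution simply reports tv_nsec as 0, and that the client detects a change by comparing full timespecs.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* mtime as a server would report it, optionally truncated to whole seconds. */
struct timespec
toy_server_stamp(struct timespec now, bool one_second_resolution)
{
    assert(now.tv_nsec >= 0 && now.tv_nsec < 1000000000L);
    if (one_second_resolution)
        now.tv_nsec = 0;
    return (now);
}

/* Client-side change detection: any difference counts as a change. */
bool
toy_mtime_changed(struct timespec cached, struct timespec fresh)
{
    return (cached.tv_sec != fresh.tv_sec ||
        cached.tv_nsec != fresh.tv_nsec);
}
```

A file created at 10.2s and last written at 10.9s looks unchanged with 1-second stamps, so the reopen keeps the cache; with nanosecond stamps, or once the writes cross into second 11, the change is detected and the cache is discarded.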

> The bandwidth * delay product of your network interconnect is also a factor.
> The larger this is, the more bits you need to be in transit to "fill the data
> pipe". You can increase the # of bits in transit by either using larger rsize/wsize
> or more read-ahead/write-behind.

I already have latency tuned to about 5-10 times smaller than on FreeBSD
cluster machines, with the result that most operations are even more than
5-10 times faster, due to smaller operations and most operations not having
special support for keeping pipes full (if that is possible at all).
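
For reference, the bandwidth * delay arithmetic is easy to sketch.  The function names and the example figures (1 Gbit/s, 100 microseconds round trip, 8K and 32K I/O sizes) are assumptions chosen for illustration, not measurements from my network.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that must be in flight to keep a link of the given speed busy. */
uint64_t
toy_bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_usec)
{
    /* Divide by 8 first to keep the intermediate product small. */
    return (bits_per_sec / 8 * rtt_usec / 1000000);
}

/* Number of outstanding RPCs of io_size bytes needed to cover the BDP. */
uint64_t
toy_rpcs_to_fill(uint64_t bdp_bytes, uint64_t io_size)
{
    assert(io_size > 0);
    return ((bdp_bytes + io_size - 1) / io_size);
}
```

With those example figures the product is 12500 bytes, which takes 2 outstanding 8K writes or a single 32K write to cover.  The lower the latency, the less there is to gain from larger rsize/wsize or more write-behind.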

> It would be nice to figure out why your case is performing better on the
> old nfs client (and/or server).
>
> If you have a fairly recent FreeBSD10 system, you could try doing mounts
> with new vs old client (and no other changes) and see what differences
> occur. (that would isolate new vs old from recent "old" and "really old")

Er, that is what I already did to isolate this problem.  I have oldnfs and
newnfs in about 50 test kernels and finally isolated this problem in an
up to date FreeBSD-10.

Bruce