silly write caching in nfs3

Sun Feb 28 04:39:02 UTC 2016

On Sat, 27 Feb 2016, Rick Macklem wrote:

> Bruce Evans wrote:
>> ...
>> I blamed newnfs before :-), but when I looked at newnfs more closely I
>> found that it was almost the same lexically in the most interesting
>> places (but unfortunately has lexical differences from s/nfs/ncl/,
>> and but doesn't have enough of these differences for debugging --
>> debugging is broken by having 2 static functions named nfs_foo() for
>> many values of foo).  But newnfs seems to have always been missing this
>> critical code:
>>
>> X   1541    rgrimes int
>> X  83651      peter nfs_writerpc(struct vnode *vp, struct uio *uiop, struct
>> ucred *cred,
>> X 158739     mohans 	     int *iomode, int *must_commit)
>> X   1541    rgrimes {
>> X   9336        dfr 		if (v3) {
>> X   9336        dfr 			wccflag = NFSV3_WCCCHK;
>> X ...
>> X 158739     mohans 		}
>> X 158739     mohans 		if (wccflag) {
>> X 158739     mohans 			mtx_lock(&(VTONFS(vp))->n_mtx);
>> X 158739     mohans 			VTONFS(vp)->n_mtime = VTONFS(vp)->n_vattr.va_mtime;
>>                     			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> X 158739     mohans 			mtx_unlock(&(VTONFS(vp))->n_mtx);
>> X 158739     mohans 		}
>>
> Well, this code does exist in the new client. The function is nfsrpc_writerpc()
> found in sys/fs/nfsclient/nfs_clrpcops.c. It calls nfscl_wcc_data() which does
> the same test as nfsm_wcc_data_xx() did in the old client.
> Then NFSWRITERPC_SETTIME() set the time if wccflag was set.
>
> However...NFSWRITERPC_SETTIME() is broken and sets n_mtime from the one in
> the nfs vnode (same as the old code). Unfortuantely the cached value in the nfs
> vnode hasn't been updated at that point. (The RPC layer functions don't do cache
> stuff.)
>
> The attached patch fixes this. Please test with the attached patch.

This passes my tests.  newfs now works slightly better than oldnfs in
FreeBSD-10 (higher throughput in re-read test and fewer read RPCs in
compile tests).

The negative regression in read RPCs might be a bug somewhere.  In a larger
compile test FreeBSD-11 has the opposite problem of many more read RPCs.
I haven't started isolating this.  Previous tests isolated the following
regressions in the number of RPCs:
- r208602:208603 gives more correct attribute cache clearing but more RPCs
- r209947:209948 gives the same
- r247115:247116 gives a 5-10% increase in sys time for makeworld by doing
   lots of locking in a slow way (thread_lock() in sigdefer/allowstop()).
   Removing the thread locking and doing racy accesses to td_flags reduces
   this overhead to below 5%.  Removing the calls to sigdefer/allowstop()
   reduces it to a hard-to-measure amount (thus uses even more flags in
   VFS_PROLOGUE/EPILOGUE()) and doesn't remove the overhead for other file
   systems, but branch prediction apparently works well unless there are
   function calls there).

For building a single kernel, in FreeBSD-[8-10] the RPC counts are
down by about 10% relative to my reference version (which has an old
implementation of negative name caching and cto improvements and a
hack to work around broken dotdot namecaching), but for makeworld they
are up by about 5%.  The increase is probably another namecache or
dirents bug, or just from different cache timeouts.

> ...
>> I tried this to see if it would fix the unordered writes.  I didn't
>> expect it to do much because I usually only have a single active
>> client and a single active writer per file.  It didn't make much
>> difference.

Today I tested on a 2-core system.  The loss from extra nfsiods was
much smaller than on a faster 8-core system with each core 2-3 times
as fast and a slower (max throughput 28MB/S down from 70) but slightly
lower latency network (ping latency 63 usec down from 75) (no loss up to
a file size of about 32MB).

> ...
>>> The patches were all client side. Maybe I'll try and recreate them.
>>
>> It seems to require lots of communication between separate nfsiods to
>> even preserve an order that has carefully been set up for them.  If
>> you have this then it is unclear why it can't be done more simply using
>> a single nfsiod thread (per NIC or ifq).  Only 1 thread should talk to
>> the NIC/ifq, since you lose control if put other threads in between.
>> If the NIC/ifq uses multiple threads then maintaining the order is its
>> problem.
>
> Yes. The patches I had didn't guarantee complete ordering in part because the
> one was for the nfsiods and the other for the RPC request in the krpc code.
> I will try and redo them one of these days.
>
> I will mention that the NFS RFCs have no requirement nor recommendation for
> ordering. The RPCs are assumed to be atomic entities. Having said that, I
> do believe that FreeBSD servers will perform better when the reads/writes are
> in sequentially increasing byte order, so this is worth working on.

FreeBSD servers do lots of clustering, but not very well.

Write clustering is easier than read clustering since it is reasonable to
wait a bit to combine writes.

A single client with a single writer is an easy case.  For a single client
with multiple writers, I think the nfsiods should try to keep the writes
sequential for each writer but not across writers.  This gives an order
similar to local writers.  FreeBSD servers (except maybe zfs ones) don't
handle multiple writers well, but an nfs client can't be expected to
understand the server's deficiencies better that the server does so that
they can be avoided.

Bruce