Fixes to allow write clustering of NFS writes from a FreeBSD NFS client

Fri Aug 26 01:16:19 UTC 2011

Bob Friesenhahn wrote:
> On Thu, 25 Aug 2011, John Baldwin wrote:
> 
> > I was doing some analysis of compiles over NFS at work recently and
> > noticed from 'iostat 1' on the NFS server that all my NFS writes were
> > always 16k writes (meaning that writes were never being clustered). I
> > added some debugging sysctls to the NFS client and server code as well
> > as the FFS write VOP to figure out the various kinds of write requests
> > that were being sent. I found that during the NFS compile, the NFS
> > client was sending a lot of FILESYNC writes even though nothing in the
> > compile process uses fsync().
> 
> A fundamental principle of NFS is that writes are synchronous so that
> if the server spontaneously reboots, all the acknowledged writes will
> still be present on disk and the client just continues (after a delay)
> without loss/corruption of data. NFSv3 added the ability to send
> uncommitted data to the server, with the agreement that the client
> would agree to re-send any uncommitted data if the server
> spontaneously rebooted. Most clients are not responsibly prepared to
> participate in this since it would require some non-volatile local
> storage on the client.
> 
Although I wouldn't want to say it's bug free, I believe that the FreeBSD
NFS client code (the new client clones the old one in this regard) does
handle UNSTABLE writes (data that will be committed later, or re-written
if the server reboots before the Commit RPC completes).

I have tested this a little and it seemed to work, including doing the
write RPCs again, if the server was rebooted before the Commit RPC
completed.
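
To make the UNSTABLE/Commit handling concrete, here is a toy sketch in
Python. It is not the FreeBSD client code, and every class and method
name in it is made up for illustration. The only protocol fact it relies
on is that an NFSv3 server returns a write verifier with its Write and
Commit replies, and that the verifier changes when the server reboots.
The client keeps each uncommitted block until a Commit comes back with
the same verifier the Write returned; otherwise it writes the block again.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Hypothetical stand-in for an NFSv3 server (no real RPCs)."""
    verifier: int = 1                            # changes on every reboot
    cache: dict = field(default_factory=dict)    # volatile, lost on reboot
    disk: dict = field(default_factory=dict)     # stable storage

    def write_unstable(self, offset, data):
        self.cache[offset] = data                # not yet on stable storage
        return self.verifier

    def commit(self):
        self.disk.update(self.cache)             # flush cached data to disk
        self.cache.clear()
        return self.verifier

    def reboot(self):
        self.cache.clear()                       # uncommitted data is gone
        self.verifier += 1


class ToyClient:
    """Hypothetical client that holds data until it is committed."""

    def __init__(self, server):
        self.server = server
        self.uncommitted = {}                    # offset -> (data, verifier)

    def write(self, offset, data):
        verf = self.server.write_unstable(offset, data)
        self.uncommitted[offset] = (data, verf)

    def commit(self):
        """Commit all pending data; return how many blocks were re-sent."""
        resent = 0
        while True:
            verf = self.server.commit()
            stale = [off for off, (_, v) in self.uncommitted.items()
                     if v != verf]
            if not stale:
                break                            # everything is on disk
            for off in stale:                    # server rebooted: write again
                data = self.uncommitted[off][0]
                self.uncommitted[off] = (
                    data, self.server.write_unstable(off, data))
                resent += 1
        self.uncommitted.clear()
        return resent
```

If the toy server is rebooted between the writes and the commit, commit()
re-sends every pending block and only then drops its copies, which is the
behaviour described above.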

I think that the tradition of asynchronous writes (where the RPC is
started right away) needs to be largely replaced by delayed writes
(just mark the block dirty and write it back sometime later). The
trick here is to avoid flooding the buffer cache or generating large
bursts of write RPCs by doing the write backs at an appropriate rate
and using the largest size possible. (NFSv3 and NFSv4 servers specify the
largest write RPC size they can handle. As I noted in the other post,
this is 1Mbyte for Solaris 10 and I'd like to see the FreeBSD server
doing the same, but it's currently only MAXBSIZE == 64K.)
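
As a rough illustration of "using the largest size possible", the Python
sketch below groups delayed (dirty) blocks into as few write requests as
the server's maximum write size allows. It is only a sketch under
assumptions: the function name coalesce() and its parameters are
hypothetical, it is not how the FreeBSD buffer cache does clustering, and
the 16K block size is simply the write size reported earlier in this
thread.

```python
def coalesce(dirty_offsets, block_size, max_rpc_size):
    """Merge contiguous dirty blocks into (offset, length) write requests.

    A run is extended only while the next dirty block is adjacent to it
    and the request stays within max_rpc_size.
    """
    runs = []
    for off in sorted(set(dirty_offsets)):
        if runs:
            start, length = runs[-1]
            adjacent = (start + length == off)
            fits = (length + block_size <= max_rpc_size)
            if adjacent and fits:
                runs[-1] = (start, length + block_size)
                continue
        runs.append((off, block_size))           # start a new request
    return runs


K = 1024
dirty = [0, 16 * K, 32 * K, 48 * K, 64 * K, 128 * K]

# With a 64K limit the five contiguous blocks need two requests.
small = coalesce(dirty, 16 * K, 64 * K)

# With a 1Mbyte limit they fit in one.
large = coalesce(dirty, 16 * K, 1024 * K)
```

Here small comes out as three requests (64K at offset 0, 16K at 64K, 16K
at 128K) and large as two (80K at offset 0, 16K at 128K), versus six
separate 16K writes if nothing is clustered.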

> I don't know if your changes would harm these expectations.
> 
> Regardless, there is little doubt that the default client NFS in
> FreeBSD 8.2 suffers quite a lot in sequential write performance as
> compared with an OS like Solaris. Hopefully the new NFS that Rick
> Macklem has been working on (and is apparently ready for general use)
> will perform much better. Since FreeBSD is switching to the new
> implementation it seems like that is where the efforts should be
> going.
> 
Well, the two clients are clones w.r.t. the buffer cache stuff at
this point. I did that because:
1 - I don't understand the buffer cache code well enough to modify it
    without breaking it.
2 - I wanted the 2 clients to be "bug compatible" during the switchover
    of defaults.
Given this, the performance will be about the same at this point.

However, getting the clients to do less synchronous writing (both w.r.t.
doing them right away and w.r.t. setting FILESYNC instead of UNSTABLE)
and to send fewer, larger write RPCs could be worth the effort, I think.

One thing I do have on the "futures" list (a patch that others can test
should be out soon) is client-side on-disk caching, but only for the
specific case where the client holds an NFSv4 delegation for the file.
(I call this Packrats, so when you see a posting about a Packrats patch,
this is what it is, and if you can try it, please do so. You might like
how it performs. :-)

rick

> Bob
> --
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us,
> http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/