RFC: NFS client patch to reduce synchronous writes

Rick Macklem rmacklem at uoguelph.ca
Wed Nov 27 14:36:44 UTC 2013


Kostik wrote:
> On Tue, Nov 26, 2013 at 06:41:13PM -0500, Rick Macklem wrote:
> > Hi,
> > 
> > The current NFS client does a synchronous write
> > to the server when a non-contiguous write to the
> > same buffer cache block occurs. This is done because
> > there is a single dirty byte range recorded in the
> > buf structure. This results in a lot of synchronous
> > writes for software builds (I believe it is the loader
> > that loves to write small non-contiguous chunks to
> > its output file). Some users disable synchronous
> > writing on the server to improve performance, but
> > this puts them at risk of data loss when the server
> > crashes.
> > 
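The old/current behaviour described above can be sketched as a simple range check. This is an illustrative standalone sketch, not the actual sys/buf.h definitions: the field names mirror the real `b_dirtyoff`/`b_dirtyend` pair, but the struct and function are hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical sketch of the single dirty byte range kept in the buf
 * structure: [b_dirtyoff, b_dirtyend), with b_dirtyend == 0 meaning
 * the buffer is clean.  (Not the real sys/buf.h layout.)
 */
struct buf_sketch {
	int b_dirtyoff;		/* first dirty byte in the block */
	int b_dirtyend;		/* one past the last dirty byte */
};

/*
 * A new write of [on, on + n) can be merged into the recorded range
 * only if it touches or overlaps it.  When the ranges are disjoint,
 * the old client must flush the buffer synchronously first, which is
 * the behaviour the patch tries to avoid.
 */
static int
write_forces_sync(const struct buf_sketch *bp, int on, int n)
{
	if (bp->b_dirtyend == 0)
		return (0);	/* clean buffer: nothing to flush */
	if (on > bp->b_dirtyend || on + n < bp->b_dirtyoff)
		return (1);	/* disjoint range: old code flushes */
	return (0);		/* contiguous/overlapping: merge */
}
```

A loader writing small scattered chunks hits the disjoint case repeatedly, which is why builds generate so many synchronous Write RPCs.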
> > Long ago jhb@ emailed me a small patch that avoided
> > the synchronous writes by simply making the dirty byte
> > range a superset of the bytes written. The problem
> > with doing this is that a rare (possibly non-existent)
> > application writing non-overlapping byte ranges to the
> > same file from multiple clients concurrently could lose
> > some of those writes: stale data inside the superset
> > range would be written back to the server, overwriting
> > bytes another client had just written.
> > 
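The superset approach jhb@ suggested amounts to growing the range to cover both the old range and the new write. A minimal sketch (hypothetical names, not the patch itself):

```c
#include <assert.h>

/*
 * Hypothetical sketch of the superset/coalescing approach: instead of
 * flushing on a non-contiguous write, grow the recorded dirty range to
 * the smallest range covering both the old range and the new write.
 * The bytes in the gap were never written by this client, so writing
 * the superset back can push stale data over another client's
 * concurrent writes to that gap -- the hazard discussed above.
 */
struct range { int off; int end; };	/* end == 0 means clean */

static struct range
coalesce_dirty(struct range cur, int on, int n)
{
	struct range r;

	if (cur.end == 0) {		/* buffer was clean */
		r.off = on;
		r.end = on + n;
		return (r);
	}
	r.off = cur.off < on ? cur.off : on;
	r.end = cur.end > on + n ? cur.end : on + n;
	return (r);
}
```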
> > I created a patch that maintained a list of dirty byte
> > ranges. It was complicated and I found that the list
> > often had to be > 100 entries to avoid the synchronous
> > writes.
> > 
> > So, I think jhb@'s solution is preferable, although I've
> > added a couple of tweaks:
> > - The synchronous writes (old/current algorithm) are still
> >   used if file locking has been done on the file.
> >   (I think any app. that writes a file from multiple clients
> >    will/should use file locking.)
> > - The synchronous writes (old/current algorithm) are used
> >   if a sysctl is set. This avoids breakage for any app.
> >   (if there is one) that writes a file from multiple clients
> >   without doing file locking.
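The two tweaks reduce to one decision: fall back to the old synchronous path when either condition holds. A sketch, with the sysctl knob name and the "has been locked" flag both hypothetical (the patch's actual names may differ):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the fallback decision.  The sysctl name is
 * illustrative only; the real patch defines its own knob.
 */
static int nfs_noncontig_coalesce = 1;	/* proposed sysctl, default on */

static bool
use_old_sync_path(bool file_was_locked)
{
	/*
	 * Use the old synchronous-write algorithm when the file has
	 * seen advisory locking (apps sharing a file across clients
	 * should lock it) or when the admin disabled coalescing.
	 */
	return (file_was_locked || !nfs_noncontig_coalesce);
}
```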
> My feeling is that a global sysctl is too coarse-grained a control.
> IMO the setting should be per-mount. It is not unreasonable to have
> shared files on one export and perform private client operations on
> another, e.g. mailboxes and /usr/obj. I understand that the mailbox
> case should be covered by the advlock heuristic, but note that if the
> advlock is acquired after some writes, the heuristic breaks.
> 
> Also, since the feature might cause very hard to diagnose corruption,
> I think a facility to detect that dirty-range coalescing was done
> would be helpful. It could be a counter printed by mount -v, or even
> a warning printed once per mount.
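Kostik's two suggestions could be combined: a per-mount enable flag plus a counter that records whether coalescing ever happened on that mount. A sketch with hypothetical field names (the real struct nfsmount has many more fields):

```c
#include <assert.h>

/*
 * Hypothetical sketch of a per-mount control plus a diagnostic
 * counter, as suggested: field names are illustrative, not the real
 * struct nfsmount layout.
 */
struct nfsmount_sketch {
	int		nm_flag_noncontig;	/* set from a mount option */
	unsigned long	nm_coalesces;		/* reported by mount -v */
};

static int
try_coalesce(struct nfsmount_sketch *nmp)
{
	if (!nmp->nm_flag_noncontig)
		return (0);		/* old synchronous behaviour */
	nmp->nm_coalesces++;		/* leave evidence for diagnosis */
	return (1);
}
```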
> 
> > 
> > For testing on my very slow single-core hardware, I see about
> > a 10% improvement in kernel build times, along with far fewer
> > I/O RPCs:
> >              Read RPCs  Write RPCs
> > old/current  50K        122K
> > patched      39K         40K
> > --> it reduced the Read RPC count by about 20% and cut the
> >     Write RPC count to 1/3rd.
> > I think jhb@ saw pretty good performance results with his patch.
> > 
> > Anyhow, the patch is attached and can also be found here:
> >   http://people.freebsd.org/~rmacklem/noncontig-write.patch
> > 
> > I'd like to invite folks to comment/review/test this patch,
> > since I think it is ready for head/current.
> > 
> > Thanks, rick
> > ps: Kostik, maybe you could look at it. In particular, I am
> >     wondering if I zero'd out the buffer the correct way, via
> >     vfs_bio_bzero_buf()?
> Both vfs_bio_bzero_buf() and plain bzero() would work for the NFS
> client, since it does not use unmapped buffers.  Use of
> vfs_bio_bzero_buf() is preferred, since it leaves one less place to
> fix if NFS is ever converted to unmapped I/O.
> 
> In fact, I do not understand why the zeroing is needed. Could you
> please comment on it?
> 
> More, the zeroing seems to overwrite part of the previously valid
> content of the buffer, unless I mis-read the patch. This breaks
> mmap/file consistency, since the buffer pages may be mapped by a
> usermode process, which could observe the 'out of thin air' zeroes
> before the writing thread has progressed to fill the zeroed part
> with new content.
> 
Oh, I suppose a way to fix the zeroing of holes would be to zero out
pages whenever they are newly allocated to the file, via either the
mmap() or buffer cache route. This would result in a lot of zeroing to
fix a rather obscure (and rare?) case, but I suppose it's just a little
CPU overhead.
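The idea is that only the region beyond the previously valid data gets zeroed, so bytes a usermode process may already have mapped are never touched. A standalone sketch (the function name and fixed page size are illustrative, not from the patch):

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical sketch of zeroing at page-allocation time: zero only
 * the tail beyond the previously valid data, so an mmap()'d reader
 * sees zeroes only where the file genuinely has a hole, and bytes it
 * may already observe are left untouched.
 */
#define PAGE_SZ 4096

static void
zero_new_page(char *buf, int valid_end)
{
	if (valid_end < PAGE_SZ)
		memset(buf + valid_end, 0, PAGE_SZ - valid_end);
}
```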

rick



More information about the freebsd-fs mailing list