RFC: NFS client patch to reduce sychronous writes

Konstantin Belousov kostikbel at gmail.com
Wed Nov 27 08:52:52 UTC 2013


On Tue, Nov 26, 2013 at 06:41:13PM -0500, Rick Macklem wrote:
> Hi,
> 
> The current NFS client does a synchronous write
> to the server when a non-contiguous write to the
> same buffer cache block occurs. This is done because
> there is a single dirty byte range recorded in the
> buf structure. This results in a lot of synchronous
> writes for software builds (I believe it is the loader
> that loves to write small non-contiguous chunks to
> its output file). Some users disable synchronous
> writing on the server to improve performance, but
> this puts them at risk of data loss when the server
> crashes.
> 
> Long ago jhb@ emailed me a small patch that avoided
> the synchronous writes by simply making the dirty byte
> range a superset of the bytes written. The problem
> with doing this is that for a rare (possibly non-existent)
> application that writes non-overlapping byte ranges
> to the same file from multiple clients concurrently,
> some of these writes might get lost by stale data in
> the superset of the byte range being written back to
> the server. (Crappy, run on sentence, but hopefully
> it makes sense;-)
> 
> I created a patch that maintained a list of dirty byte
> ranges. It was complicated and I found that the list
> often had to be > 100 entries to avoid the synchronous
> writes.
> 
> So, I think his solution is preferable, although I've
> added a couple of tweaks:
> - The synchronous writes (old/current algorithm) is still
>   used if there has been file locking done on the file.
>   (I think any app. that writes a file from multiple clients
>    will/should use file locking.)
> - The synchronous writes (old/current algorithm) is used
>   if a sysctl is set. This will avoid breakage for any app.
>   (if there is one) that writes a file from multiple clients
>   without doing file locking.
My feeling is that global sysctl is too coarse granularity for the
control. IMO the setting should be per-mount. It is not unreasonable to
have shared files on one export, and perform private client operations
on another. E.g. mailboxes and /usr/obj; I understand that mailboxes
case should be covered by the advlock heuristic. But note that if
advlock is acquired after some writes, the heuristic breaks.

Also, since the feature might cause very hard to diagnose corruption, I
think a facility to detect that dirty range coalescing was done would be
helpful. It could be a counter printed by mount -v, or even a warning
printed once per mount.

> 
> For testing on my very slow single core hardware, I see about
> a 10% improvement in kernel build times, but with fewer I/O
> RPCs:
>              Read RPCs  Write RPCs
> old/current  50K        122K
> patched      39K         40K
> --> it reduced the Read RPC count by about 20% and cut the
>     Write RPC count to 1/3rd.
> I think jhb@ saw pretty good performance results with his patch.
> 
> Anyhow, the patch is attached and can also be found here:
>   http://people.freebsd.org/~rmacklem/noncontig-write.patch
> 
> I'd like to invite folks to comment/review/test this patch,
> since I think it is ready for head/current.
> 
> Thanks, rick
> ps: Kostik, maybe you could look at it. In particular, I am
>     wondering if I zero'd out the buffer the correct way, via
>     vfs_bio_bzero_buf()?
Both vfs_bio_bzero_buf() and plain bzero() would work for NFS client,
since it does not use unmapped buffers.  Use of vfs_bio_bzero_buf()
is preferred since it makes one less place to find if converting NFS
to unmapped i/o.

In fact, I do not understand why the zeroing is needed. Could you, please,
comment on it ?

More, the zeroing seems to overwrite part of the previously valid
content of the buffer, unless I mis-read the patch. This breaks
the mmap/file consistency, since buffer pages may be mapped by usermode
process, and it could observe the 'out of thin air' zeroes before the
writing thread progressed to fill the zeroed part with new content.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 834 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20131127/b1b33580/attachment.sig>


More information about the freebsd-fs mailing list