RFC: NFS client patch to reduce synchronous writes
Rick Macklem
rmacklem at uoguelph.ca
Wed Nov 27 14:28:57 UTC 2013
Kostik wrote:
> On Tue, Nov 26, 2013 at 06:41:13PM -0500, Rick Macklem wrote:
> > Hi,
> >
> > The current NFS client does a synchronous write
> > to the server when a non-contiguous write to the
> > same buffer cache block occurs. This is done because
> > there is a single dirty byte range recorded in the
> > buf structure. This results in a lot of synchronous
> > writes for software builds (I believe it is the loader
> > that loves to write small non-contiguous chunks to
> > its output file). Some users disable synchronous
> > writing on the server to improve performance, but
> > this puts them at risk of data loss when the server
> > crashes.
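To illustrate, the write path currently does roughly this (simplified
from ncl_write() in sys/fs/nfsclient/nfs_clbio.c; "on" is the write's
offset within the block and "n" is its length):

    /*
     * If the new write is not contiguous with the single dirty
     * byte range recorded in the buf, the buffer must be written
     * back to the server synchronously before the new range can
     * be recorded.
     */
    if (bp->b_dirtyend > 0 &&
        (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
            if (bwrite(bp) == EINTR) {
                    error = EINTR;
                    break;
            }
            goto again;
    }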
> >
> > Long ago jhb@ emailed me a small patch that avoided
> > the synchronous writes by simply making the dirty byte
> > range a superset of the bytes written. The risk in doing
> > this is that a rare (possibly non-existent) application
> > that writes non-overlapping byte ranges to the same file
> > from multiple clients concurrently could lose some of
> > those writes: stale data within the superset range gets
> > written back to the server, overwriting bytes written by
> > another client.
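The coalescing itself is trivial; the superset is just the min/max of
the old and new ranges, something like this (a sketch, not the exact
patch):

    /*
     * Merge the new write's range [on, on + n) into the recorded
     * dirty range.  Any clean bytes that fall between the two
     * ranges become part of the superset and get written back,
     * which is where the multiple-client risk comes from.
     */
    if (bp->b_dirtyend > 0) {
            bp->b_dirtyoff = min(on, bp->b_dirtyoff);
            bp->b_dirtyend = max(on + n, bp->b_dirtyend);
    } else {
            bp->b_dirtyoff = on;
            bp->b_dirtyend = on + n;
    }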
> >
> > I created a patch that maintained a list of dirty byte
> > ranges. It was complicated and I found that the list
> > often had to be > 100 entries to avoid the synchronous
> > writes.
> >
> > So, I think his solution is preferable, although I've
> > added a couple of tweaks (sketched below):
> > - The synchronous writes (old/current algorithm) are still
> > done if file locking has been used on the file.
> > (I think any app. that writes a file from multiple clients
> > will/should use file locking.)
> > - The synchronous writes (old/current algorithm) are used
> > if a sysctl is set. This avoids breakage for any app.
> > (if there is one) that writes a file from multiple clients
> > without doing file locking.
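Putting the two tweaks together, the synchronous-write fallback ends
up gated on something like the following (the sysctl and flag names
here are illustrative, not necessarily what the patch uses):

    /*
     * Keep the old synchronous behaviour if the new handling is
     * disabled via the (hypothetical) sysctl, or if advisory
     * locking has ever been done on this file.
     */
    if (bp->b_dirtyend > 0 &&
        (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff) &&
        (noncontig_write_enable == 0 ||
         (np->n_flag & NHASBEENLOCKED) != 0)) {
            /* ... old synchronous bwrite() path ... */
    }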
> My feeling is that a global sysctl is too coarse-grained a control
> for this. IMO the setting should be per-mount. It is not unreasonable
> to have shared files on one export and perform private client
> operations on another, e.g. mailboxes and /usr/obj; I understand that
> the mailboxes case should be covered by the advlock heuristic. But
> note that if the advlock is acquired after some writes, the heuristic
> breaks.
>
> Also, since the feature might cause very hard to diagnose corruption,
> I think a facility to detect that dirty range coalescing was done
> would be helpful. It could be a counter printed by mount -v, or even
> a warning printed once per mount.
>
Ok, I can make it a mount option. The only reason I didn't do that is
that I was trying to avoid "yet another" mount option that many don't
know how to use properly. I'll try to come up with a good explanation
for the man page.
I suppose a mount option makes more sense if it enables the new
behaviour, since that keeps the default behaviour unchanged (no POLA
violation).
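For example, assuming the option ends up spelled "noncontigwr" (the
name isn't final), enabling it would look like:

    # opt in per-mount; the default behaviour stays unchanged
    mount -t nfs -o noncontigwr server:/export /mnt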
> >
> > For testing on my very slow single-core hardware, I see about
> > a 10% improvement in kernel build times, along with far fewer
> > I/O RPCs:
> >
> >               Read RPCs   Write RPCs
> > old/current   50K         122K
> > patched       39K         40K
> >
> > --> it reduced the Read RPC count by about 20% and cut the
> > Write RPC count to 1/3rd.
> > I think jhb@ saw pretty good performance results with his patch.
> >
> > Anyhow, the patch is attached and can also be found here:
> > http://people.freebsd.org/~rmacklem/noncontig-write.patch
> >
> > I'd like to invite folks to comment/review/test this patch,
> > since I think it is ready for head/current.
> >
> > Thanks, rick
> > ps: Kostik, maybe you could look at it. In particular, I am
> > wondering if I zero'd out the buffer the correct way, via
> > vfs_bio_bzero_buf()?
> Both vfs_bio_bzero_buf() and plain bzero() would work for the NFS
> client, since it does not use unmapped buffers. Use of
> vfs_bio_bzero_buf() is preferred, since it leaves one less place to
> fix if the NFS client is ever converted to unmapped I/O.
>
> In fact, I do not understand why the zeroing is needed. Could you
> please comment on it?
>
Well, if an app. writes a file with holes in it, then without the
bzeroing a hole can end up containing garbage instead of 0s. See the
attached trivial test program I used.
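The attachment isn't reproduced here, but the test is essentially the
following (my sketch of it, not necessarily the exact program):

    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * Write a block, lseek() well past EOF to create a hole, write
     * another block, then read the hole back.  Every byte of the
     * hole must read as 0; with broken zeroing in the client it can
     * come back as garbage on an NFS mount.
     */
    int
    main(void)
    {
            char buf[512], hole[8192];
            int fd, i;

            fd = open("holetest", O_RDWR | O_CREAT | O_TRUNC, 0644);
            if (fd < 0)
                    err(1, "open");
            memset(buf, 'a', sizeof(buf));
            if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                    err(1, "write");
            /* Create the hole by seeking past the current EOF. */
            if (lseek(fd, (off_t)(sizeof(buf) + sizeof(hole)),
                SEEK_SET) == -1)
                    err(1, "lseek");
            if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                    err(1, "write");
            /* Read the hole back and check that it is all 0s. */
            if (lseek(fd, (off_t)sizeof(buf), SEEK_SET) == -1)
                    err(1, "lseek");
            if (read(fd, hole, sizeof(hole)) != sizeof(hole))
                    err(1, "read");
            for (i = 0; i < (int)sizeof(hole); i++)
                    if (hole[i] != 0)
                            errx(1, "garbage at hole offset %d", i);
            printf("hole reads back as all 0s\n");
            close(fd);
            return (0);
    }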
> More, the zeroing seems to overwrite part of the previously valid
> content of the buffer, unless I mis-read the patch. This breaks
> mmap/file consistency, since the buffer pages may be mapped by a
> usermode process, which could observe the 'out of thin air' zeroes
> before the writing thread has progressed to fill the zeroed part
> with new content.
>
Well, obcount is set to the offset within the block of the current EOF
(np->n_size - lbn * biosize), and the zeroing runs from there to the
new size of the buffer. My intent was to zero out only the chunk that
is being "grown" by this write. If that part of the file is already
mmap()'d and could already have been written by an app., I can see the
problem, but I don't know how it would be fixed.
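For reference, the zeroing amounts to roughly this (my simplification
of the patch):

    /*
     * The file is being extended within this block: zero from the
     * old EOF's offset within the block (obcount) up to the new
     * end of valid data (bcount), so the grown chunk reads as 0s.
     */
    obcount = np->n_size - (off_t)lbn * biosize;
    if (bcount > obcount)
            vfs_bio_bzero_buf(bp, obcount, bcount - obcount);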
I'll try to come up with a test case for the mmap consistency issue.
I'll admit I don't know when the file's size (n_size) gets updated
when the file is mmap()'d.
rick