RFC: NFS client patch to reduce synchronous writes

Konstantin Belousov kostikbel at gmail.com
Thu Nov 28 19:19:04 UTC 2013


On Thu, Nov 28, 2013 at 09:15:37AM -0500, Rick Macklem wrote:
> Kostik wrote:
> > On Wed, Nov 27, 2013 at 07:19:47PM -0500, Rick Macklem wrote:
> > > Kirk wrote:
> > > > > Date: Wed, 27 Nov 2013 17:50:48 -0500 (EST)
> > > > > From: Rick Macklem <rmacklem at uoguelph.ca>
> > > > > To: Konstantin Belousov <kostikbel at gmail.com>
> > > > > Subject: Re: RFC: NFS client patch to reduce synchronous writes
> > > > > 
> > > > > Kostik wrote:
> > > > >> Sorry, I do not understand the question. mmap(2) itself does not
> > > > >> change file size.  But if the mmaped area includes the last page,
> > > > >> I still think that the situation I described before is possible.
> > > > > 
> > > > > Yes, I'll need to look at this. If it is a problem, all I can think
> > > > > of is bzeroing all new pages when they're allocated to the buffer
> > > > > cache.
> > > > > 
> > > > > Thanks for looking at it, rick
> > > > > ps: Btw, jhb@'s patch didn't have the bzeroing in it.
> > > > 
> > > > The ``fix'' of bzero'ing every buffer cache page was made to UFS/FFS
> > > > for this problem and it killed write performance of the filesystem
> > > > by nearly half. We corrected this by only doing the bzero when the
> > > > file is mmap'ed, which helped things considerably (since most files
> > > > being written are not also mmap'ed).
> > > > 
> > > > 	Kirk
> > > > 
> > > Ok, thanks. I've been trying to reproduce the problem over NFS and
> > > haven't been able to break my patch. I was using the attached trivial
> > > test program and would simply make a copy of the source file (529 bytes)
> > > to test on. I got the same results both locally and over NFS:
> > > - built without -DWRITEIT, the setting of a value after EOF would be
> > >   lost, because nothing grew the file from 529 bytes to over 4080 bytes.
> > > - built with -DWRITEIT, both the 'A' and 'B' are in the result, since
> > >   my patch bzeros the grown segment in the write(2) syscall.
> > > 
> > > - If I move the write (code in #ifdef WRITEIT) to after the "*cp" store
> > >   of the mapped page, the 'A' assigned to "*cp" gets lost for
> > >   both UFS and NFS.
> > >   Is this correct behaviour?
> > > 
> > > If it is correct behaviour, I can't see how the patch is broken, but
> > > if you think it might still be, I'll look at doing what Kirk suggests,
> > > which is bzeroing all new buffer cache pages when the file is mmap()d.
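
The attached test program is not reproduced in this archive message; a
hypothetical reconstruction along the lines described above might look like
the following (the file name, offsets, and error handling are guesses, not
the original code):

/*
 * Hypothetical reconstruction of the test described above.  The source
 * file copied for the test is 529 bytes, so offset 4080 lies in the
 * first page of the mapping but past EOF.
 */
#include <sys/mman.h>
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	char *cp;
	int fd;

	if (argc != 2)
		errx(1, "usage: %s file", argv[0]);
	fd = open(argv[1], O_RDWR);
	if (fd < 0)
		err(1, "open");

#ifdef WRITEIT
	/* Grow the file past offset 4080 by writing a 'B' there. */
	if (pwrite(fd, "B", 1, 4090) != 1)
		err(1, "pwrite");
#endif

	/* Map the first page and store an 'A' beyond the old 529-byte EOF. */
	cp = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (cp == MAP_FAILED)
		err(1, "mmap");
	cp[4080] = 'A';

	if (munmap(cp, 4096) < 0)
		err(1, "munmap");
	close(fd);
	return (0);
}

Without -DWRITEIT the file is never grown, so the store past EOF is lost;
with -DWRITEIT the write(2) grows the file first, and both bytes survive.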
> > > 
> > Replying here, since a text description is more informative than the
> > code.
> > 
> > You cannot get the situation I described with a single process.
> > You need a writer in one thread, and a reader through the mmaped
> > area in another.  Even then, the race window is thin.
> > 
> > Let me describe the issue which could exist one more time:
> > 
> > Thread A (writer) issues write(2).  The kernel does two things:
> > 1. It zeroes part of the last buffer of the affected file.
> > 2. It uiomove()s the write data into the buffer's b_data.
> > 
> > Now, assume that thread B has the same file mmaped somewhere, and
> > accesses the page of the buffer after [1] but before [2].  Then it
> > would see zeroes instead of the valid content.
> > 
> > I said that this breaks write/mmap consistency, since thread B can
> > see content in the file which was never written there.  The condition
> > is transient; it self-repairs after thread A passes point 2.
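
A minimal userland sketch of that scenario (not from the original thread;
the file name, sizes, and polling loop are assumptions, and whether the
window is ever observed depends entirely on timing):

/*
 * Thread A appends non-zero bytes with write(2); the main thread watches
 * the same file through a shared mapping and reports any zero byte it
 * sees below the file size reported by fstat(2).  Any such zero is data
 * that was never written to the file.
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAPLEN	(1024 * 1024)
#define CHUNK	1000

static int fd;

static void *
writer(void *arg)
{
	char buf[CHUNK];
	size_t total;

	(void)arg;
	memset(buf, 'x', sizeof(buf));
	/* Keep appending; each append grows the last buffer of the file. */
	for (total = 0; total + CHUNK <= MAPLEN; total += CHUNK)
		if (write(fd, buf, sizeof(buf)) != sizeof(buf))
			err(1, "write");
	return (NULL);
}

int
main(void)
{
	struct stat sb;
	pthread_t wt;
	char *map;
	off_t i;

	fd = open("racefile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		err(1, "open");
	map = mmap(NULL, MAPLEN, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		err(1, "mmap");
	if (pthread_create(&wt, NULL, writer, NULL) != 0)
		errx(1, "pthread_create");

	/* Scan below the current EOF for bytes that should never be zero. */
	do {
		if (fstat(fd, &sb) != 0)
			err(1, "fstat");
		for (i = 0; i < sb.st_size; i++)
			if (map[i] == 0)
				printf("transient zero at offset %jd\n",
				    (intmax_t)i);
	} while (sb.st_size + CHUNK <= MAPLEN);

	pthread_join(wt, NULL);
	return (0);
}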
> > 
> Ok, but doesn't that exist now?
> 
> Without the patch, when Thread A (the writer) appends to the file,
> the np->n_size is grown and vnode_pager_setsize() is called with
> the new, larger size, followed by an allocbuf().
> { at around line# 1066,1073 of nfs_clbio.c. Same code exists in the
>   old client and was cloned. }
> Then vn_io_fault_uiomove() is called at line#1195.
> 
> Without the patch, Thread B would get whatever garbage is in the
> page(s) if it accesses the latter (just grown) part of the buffer between
> 1 and 2, would it not?
> 
> With the patch, it will most likely get 0s instead of garbage, if
> it accesses the latter (just grown) part of the buffer between 1. and 2.
> There is actually a short time just after the vnode_pager_setsize()
> and before the bzeroing during which it would still get garbage.
> (Note that the patch only zeros out the part of the last page that
>  was "grown" by the write in 1., i.e. the part past the previous EOF.
>  If I were bzeroing data that was before the previous EOF, I could see
>  the breakage, but the patch doesn't do that. Unless I've written
>  buggy code, of course, but I think it's correct.)
> 
> So, I don't see any difference between the unpatched and patched
> versions?
AFAIU, you bzero the whole region in the buffer which is subject to i/o,
not just the new extent.  E.g., if the file size mod biosize is non-zero
but the current write starts on the buffer boundary, then the range
from the buffer start to the old EOF is zeroed for a moment before
being overwritten by the user data.
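
A small userland illustration of that range arithmetic (assuming an 8 KB
biosize and arbitrary numbers; this is not the nfs_clbio.c code):

/*
 * With an old EOF in the middle of a buffer and a write starting on the
 * buffer boundary, the region the write touches includes valid data
 * below the old EOF.  Only the part past the old EOF is genuinely new
 * and safe to pre-zero.
 */
#include <sys/types.h>
#include <stdint.h>
#include <stdio.h>

#define BIOSIZE	8192			/* assumed NFS buffer size */

int
main(void)
{
	off_t oldsize = 12345;		/* old EOF, mid-buffer */
	off_t wr_off = (oldsize / BIOSIZE) * BIOSIZE; /* buffer boundary */
	off_t wr_len = BIOSIZE;		/* write grows the file past oldsize */
	off_t zero_from;

	/* The region subject to i/o covers valid data below the old EOF. */
	printf("write covers        [%jd, %jd)\n",
	    (intmax_t)wr_off, (intmax_t)(wr_off + wr_len));
	printf("old data re-zeroed  [%jd, %jd)\n",
	    (intmax_t)wr_off, (intmax_t)oldsize);

	/* Only the extent past the old EOF actually needs zeroing. */
	zero_from = oldsize > wr_off ? oldsize : wr_off;
	printf("new extent          [%jd, %jd)\n",
	    (intmax_t)zero_from, (intmax_t)(wr_off + wr_len));
	return (0);
}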

