NFS calculation of max commit size

Wed Aug 17 13:15:17 UTC 2011

Jeremy Chadwick wrote:
> On Tue, Aug 16, 2011 at 09:31:35AM -0400, John Baldwin wrote:
> > On Monday, August 15, 2011 10:25:54 pm Jeremy Chadwick wrote:
> > > On Mon, Aug 15, 2011 at 06:58:14PM -0400, Rick Macklem wrote:
> > > > John Baldwin wrote:
> > > > > On Sunday, August 07, 2011 6:47:46 pm Rick Macklem wrote:
> > > > > > A recent PR (kern/159351) noted that the following
> > > > > > calculation results in a divide-by-zero when
> > > > > > desiredvnodes < 1000.
> > > > > >
> > > > > > 	nmp->nm_wcommitsize = hibufspace / (desiredvnodes / 1000);
> > > > > >
> > > > > > > Just fixing the divide-by-zero is easy enough, but I'm not sure
> > > > > > > what this calculation is trying to do. Making it a fraction of
> > > > > > > "hibufspace" makes sense (nm_wcommitsize is the maximum # of
> > > > > > > bytes of uncommitted data in the NFS client's buffer cache
> > > > > > > blocks, if I understand it correctly), but why divide it by
> > > > > >
> > > > > >                 (desiredvnodes / 1000) ??
> > > > > >
> > > > > > Maybe thinking that fewer vnodes means sharing it with fewer
> > > > > > other file systems or ???
> > > > > >
> > > > > > > Anyhow, it seems to me that the formula is bogus for small
> > > > > > > values of desiredvnodes (for example desiredvnodes == 1500
> > > > > > > implies nm_wcommitsize == hibufspace, which sounds too large
> > > > > > > to me).
> > > > > > >
> > > > > > > I'm thinking that putting an upper limit of 10% of hibufspace
> > > > > > > might make sense, i.e. change the above to:
> > > > > >
> > > > > > 	if (desiredvnodes >= 11000)
> > > > > > 		nmp->nm_wcommitsize = hibufspace / (desiredvnodes / 1000);
> > > > > > 	else
> > > > > > 		nmp->nm_wcommitsize = hibufspace / 10;
> > > > > >
> > > > > > Anyone have comments or insight into this calculation?
> > > > > >
> > > > > > rick
> > > > > > ps: jhb, I hope you don't mind. I emailed you first and then
> > > > > >     thought others might have some ideas, too.
> > > > >
> > > > > > Oh no, this is fine. A broader discussion is probably warranted.
> > > > > > I honestly don't know what the goal is. I do think it is an
> > > > > > attempt to share with other file systems, but I'm not sure how
> > > > > > desiredvnodes / 1000 is useful for that. It also seems that we
> > > > > > can end up setting this woefully low as well. That is, I wonder
> > > > > > if we need a minimum of 10% of hibufspace so that it can scale
> > > > > > between 10% and 90% of hibufspace (but I'm not sure what you
> > > > > > would use to pick the scaling factor sanely). To my mind what
> > > > > > you really want to do is something like
> > > > > > 'hibufspace / (number of active mounts)', but that will not
> > > > > > really work correctly unless we recalculate the value on each
> > > > > > mount and unmount operation.
> > > > >
> > > > > --
> > > > > John Baldwin
> > > > > Btw, this was done by r147280 6.5 years ago, so the formula
> > > > > doesn't seem to be causing a lot of grief. Also of some interest
> > > > > is the fact that wcommitsize appears to have been settable on a
> > > > > per-mount-point basis until mount_nfs(8) was converted to
> > > > > nmount(2). { There is no nmount option to set it. }
> > > >
> > > > > Btw, when nm_wcommitsize is exceeded, writes become synchronous,
> > > > > so it affects how much write behind happens. This, in turn,
> > > > > affects how bursty (is this a real word? hopefully you get what I
> > > > > mean?) the write traffic to the server is.
> > > > >
> > > > > What I'm not sure about is what happens when multiple mounts use
> > > > > up the entire buffer cache with write behinds. I'll try a little
> > > > > experiment to see if I can find that out. (If making it large
> > > > > isn't detrimental, then I tend to agree that the above sets
> > > > > nm_wcommitsize very small.)
> > > > >
> > > > > Since "desiredvnodes" will seldom be less than 1000, I'm not
> > > > > going to rush to a solution.
> > > > >
> > > > > Anyone who has insight into what this formula should be, please
> > > > > let us know.
> > >
> > > The commit message tries to explain it, but it's more than just a
> > > one-line change.
> > >
> > > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/nfsclient/nfs_vfsops.c#rev1.177
> > >
> > > There's also an associated PR:
> > >
> > > http://www.freebsd.org/cgi/query-pr.cgi?pr=79208
> >
> > The commit added the limit which is sensible, but it doesn't explain
> > the logic for how the limit is computed (that is, why it uses
> > desiredvnodes / 1000).
> 
> Understood -- what I was getting at was that the individuals
> responsible for the commit (several people reviewed it) could be
> contacted and inquiries submitted. :-)
> 
I did email the original committer and have not heard back. (I didn't
try the reviewer(s).)

I'm going to start doing a little experimentation with this and will
report back when I have something that might be of interest.

I think that any fraction of hibufspace should be sufficient to avoid
the deadlock. Also, since the buffer cache code doesn't use vnode locking
these days, I'm not even sure if write backs are blocked by the write
vnode op in progress. (i.e. I'm not sure the deadlock it originally fixed
would still happen without it.)
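
For anyone who wants to poke at the arithmetic from the quoted thread, here
is a minimal userland sketch. It is not the kernel code: the function names
are hypothetical, hibufspace and desiredvnodes are passed as plain
parameters rather than read from kernel globals, and the -1 return is an
assumed sentinel standing in for the divide-by-zero reported in the PR.

```c
/*
 * Userland sketch only; names and the -1 sentinel are assumptions
 * made for illustration, not taken from nfs_vfsops.c.
 */

/* Original formula: hibufspace / (desiredvnodes / 1000). */
long
wcommitsize_orig(long hibufspace, long desiredvnodes)
{
	long divisor = desiredvnodes / 1000;	/* integer division */

	if (divisor == 0)
		return (-1);	/* the kernel would divide by zero here */
	return (hibufspace / divisor);
}

/* Variant proposed in the thread: cap the result at 10% of hibufspace. */
long
wcommitsize_capped(long hibufspace, long desiredvnodes)
{
	if (desiredvnodes >= 11000)
		return (hibufspace / (desiredvnodes / 1000));
	return (hibufspace / 10);
}
```

With hibufspace == 1000000, the original formula gives 1000000 for
desiredvnodes == 1500 (the whole of hibufspace) and hits the zero divisor
for desiredvnodes == 999, while the capped variant gives 100000 for both
and 50000 for desiredvnodes == 20000.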

rick

> --
> | Jeremy Chadwick jdc at parodius.com |
> | Parodius Networking http://www.parodius.com/ |
> | UNIX Systems Administrator Mountain View, CA, US |
> | Making life hard for others since 1977. PGP 4BD6C0CB |