NFS calculation of max commit size

Rick Macklem rmacklem at uoguelph.ca
Wed Aug 17 22:18:55 UTC 2011


Kostik Belousov wrote:
> On Wed, Aug 17, 2011 at 09:15:15AM -0400, Rick Macklem wrote:
> >
> > I think that any fraction of hibufspace should be sufficient to
> > avoid the deadlock. Also, since the buffer cache code doesn't use
> > vnode locking these days, I'm not even sure if write backs are
> > blocked by the write vnode op in progress. (ie. I'm not sure the
> > deadlock it originally fixed would still happen without it.)
> 
> bufdaemon definitely acquires the vnode lock when flushing a dirty
> buffer; this was a problem on its own. I think you refer to the
> nfsiod operation.
> 
Ok, so I think this means that the deadlock can still occur.
I haven't yet played with the code, but I now think I might understand
the logic behind dividing by "(desiredvnodes / 1000)".
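
For reference, here is roughly how I read the current calculation (a
sketch from memory, not a verbatim copy of the mount code):

    /*
     * My reading of the existing code: the divisor becomes zero
     * when desiredvnodes < 1000, which is the divide by zero the
     * PR reported.
     */
    nmp->nm_wcommitsize = hibufspace / (desiredvnodes / 1000);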

If a single large write is happening to one NFS vnode, setting
nm_wcommitsize to any fraction of hibufspace should avoid the deadlock,
I think. (If I understand it correctly, the deadlock occurs when an
NFS VOP_WRITE() runs out of buffer cache and no buffer cache blocks
can be cleaned out because it is holding a lock on the vnode.)

But, what happens if K processes concurrently do large writes on K
NFS vnodes?
- It seems to me that they could all deadlock when the buffer cache
  becomes exhausted, since they all hold locks on their respective
  vnodes and, therefore, none of the dirty buffers can be flushed.
  - If this is correct, then I think the only "safe" answer is:
     nm_wcommitsize = hibufspace / desiredvnodes;
    since it is possible that almost all vnodes could be assigned to
    NFS files being written concurrently with large writes.
  However, this would result in an absurdly low value for
  nm_wcommitsize (1000 times smaller than the current one, so roughly
  2Kbytes instead of the ~2Mbytes mentioned below).

--> My best guess is the original author assumed that 0.1% of all vnodes
    would be a reasonable upper bound on the number being written by NFS
    concurrently with large writes.
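
Spelling that guess out (just rewriting the same formula):

    hibufspace / (desiredvnodes / 1000)
        == hibufspace / (0.001 * desiredvnodes)

i.e. the divisor is 0.1% of desiredvnodes, the assumed upper bound on
the number of vnodes concurrently being written with large writes.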

By the way, since nm_wcommitsize is applied to a single write, it only
affects write(2) syscalls of more than nm_wcommitsize bytes of data.
(The PR refers to a writev() of 60Mbytes in size.)
I honestly have no idea how many applications do write() syscalls that
are megabytes in size, so I'm not sure how important it would be to
make it larger than "hibufspace / (desiredvnodes / 1000)", which is
about 2Mbytes on the 256Mbyte laptop I have here without any tuning
tweaks.
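
To make that concrete, here is a hypothetical userland example (the
path is made up; the 60Mbyte size is the one from the PR) of a single
syscall that would exceed nm_wcommitsize:

    #include <sys/uio.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
            /* 60 Mbytes, the writev() size mentioned in the PR */
            size_t len = 60 * 1024 * 1024;
            char *buf = calloc(1, len);
            struct iovec iov = { .iov_base = buf, .iov_len = len };
            /* hypothetical file on an NFS mount */
            int fd = open("/mnt/nfs/bigfile", O_WRONLY | O_CREAT, 0644);

            /*
             * A single writev() larger than nm_wcommitsize makes
             * VOP_WRITE() push the data synchronously instead of
             * using delayed/commit writes.
             */
            if (buf != NULL && fd != -1)
                    (void)writev(fd, &iov, 1);
            if (fd != -1)
                    close(fd);
            free(buf);
            return (0);
    }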

I think there might be a better way to do this than calculating a
fixed "guesstimate" for nm_wcommitsize and then using it for the life
of the NFS mount.
- The NFS VOP_WRITE() can keep track of a running total of how many
  bytes are being written (see the sketch below):
  - add uio_resid to this running total at the beginning of VOP_WRITE()
    and subtract it back out at the end of VOP_WRITE().
  - if this running total exceeds something like 80% of hibufspace,
    then do synchronous writes (ie. use that test instead of
        if (nm_wcommitsize < uio->uio_resid)
    to make the decision).
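
A rough sketch of what I mean (untested; the counter name
nfs_wcommit_inprog is made up for illustration):

    /*
     * Hypothetical sketch: track the total bytes currently inside
     * NFS VOP_WRITE() calls and fall back to synchronous writes
     * once that total nears hibufspace.
     */
    static volatile u_long nfs_wcommit_inprog;

    /* at the start of VOP_WRITE(): */
    ssize_t entry_resid = uio->uio_resid;
    atomic_add_long(&nfs_wcommit_inprog, entry_resid);

    /* the decision, instead of (nm_wcommitsize < uio->uio_resid): */
    if (nfs_wcommit_inprog > (u_long)hibufspace * 8 / 10)
            ioflag |= IO_SYNC;      /* do synchronous writes */

    /*
     * at the end of VOP_WRITE(): uio_resid has been consumed by
     * now, so subtract the value saved at entry.
     */
    atomic_subtract_long(&nfs_wcommit_inprog, entry_resid);

This way the limit adapts to the actual amount of write data in
flight, no matter how many vnodes are involved, instead of a fixed
per-mount guess.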

Does this sound reasonable to others?
(This is actually getting interesting. Who would have guessed that a
 divide by zero bug report would lead to this...)

rick
> There is another op that is performed without holding the vnode lock
> consistently from (old)nfs code, namely, truncation. It would be
> useful to fix this. Please see r188386.
