kern/178997: Heavy disk I/O may hang system

Klaus Weber fbsd-bugs-2013-1 at unix-admin.de
Sun Jul 21 18:00:01 UTC 2013


The following reply was made to PR kern/178997; it has been noted by GNATS.

From: Klaus Weber <fbsd-bugs-2013-1 at unix-admin.de>
To: Bruce Evans <brde at optusnet.com.au>
Cc: Klaus Weber <fbsd-bugs-2013-1 at unix-admin.de>,
	freebsd-gnats-submit at FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Sun, 21 Jul 2013 19:42:15 +0200

 Sorry for the late reply; I did a lot of testing. I believe I have
 identified the root cause of the buffer depletion and the subsequent
 I/O hangs.
 
 Have a look at the beginning of ffs_bdflush() in ffs_snapshot.c:
 
          if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
                  return;
 
 This short-circuits the entire flushing code if _this bufobj_ is using
 fewer than vfs.dirtybufthresh buffers (by default, vfs.dirtybufthresh
 is 90% of all available buffers). In other words, a single bufobj may
 hog up to 90% of all buffers.
 
 This works well as long as only one file is being rewritten. As
 documented up-thread, with just one file vfs.numdirtybuffers rises
 until it reaches vfs.dirtybufthresh, then remains relatively stable at
 that level, and the system does not hang.
 
 With two (or more) files being rewritten, the code above allows _each_
 file to accumulate up to 90% of all available buffers, which lets
 vfs.numdirtybuffers hit vfs.hidirtybuffers.
 
 Even worse, with two processes competing for buffer space, both often
 remain under the magic 90% mark, which effectively disables this
 flushing path entirely. At that point, buffers can only be freed by
 flushes in different code paths, or by other processes not involved
 in the buffer space competition releasing some. Sooner or later, the
 system reaches a state where none of the bufobjs holds more than 90%
 of the buffers and none of the other (normal) processes can release
 any buffers. The only way out then is a hard reset.
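 
 To see the arithmetic at work, here is a minimal userland sketch (not
 kernel code; the threshold values are made up, only their relationship
 matters) of two writers dirtying buffers under the old per-bufobj
 check:
 
  #include <stdio.h>
 
  #define HIDIRTYBUFFERS  1000    /* stand-in for vfs.hidirtybuffers */
  #define DIRTYBUFTHRESH   900    /* 90%, as with vfs.dirtybufthresh */
 
  int
  main(void)
  {
          int bo1 = 0, bo2 = 0;           /* per-bufobj dirty counts */
          int numdirtybuffers = 0;        /* global dirty count */
 
          /*
           * Two writers dirty buffers in turn.  Neither bufobj ever
           * crosses the per-bufobj threshold, so the check at the top
           * of ffs_bdflush() never triggers a flush.
           */
          while (bo1 <= DIRTYBUFTHRESH && bo2 <= DIRTYBUFTHRESH) {
                  bo1++; bo2++;
                  numdirtybuffers += 2;
                  if (numdirtybuffers >= HIDIRTYBUFFERS) {
                          printf("pool exhausted: bo1=%d bo2=%d "
                              "numdirtybuffers=%d\n",
                              bo1, bo2, numdirtybuffers);
                          return (0);     /* in the kernel: hang */
                  }
          }
          return (0);
  }
 
 This prints "pool exhausted" with both per-bufobj counts at 500, far
 below the 900 that would have triggered a flush.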
 
 
 With the following patch (also attached), the system performs
 reasonably well and does not hang:
 
 Index: sys/kern/vfs_bio.c
 ===================================================================
 --- sys/kern/vfs_bio.c  (revision 253339)
 +++ sys/kern/vfs_bio.c  (working copy)
 @@ -1146,7 +1146,7 @@
         if (bo->bo_dirty.bv_cnt > dirtybufthresh + 10) {
                 (void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread);
                 altbufferflushes++;
 -       } else if (bo->bo_dirty.bv_cnt > dirtybufthresh) {
 +       } else if (bdflush_required(bo)) {
                 BO_LOCK(bo);
                 /*
                  * Try to find a buffer to flush.
 @@ -1438,6 +1438,28 @@
  }
  
  /*
 + * Flushing is deemed necessary if at least one of the following is true:
 + * - bo is hogging more than dirtybufthresh buffers (previous behavior)
 + * - numdirtybuffers exceeds dirtybufthresh
 + * - we are or act as bufdaemon (TDP_NORUNNINGBUF)
 + *
 + * Used by bufbdflush (above) and ffs_bdflush (in ffs_snapshot).
 + */
 +int
 +bdflush_required(struct bufobj *bo)
 +{
 +        struct thread *td = curthread;
 +
 +        KASSERT(bo != NULL, ("bdflush_required: NULL bo"));
 +
 +        if (bo->bo_dirty.bv_cnt > dirtybufthresh ||
 +            numdirtybuffers > dirtybufthresh ||
 +            (td != NULL && (td->td_pflags & TDP_NORUNNINGBUF)))
 +                return (1);
 +        return (0);
 +}
 +
 +/*
   *     brelse:
   *
   *     Release a busy buffer and, if requested, free its resources.  The
 Index: sys/sys/bio.h
 ===================================================================
 --- sys/sys/bio.h       (revision 253339)
 +++ sys/sys/bio.h       (working copy)
 @@ -151,6 +151,10 @@
  #define physread physio
  #define physwrite physio
  
 +struct bufobj;
 +
 +int bdflush_required(struct bufobj *bo);
 +
  #endif /* _KERNEL */
  
  #endif /* !_SYS_BIO_H_ */
 Index: sys/ufs/ffs/ffs_snapshot.c
 ===================================================================
 --- sys/ufs/ffs/ffs_snapshot.c  (revision 253339)
 +++ sys/ufs/ffs/ffs_snapshot.c  (working copy)
 @@ -2161,7 +2161,7 @@
         struct buf *nbp;
         int bp_bdskip;
  
 -       if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
 +       if (!bdflush_required(bo))
                 return;
  
         td = curthread;
 
 
 I have modified ffs_bdflush() in ffs_snapshot.c and the generic
 bufbdflush() in vfs_bio.c, which contains similar logic. (I believe a
 kernel of 5.2 vintage will still have this code in bdwrite().)
 
 While the patch makes rewriting work reliably, it may need some
 further refinements:
 
 - Right now, the patch limits numdirtybuffers to dirtybufthresh. It
   might be better to use hidirtybuffers as the limit instead (see the
   sketch after this list).
 
 - The set of conditions under which the flush is performed needs to be
   reviewed by someone who understands this better than I do. Limiting
   numdirtybuffers is necessary to fix the hangs. Limiting
   bo->bo_dirty.bv_cnt is what the code did before, so I left that in
   as well. Finally, I test for TDP_NORUNNINGBUF based on the idea
   that if bufdaemon, the syncer, or any process acting on their behalf
   sees the need to flush buffers, then it should not be up to
   *bdflush() to second-guess their intention.
 
 - The patch may have style issues.
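 
 As a sketch of the first point, assuming the same surrounding code,
 bdflush_required() could compare the global count against the higher
 watermark instead (untested; the right threshold is exactly what needs
 review):
 
  int
  bdflush_required(struct bufobj *bo)
  {
          struct thread *td = curthread;
 
          KASSERT(bo != NULL, ("bdflush_required: NULL bo"));
 
          /* Per-bufobj limit, as before. */
          if (bo->bo_dirty.bv_cnt > dirtybufthresh)
                  return (1);
          /* Global limit: hidirtybuffers rather than dirtybufthresh. */
          if (numdirtybuffers > hidirtybuffers)
                  return (1);
          /* Defer to bufdaemon/syncer and processes acting for them. */
          if (td != NULL && (td->td_pflags & TDP_NORUNNINGBUF))
                  return (1);
          return (0);
  }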
 
 Regarding the threshold and conditions: I am still in the process of
 doing some basic benchmarks with and without the patch. However, it
 seems that the patch causes a slight performance decrease in the cases
 that work well without it (i.e. only one process rewriting). My guess
 is that it now flushes more often than absolutely necessary.
 
 Performance is unchanged for cases that do not cause a huge buffer
 buildup (reading and writing via separate file descriptors). Of
 course, in cases that did result in hangs before and now don't, the
 patch is a clear win.
 
 
 I still believe there are some bugs in the clustering code that affect
 this case, making the rewriting less efficient than it could be.
 
 I wish I had the time to do more testing, but I need to bring the
 server into production soon. I'm putting some notes here for
 things I intended to test, but probably won't have the time to:
 
 - in vfs_cluster.c, cluster_write_gb(), where it says "Change to
   algorithm...": Also push the previous cluster when it has reached
   its maximum size, not only when seqcount > 0. There is no point in
   waiting for buf_daemon when the cluster cannot grow. Verify that the
   condition "vp->v_clen <= cursize" really tests "reached max. size".
   XXX "maxclen <= cursize"? Add rate-limited printf()s! (See the
   sketch after this list.)
 
 - create a new cluster_awrite(), ported from DragonFly BSD:
   http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/9d4e78c77684f1de021eb8ae17767c210205c3c3
   (plug-compatible with bawrite(), but with clustering). Do not
   replace vfs_bio_awrite() (yet). Then, selectively replace some
   bawrite() calls with calls to cluster_awrite() to identify where the
   current bawrite() calls are inefficient. XXX requires mods to
   cluster_wbuild()!
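 
 For the first note above, a rough sketch of the intended change in
 cluster_write_gb(), written as a pseudo-diff (hypothetical and
 untested; it assumes the existing "push the previous cluster" call
 site, and the exact condition is what needs verifying):
 
  -               if (!async && seqcount > 0) {
  +               /* XXX also push when the cluster cannot grow any    */
  +               /* XXX further; is "vp->v_clen <= cursize" the right */
  +               /* XXX test, or "maxclen <= cursize"?                */
  +               if (!async && (seqcount > 0 || vp->v_clen <= cursize)) {
                          cluster_wbuild_wb(...);         /* unchanged */
                  }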
 
 
 If anyone wants me to test any additional things, or variants of the
 patch above, I can still do this next weekend. 
 
 Since the system no longer hangs with the patch, I think this bug can
 now be closed.
 
 Klaus

