kern/178997: Heavy disk I/O may hang system
Klaus Weber
fbsd-bugs-2013-1 at unix-admin.de
Sun Jul 21 18:00:01 UTC 2013
The following reply was made to PR kern/178997; it has been noted by GNATS.
From: Klaus Weber <fbsd-bugs-2013-1 at unix-admin.de>
To: Bruce Evans <brde at optusnet.com.au>
Cc: Klaus Weber <fbsd-bugs-2013-1 at unix-admin.de>,
freebsd-gnats-submit at FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Sun, 21 Jul 2013 19:42:15 +0200
Sorry for the late reply, did lots of testing. I believe to have
identified the root cause for the buffer depletion and subsequent i/o
hangs.
Have a look at the beginning of ffs_bdflush() in ffs_snapshot.c:
if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
return;
This short-circuits the entire flushing code if _this bufobj_ is using
less than vfs.dirtybufthresh buffers (by default, vfs.dirtybufthresh
is 90% of all available buffers). In other words, one bufobj may hog
up to 90% of all buffers.
This works well if only one file is being re-written. As documented
up-thread, with just one file vfs.numdirtybuffers rises until it
reaches vfs.dirtybufthresh, and then remains relatively stable at that
point, and the system will not hang.
With two (or more) files being rewritten, the code above tries to let
_each_ file have 90% of all available buffers, which lets
vfs.numdirtybuffers hit vfs.hidirtybuffers.
Even worse, with two processes competing for buffer space, often both
remain under the magic 90% mark, effectively disabling all flushing of
buffers entirely. At that point, the only way that buffers can be freed
is flushing in different code paths, or if other processes not
involved in this buffer space competion are releasing some. Sooner or
later, the system reaches a state where none of the bufobjs has more
than 90% of the buffers, and none of the other (normal) processes can
release any buffers. The only way out then is a hard reset.
With the following patch (also attached), the system performs
reasonably well and does not hang:
Index: sys/kern/vfs_bio.c
===================================================================
--- sys/kern/vfs_bio.c (revision 253339)
+++ sys/kern/vfs_bio.c (working copy)
@@ -1146,7 +1146,7 @@
if (bo->bo_dirty.bv_cnt > dirtybufthresh + 10) {
(void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread);
altbufferflushes++;
- } else if (bo->bo_dirty.bv_cnt > dirtybufthresh) {
+ } else if ( bdflush_required(bo) ) {
BO_LOCK(bo);
/*
* Try to find a buffer to flush.
@@ -1438,6 +1438,28 @@
}
/*
+ * Flushing is deemed necessary if at least one of the following is true:
+ * - bo is hogging more than dirtybufthresh buffers (previous behavior)
+ * - numdirtybuffers exceeds dirtybufthresh
+ * - we are or act as bufdaemon (TDP_NORUNNINGBUF)
+ *
+ * Used by bufbdflush (above) and ffs_bdflush (in ffs_snapshot).
+ */
+int
+bdflush_required(struct bufobj *bo)
+{
+ struct thread *td = curthread;
+
+ KASSERT(bo != NULL, ("bdflush_required: NULL bo"));
+
+ if ((bo->bo_dirty.bv_cnt > dirtybufthresh) ||
+ (numdirtybuffers > dirtybufthresh) ||
+ (td && (td->td_pflags & TDP_NORUNNINGBUF)) )
+ return(1);
+ return(0);
+ }
+
+/*
* brelse:
*
* Release a busy buffer and, if requested, free its resources. The
Index: sys/sys/bio.h
===================================================================
--- sys/sys/bio.h (revision 253339)
+++ sys/sys/bio.h (working copy)
@@ -151,6 +151,10 @@
#define physread physio
#define physwrite physio
+extern struct bufobj *bo;
+int bdflush_required(struct bufobj *bo);
+
+
#endif /* _KERNEL */
#endif /* !_SYS_BIO_H_ */
Index: sys/ufs/ffs/ffs_snapshot.c
===================================================================
--- sys/ufs/ffs/ffs_snapshot.c (revision 253339)
+++ sys/ufs/ffs/ffs_snapshot.c (working copy)
@@ -2161,7 +2161,7 @@
struct buf *nbp;
int bp_bdskip;
- if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
+ if ( !bdflush_required(bo) )
return;
td = curthread;
I have modified ffs_bdflush() in ffs_snapshot.c and the generic
bufbdflush() in vfs_bio.c, which contains similar logic. (I believe a
kernel of 5.2 vintage will have this code still in bdwrite().)
While the patch makes rewriting work reliably, it may need some
further refinements:
- Right now, it limits numdirtybuffers to dirtybufthresh. It might be
better to use hidirtybuffers as a limit instead?
- the set of conditions for which the flush is performed needs to be
reviewed by someone who understands this better than I do. Limiting
numdirtybuffers is neccessary to fix the hangs. Limiting
bo->bo_dirty.bv_cnt was what the code did before, so I left this in
as well. And finally, I'm testing for TDP_NORUNNINGBUF based on the
idea that if bufdaemon, syncer or any process acting on their behalf
sees the need to flush buffers, then it should not be up to *bdflush()
to second-guess their intention.
- the patch may have style issues
Regarding the threshold and conditions: I am still in the process of
doing some basic benchmarks with and without the patch. However, it
seems that it causes a slight performance decrease in the cases that
work well without the patch (i.e. only one process re-writing). My
guess is that it now flushes more often than absolutely
neccessary.
Performance is unchanged for cases that do not cause a huge buffer
buildup (reading and writing via separate file descriptors). Of
course, in cases that did result in hangs before and now don't, the
patch is a clear win.
I still believe there are some bugs in the clustering code that affect
this case, making the re-writing less efficient than it could be.
I wish I had the time to do more testing, but I need to bring the
server into production soon. I'm putting some notes here for
things I intended to test, but probably won't have the time to:
- in vfs_cluster.c, cluster_write_gb(), where it says "Change to
algorithm...": Also push the previous cluster when it has reached
maximum size, not only when seqcount > 0. No point waiting for
buf_daemon when the cluster cannot grow. Verify that condition
"vp->v_clen <= cursize" really tests "reached max. size". XXX
"maxclen <= cursize"? Add rate-limited printf()s!
- create a new "cluster_awrite", from DragonFly BSD:
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/9d4e78c77684f1de021eb8ae17767c210205c3c3
(plug-compatible to bawrite(), but with clustering). Do not replace
vfs_bio_awrite (yet). Then, selectively replace some bawrite() calls
with calls to cluster_awrite() to identify where current bawrite()
calls are inefficient. XXX requires mods to cluster_wbuild()!
If anyone wants me to test any additional things, or variants of the
patch above, I can still do this next weekend.
Since the system no longer hangs with the patch, I think this bug can
now be closed.
Klaus
More information about the freebsd-bugs
mailing list