kern/178997: Heavy disk I/O may hang system

Bruce Evans brde at optusnet.com.au
Tue May 28 12:10:02 UTC 2013


The following reply was made to PR kern/178997; it has been noted by GNATS.

From: Bruce Evans <brde at optusnet.com.au>
To: Klaus Weber <fbsd-bugs-2013-1 at unix-admin.de>
Cc: Bruce Evans <brde at optusnet.com.au>, freebsd-gnats-submit at FreeBSD.org, 
    freebsd-bugs at FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Tue, 28 May 2013 22:03:10 +1000 (EST)

 On Mon, 27 May 2013, Klaus Weber wrote:
 
 > On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
 >> On Sun, 26 May 2013, Klaus Weber wrote:
 >>>> Description:
 >>> Heavy disk I/O (two bonnie++ processes working on the same disk
 >>> simultaneously) causes an extreme degradation in disk throughput (combined
 >>> throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows
 >>> when both bonnie++ processes are in the "Rewriting..." phase.
 >
 >> Please use the unix newline character in mail.
 >
 > My apologies. I submitted the report via the web-interface and did not
 > realize that it would come out this way.
 
 Thanks.  The log output somehow came out right.
 
 >> I found that
 >> the problem could be fixed by killing cluster_write() by turning it into
 >> bdwrite() (by editing the running kernel using ddb, since this is easier
 >> than rebuilding the kernel).  I was trying many similar things since I
 >> had a theory that cluster_write() is useless.  [...]
 >
 > If that would provide a useful datapoint, I could try whether that makes
 > a difference on my system. What changes would be required to test this?
 >
 > Surely it's not as easy as replacing the function body of
 > cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?
 
 That should work for testing, but it is safer to edit ffs_write()
 and remove the block where it calls cluster_write() (or bawrite()),
 so that it falls through to call bdwrite() in most cases.
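 
 As a rough sketch of the intent (the exact code in ffs_write() in
 sys/ufs/ffs/ffs_vnops.c differs a bit between versions, so treat this as
 schematic, not a drop-in patch):
 
 	} else if (xfersize + blkoffset == fs->fs_bsize) {
 		/*
 		 * Normally the full-block case does
 		 * cluster_write(vp, bp, ip->i_size, seqcount);
 		 * for the test, skip clustering and use a plain
 		 * delayed write, like the partial-block case.
 		 */
 		bdwrite(bp);
 	}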
 
 >> My theory for what the bug is is that
 >> cluster_write() and cluster_read() share the limited resource of pbufs.
 >> pbufs are not managed as carefully as normal buffers.  In particular,
 >> there is nothing to limit write pressure from pbufs like there is for
 >> normal buffers.
 >
 > Is there anything I can do to confirm or rebut this? Is the number of
 > pbufs in use visible via a sysctl, or could I add debug printfs that
 > are triggered when certain limits are reached?
 
 Here I don't really know what to look for.  First add a sysctl to read
 the number of free pbufs.  The variable for this is cluster_pbuf_freecnt
 in vm.
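 
 Untested, but something like the following should do (the variable
 already exists; the sysctl name and where to put it, e.g. in
 sys/kern/vfs_cluster.c, are arbitrary):
 
 	extern int cluster_pbuf_freecnt;	/* pbufs free for clustering */
 
 	SYSCTL_INT(_vfs, OID_AUTO, cluster_pbuf_free, CTLFLAG_RD,
 	    &cluster_pbuf_freecnt, 0, "pbufs free for cluster i/o");
 
 Then watch vfs.cluster_pbuf_free while both bonnie++ processes are in
 their rewriting phase to see whether the pool is being exhausted.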
 
 >>> newfs -b 64k -f 8k /dev/da0p1
 >>
 >> The default for newfs is -b 32k.  This asks for buffer cache fragmentation.
 >> Someone increased the default from 16k to 32k without changing the buffer
 >> cache's preferred size (BKVASIZE = 16K).  BKVASIZE has always been too
 >> small, but on 32-bit arches kernel virtual memory is too limited to have
 >> a larger BKVASIZE by default.  BKVASIZE is still 16K on all arches
 >> although this problem doesn't affect 64-bit arches.
 >>
 >> -b 64k is worse.
 >
 > Thank you for this explanation. I was not aware that -b 64k (or even
 > the default values to newfs) would have this effect. I will repeat the
 > tests with 32/4k and 16/2k, although I seem to remember that 64/8k
 > provided a significant performance boost over the defaults. This, and
 > the reduced fsck times was my original motivation to go with the
 > larger values.
 
 The reduced fsck time and perhaps the reduced number of cylinder groups
 are the main advantages of large fs blocks.  vfs-level clustering turns
 most physical i/o's into 128K blocks (especially for large files), so
 i/o speed differs little across fs block sizes unless the fs block size
 is very small.
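 
 (The 128K comes from MAXPHYS, the limit on a single physical transfer;
 sys/sys/param.h defines it by default as
 
 	#define MAXPHYS		(128 * 1024)	/* max raw I/O transfer size */
 
 and clustering coalesces sequential fs blocks up to that size whatever
 the fs block size is.)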
 
 > Given the potentially drastic effects of block sizes other than 16/2k,
 > maybe a warning should be added to the newfs manpage? I only found the
 > strong advice to maintain an 8:1 block:fragment ratio.
 
 Once the kernel misconfiguration is understood well enough for such a
 warning not to be FUD, adding one should be easy.
 
 >>> When both bonnie++ processes are in their "Rewriting" phase, the system
 >>> hangs within a few seconds. Both bonnie++ processes are in state "nbufkv".
 >>> bufdaemon takes about 40% CPU time and is in state "qsleep" when not
 >>> active.
 >>
 >> You got the buffer cache fragmentation that you asked for.
 >
 > Looking at vfs_bio.c, I see that it has defrag-code in it. Should I
 > try adding some debug output to this code to get some insight into why
 > this does not work, or is not as effective as it should be?
 
 Don't start there, since it is complicated and timing-dependent.  Maybe
 add some printfs to make it easy to see when it enters and leaves defrag
 mode.
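 
 Schematically: next to the places in getnewbuf() (sys/kern/vfs_bio.c)
 where its local defrag flag is set and cleared (assuming your version
 still has one), add something like
 
 	printf("getnewbuf: entering defrag mode\n");
 
 and
 
 	printf("getnewbuf: leaving defrag mode\n");
 
 respectively.  That is enough to see how often defragmentation happens
 and how long each episode lasts.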
 
 >> Apparently you found a way to reproduce the serious fragmentation
 >> problems.
 >
 > A key factor seems to be the "Rewriting" operation. I see no problem
 > during the "normal" writing, nor could I reproduce it with concurrent
 > dd runs.
 
 I don't know exactly what bonnie's rewrite mode does.  Is it just read/
 [modify]/write of sequential blocks with a fairly small block size?
 Old bonnie docs say that the block size is always 8K, which is one
 reason I don't like bonnie.  Clustering should work fairly normally with
 that.
 Anything with random seeks would break clustering.
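 
 If it is that, then each chunk is handled roughly like this (schematic
 only; bonnie++'s actual chunk size and bookkeeping may differ):
 
 	/* sequential read/modify/rewrite-in-place, 8K at a time */
 	char buf[8192];
 	off_t off = 0;
 	ssize_t n;
 
 	while ((n = pread(fd, buf, sizeof(buf), off)) == (ssize_t)sizeof(buf)) {
 		buf[0]++;					/* modify */
 		(void)pwrite(fd, buf, sizeof(buf), off);	/* rewrite */
 		off += n;
 	}
 
 Every buffer read is immediately redirtied, which would explain dirty
 buffers piling up much faster than in the plain write phase.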
 
 >> Increasing BKVASIZE would take more work than this, since although it
 >> was intended to be a system parameter which could be changed to reduce
 >> the fragmentation problem, one of the many bugs in it is that it was
 >> never converted into a "new" kernel option.  Another of the bugs in
 >> it is that doubling it halves the number of buffers, so doubling it
 >> does more than use twice as much kva.  This severely limited the number
 >> of buffers back when memory sizes were 64MB.  It is not a very
 >> significant limitation if the memory size is 1GB or larger.
 >
 > Should I try to experiment with BKVASIZE of 65536? If so, can I
 > somehow up the number of buffers again? Also, after modifying
 > BKVASIZE, is it sufficient to compile and install a new kernel, or do
 > I have to build and install the entire world?
 
 Just the kernel, but changing sys/param.h will make most of the world
 want to recompile itself according to dependencies.  I don't like rebuilding
 things, and often set timestamps in header files back to what they were
 to avoid rebuilding (after rebuilding only the object files that actually
 depend on the change).  Use this hack with caution, or rebuild kernels in
 a separate tree that doesn't affect the world.
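 
 The change itself is a one-liner in sys/sys/param.h, something like
 
 	#define	BKVASIZE	65536		/* was 16384 */
 
 Since nbuf is scaled down in inverse proportion to BKVASIZE, going from
 16K to 64K gives roughly a quarter as many buffers unless nbuf is set
 explicitly (there is a kern.nbuf loader tunable) to compensate.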
 
 >>> [second bonnie goes Rewriting as well]
 >>> 00-04-24.log:vfs.numdirtybuffers: 11586
 >>> 00-04-25.log:vfs.numdirtybuffers: 16325
 >>> 00-04-26.log:vfs.numdirtybuffers: 24333
 >>> ...
 >>> 00-04-54.log:vfs.numdirtybuffers: 52096
 >>> 00-04-57.log:vfs.numdirtybuffers: 52098
 >>> 00-05-00.log:vfs.numdirtybuffers: 52096
 >>> [ etc. ]
 >>
 >> This is a rather large buildup and may indicate a problem.  Try reducing
 >> the dirty buffer watermarks.  Their default values are mostly historical
 >> nonsense.
 >
 > You mean the vfs.(hi|lo)dirtybuffers? Will do. What would be
 > reasonable starting values for experimenting? 800/200?
 
 1000 or 10000 (if nbuf is 50000).  1000 is probably too conservative, but
 I think it is plenty for most loads.
 
 Bruce

