kern/178997: Heavy disk I/O may hang system

Bruce Evans brde at optusnet.com.au
Mon Jun 3 21:20:03 UTC 2013


The following reply was made to PR kern/178997; it has been noted by GNATS.

From: Bruce Evans <brde at optusnet.com.au>
To: Klaus Weber <fbsd-bugs-2013-1 at unix-admin.de>
Cc: freebsd-gnats-submit at freebsd.org, freebsd-bugs at freebsd.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Tue, 4 Jun 2013 07:09:59 +1000 (EST)

 On Fri, 31 May 2013, Klaus Weber wrote:
 
 > Sorry for the late reply, testing took longer than expected.
 
 This is very interesting.  It will take me a long time to complete
 testing and replying too.  Maybe more later.
 
 > (I have combined your replies from separate mails into one, and
 > reordered some of the text.)
 
 Very readable now.
 
 > ...
 > I have now repeated the tests with several variations of block- and
 > fragment sizes. In all cases, I did two tests:
 >
 > 1) dd if=/dev/zero of=/mnt/t1/100GB-1.bin bs=100m count=1000
 > 2) bonnie++ -s 64g -n 0 -f -D -d /mnt/t1
 >    bonnie++ -s 64g -n 0 -f -D -d /mnt/t2
 >
 > The dd is simply to give a rough idea of the performance impact of the
 > fs parameters, with the two bonnie++ processes, I was mainly
 > interested in performance and hangs when both bonnie++ processes are
 > in their "Rewriting" phase. I have also tested variations where the
 > block:fragment ratio does not follow the 8:1 recommendation.
 >
 > 64/8k, kernel unpatched:
 > dd: 1218155712 bytes/sec
 > bonnie++: around 300 MB/sec, then drops to 0 and system hangs
 
 My hardware can barely do 40 MB/sec, but under FreeBSD-5.2 (modified),
 there is at most a 50% drop with 2 bonnies and no tuning, and no hangs.
 Except it hung in my debugging code for understanding a related problem.
 I had this code enabled for a month or 2 and had forgotten about it.
 Only 2 bonnies caused enough load to reach it.
 
 > 32/4k, kernel unpatched:
 > dd: 844188881 bytes/sec
 > bonnie++: jumps between 25 and 900 MB/sec, no hang
 >
 > 16/2k, kernel unpatched:
 > dd: 517996142 bytes/sec
 > bonnie++: mostly 20-50 MB/sec, with 3-10 second "bursts" of
 >                  400-650 MB/sec, no hang
 
 The non-hangs might be due to BKVASIZE not being a problem.
 
 > So while a file system created with the current (32/4k) or old (16/2k)
 > defaults does prevent the hangs, it also reduces the sequential write
 > performance to 70% and 43% of a 64/8k fs.
 
 I think bufdaemon and other write-pressure handling code just don't work
 as well as upper layers for clustering and ordering the writes.  They
 are currently broken, as shown by the hangs.  Clustering shouldn't be
 generating so much pressure.  I can explain why it does (see below).
 But random i/o can easily generate the same pressure, so bufdaemon etc.
 need to handle it.  Also, pressure should be the usual case for async
 mounts, and bufdaemon etc. should be able to do clustering and ordering
 _better_ than upper layers, since they have more blocks to work with.
 cluster_write() intentionally leaves as much as possible to bufdaemon
 in order to potentially optimize it.
 
 > In all cases, the vfs.numdirtybuffers count remained fairly small as
 > long as the bonnie++ processes were writing the testfiles. It rose to
 > vfs.hidirtybuffers (slower with only one process in "Rewriting", much
 > faster when both processes are rewriting).
 
 The slow buildup is caused by fragmentation.  If the files happen to
 be laid out contiguously, then cluster_write() acts like bawrite(),
 but better.  With your 64K-blocks and FreeBSD's 128K max cluster size,
 it normally combines just 2 64K-blocks to create a 128K-cluster, and
 does 1 write instead of 2.  After this write, both methods have the
 same number of dirty buffers (none for data, but a couple for metadata).
 In more detail, the write pattern with bonnie's 8K-blocks is:
 
    write; bdwrite
    ...
    write; bdwrite  (7 bdwrites to the same buffer)
    write; cluster_write or bawrite (buffer is now full, and dirty iff clusterw)
    write; bdwrite
    ...
    write; bdwrite  (7 bdwrites to the next buffer)
    write; cluster_write or bawrite (2 buffers are now full and both are clean)
 
 However, with fragmentation, cluster_write() "leaks" a dirty buffer
 although it is full, while bawrite() just writes all full buffers.  The
 leakage is because cluster_write() wants physical contiguity.  This is
 too much to ask for.  The leak may be fast if the file system is very
 fragmented.
 
 I happened to test on a file system which was almost full and thus very
 fragmented.  Two test files of size 256M each had 452 extents (counting
 indirect blocks as an extent although they might be contiguous).  The
 buildup with my test program (which tries to produce bonnie's i/o pattern)
 was often to the same magic number (something like 768, a little higher
 than the number of extents).
 
 This much leakage was enough to make clustering a pessimization, although
 it worked to combine 4 blocks in most cases (my fs block size is 16K and
 my max cluster size is 64K).  The disks are old, so 16K is as good an i/o
 size as 64K for them.  But smaller sizes wouldn't be.
 
 >>> 00-04-57.log:vfs.numdirtybuffers: 52098
 >>> 00-05-00.log:vfs.numdirtybuffers: 52096
 >>> [ etc. ]
 >>
 >> This is a rather large buildup and may indicate a problem.  Try reducing
 >> the dirty buffer watermarks.
 >
 > I have tried to tune system parameters as per your advice, in an attempt
 > to get a 64/8k fs running stable and with reasonable write performance.
 > (dd: results omitted for brevity, all normal for 64/8k fs)
 
 I got mostly worse behaviour by tuning things related to dirtybuffers.
 
 > [... You managed to fix the hangs, at a cost of too much performance. ]
 
 > By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
 > I found that
 > (vfs.dirtybufthresh) * (number of bonnie++ processes) must be slightly
 > less than vfs.hidirtybuffers for reasonable performance without hangs.
 >
 > vfs.numdirtybuffers rises to
 > (vfs.dirtybufthresh) * (number of bonnie++ processes)
 > and as long as this is below vfs.hidirtybuffers, the system will not
 > hang.
 
 I wonder why it is linear in the number of processes.  I saw indications
 of similar behaviour (didn't test extensively).  1 bonnie created
 about 2/3 of vfs.hidirtybuffers and 2 bonnies saturated at
 vfs.hidirtybuffers.  This is with vfs.hidirtybuffers much smaller than
 yours.
 
 
 > ...
 > I was not sure whether to disable the "bawrite(bp);" in the else part
 > as well. Here is what I used for the next test (in ffs_write):
 >
 >  } else if (vm_page_count_severe() ||
 >              buf_dirty_count_severe() ||
 >              (ioflag & IO_ASYNC)) {
 >          bp->b_flags |= B_CLUSTEROK;
 >          bawrite(bp);
 
 Note that this bawrite() is used when hidirtybuffers is reached.  This
 is OK for this buffer if it is large, but it doesn't stop
 the flood -- the system stays saturated at hidirtybuffs until bonnie
 stops writing.  This has a bad effect on other parts of the system,
 e.g., if they are writing small buffers.
 
 >          /* KWKWKW       } else if (xfersize + blkoffset == fs->fs_bsize) {
 >          if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) {
 >                  bp->b_flags |= B_CLUSTEROK;
 >                  cluster_write(vp, bp, ip->i_size, seqcount);
 >          } else {
 >                  bawrite(bp);
 >                  } KWKWKW */
 
 OK.
 
 I tried a bdwrite() for the async mount case.  Async mounts should give
 delayed writes always, but it is a bugfeature that they give delayed
 writes for critical metadata but normally asynchronous writes for data
 (via cluster_write()).  Using bdwrite() here let me control the behaviour
 using a combination of mount flags.  It became clearer that the
 bugfeature is not just a bug.  Without cluster_write() forcing out
 writes for async mounts, the write pressure is much the same as from
 bonnie rewrite.
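
 A minimal sketch of that experiment (hypothetical; keyed on MNT_ASYNC
 and placed where the ffs_write() snippet above calls cluster_write()):

 	} else if (xfersize + blkoffset == fs->fs_bsize) {
 		if (vp->v_mount->mnt_flag & MNT_ASYNC)
 			/* Honour delayed writes for data on async mounts. */
 			bdwrite(bp);
 		else if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) {
 			bp->b_flags |= B_CLUSTEROK;
 			cluster_write(vp, bp, ip->i_size, seqcount);
 		} else
 			bawrite(bp);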
 
 > During the dd tests, iostat shows a weird, sawtooth-like behavior:
 > 64.00 26730 1670.61   0  0 12  3 85
 > ...
 
 Some bad interaction of bufdaemon with the writers.  I believe there are
 locking problems, but didn't know anything about them until recently.
 FreeBSD-5.2 is simpler and seems to have fewer of them.  I got a deadlock
 in it by putting a sleep in the above to wait for the write pressure
 to clear: bufdaemon is apparently prevented from doing anything while
 we hold a lock.
 
 > ...
 > After reverting the source change, I have decided to try mounting the
 > file system with "-o noclusterr,noclusterw", and re-test. This is
 > equivalent to disabling only the if-part of the expression in the
 > source snippet above.
 
 Good idea.
 
 > dd: 1206199580 bytes/sec
 > bonnie++: 550-700 MB/sec, no hang
 >
 > During the tests, vfs.numdirtybuffers remains low, lo/hidirtybuffers
 > and dirtybufthresh are at their defaults:
 > vfs.dirtybufthresh: 46925
 > vfs.hidirtybuffers: 52139
 > vfs.lodirtybuffers: 26069
 > vfs.numdirtybuffers: 15
 >
 > So it looks like you were spot-on by suspecting cluster_write().
 
 Not for the reasons that I suspected it!
 
 > Further tests confirmed that "-o noclusterw" is sufficient to prevent
 > the hangs and provide good performance. "-o noclusterr" on its own
 > makes no difference; the system will hang.
 
 I thought -noclusterw would be worse until now...
 
 > I have also tested with BKVASIZE set to 65536. As you explained, this
 > reduced the number of buffers:
 > vfs.dirtybufthresh: 11753
 > vfs.hidirtybuffers: 13059
 > vfs.lodirtybuffers: 6529
 >
 > dd results remain unchanged from a BKVASIZE of 16k. bonnie++'s iostats
 > with 2 processes in "Rewriting..." jump between 70 and 800 MB/sec and
 > numdirtybuffers reaches max:
 >
 > vfs.numdirtybuffers: 13059
 >
 > Even though numdirtybuffers reaches hidirtybuffers, the system does
 > not hang, but performance is not very good.
 
 ... I hoped BKVASIZE would be the problem, but it is at most a secondary
 problem.
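 (The numbers scale as expected: raising BKVASIZE from 16K to 64K cuts
 the number of buffer kva slots, and hence the watermarks, by a factor
 of 4; 52139 / 4 = 13035, close to the 13059 above.)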
 
 > Here is the relevant part from bonnie++'s source (in C++):
 > ...
 > So you are correct: bonnie++ re-reads the file that was created
 > previously in the "Writing intelligently..." phase in blocks, modifies
 > one byte in the block, and writes the block back.
 >
 > Something in this specific workload is triggering the huge buildup of
 > numdirtybuffers when write-clustering is enabled.
 
 I can explain this.  Readers and writers share the offset for the
 sequential heuristic (it is per-open-file), so cluster_write() cannot
 tell that bonnie's writes are sequential.  The sequential_heuristic()
 function sees the offset moving back and forth, which normally means
 random access, so it tells cluster_write() that the access is random,
 although bonnie seeks back after each read() so that all the
 writes() are sequential.  cluster_write() rarely if ever pushes out
 random writes, so the write pressure builds up to a saturation value
 almost instantly.  Then all subsequent writes work even more poorly
 due to the pressure.  Truly random writes would give similar buildup
 of pressure.
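
 A toy model of this (a self-contained sketch, not the real
 sequential_heuristic(); f_nextoff/f_seqcount only mimic struct file's
 fields) shows the reads staying sequential while the interleaved
 rewrites always look random:

 %%%
 #include <stdio.h>

 #define	BSIZE	8192

 static long f_nextoff;		/* shared by reads and writes */
 static int f_seqcount = 1;

 static int
 heuristic(long offset, long resid)
 {
 	int seq;

 	if (offset == f_nextoff) {
 		if (f_seqcount < 127)
 			f_seqcount++;
 		seq = f_seqcount;	/* looks sequential */
 	} else {
 		f_seqcount = 1;		/* offset jumped: looks random */
 		seq = 0;
 	}
 	f_nextoff = offset + resid;
 	return (seq);
 }

 int
 main(void)
 {
 	long off;

 	/* bonnie: read a block, seek back, rewrite it in place. */
 	for (off = 0; off < 4 * BSIZE; off += BSIZE) {
 		printf("read  %7ld: seq %d\n", off, heuristic(off, BSIZE));
 		printf("write %7ld: seq %d\n", off, heuristic(off, BSIZE));
 	}
 	return (0);
 }
 %%%

 Each read advances the shared offset past the block, so the write that
 follows the seek back never matches it and is classified as random.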
 
 > I am now looking at vfs_cluster.c to see whether I can find which part
 > is responsible for letting numdirtybuffers rise without bounds and
 > why only *re* writing a file causes problems, not the initial
 > writing. Any suggestions on where to start looking are very welcome.
 
 It is very complicated, but it was easy to find its comments saying that
 it tries not to force out the writes for non-sequential accesses.  I
 am currently trying the following workarounds:
 
 % diff -u2 vfs_cluster.c~ vfs_cluster.c
 % --- vfs_cluster.c~	Tue Apr  6 20:11:55 2004
 % +++ vfs_cluster.c	Tue Jun  4 04:53:52 2013
 % @@ -593,4 +714,6 @@
 %  }
 % 
 % +int	buf_dirty_count_large(void);
 % +
 %  /*
 %   * Do clustered write for FFS.
 
 This is like buf_dirty_count_severe() except it has a lower threshold.
 It doesn't really help.
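
 The diff only shows the declaration; a minimal guess at the body,
 assuming the same shape as buf_dirty_count_severe() but with the lower
 watermark (lodirtybuffers, the default mentioned below):

 	int
 	buf_dirty_count_large(void)
 	{

 		return (numdirtybuffers >= lodirtybuffers);
 	}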
 
 % @@ -630,4 +753,44 @@
 %  		vp->v_lasta = vp->v_clen = vp->v_cstart = vp->v_lastw = 0;
 % 
 % +	/*
 % +	 * XXX the 13 year old changes in the algorithm to use seqcount
 % +	 * are too aggressive.  Avoid them by faking sequential access if
 % +	 * we are under write pressure.  This causes us to write out
 % +	 * clusters in more cases (in particular, when clusters are
 % +	 * complete).
 
 This doesn't really help.  It's too late to start faking after pressure
 has built up.
 
 % +	 *
 % +	 * XXX we should probably use a more conservative condition for
 % +	 * write pressure than is currently available.  Try not to leave
 % +	 * anything to bufdaemon, since it can too easily fail to clear
 % +	 * write pressure because the processes creating the write
 % +	 * pressure have the vnode locked.  E.g., use lodirtybuffers
 % +	 * instead of hidirtybuffers in buf_dirty_count_severe(), and
 % +	 * also check dirtybufthresh.
 
 I use lodirtybuffers as a default.  This is still too high, and almost
 any value reduces performance.
 
 % +	 *
 % +	 * XXX seqcount itself is broken.  It gets cleared by reads of
 % +	 * the same open file, even if the application seeks back after
 % +	 * reading so that writes are perfectly sequential.  This case
 % +	 * is probably unimportant, but bonnie does it.
 
 As described differently above.
 
 % +	 *
 % +	 * XXX I think that if bonnie didn't seek back then the write
 % +	 * pressure would be the same (provided the bonnie write size is
 % +	 * smaller than the fs block size).  Then the writes wouldn't be
 % +	 * sequential, so a fixed seqcount wouldn't help force them out.
 % +	 * But they need to be forced out under write pressure, and I
 % +	 * don't see anything to do it.  Check this.
 % +	 *
 % +	 * XXX the writing system shouldn't collapse when we get this
 % +	 * slightly wrong.  Why doesn't bdwrite() force things when
 % +	 * we don't?  It has some code to do so.  It is still better
 % +	 * for us to force out full clusters than to leave it to
 % +	 * the writing system to possibly force out sub-buffers.
 % +	 *
 % +	 * XXX try killing async under the same conditions.
 
 This is too late to reduce pressure from async too.
 
 % +	 */
 % +	if (vm_page_count_severe() || buf_dirty_count_large()) {
 % +		seqcount = 2;
 % +		async = 0;
 % +	}
 % +
 %  	if (vp->v_clen == 0 || lbn != vp->v_lastw + 1 ||
 %  	    (bp->b_blkno != vp->v_lasta + btodb(lblocksize))) {
 % @@ -719,4 +882,5 @@
 %  		} else {	/* Wait for rest of cluster */
 %  			vp->v_cstart = lbn;
 % +			/* XXX problem if we're async and orig maxclen = 0. */
 %  			bdwrite(bp);
 %  		}
 
 This case for async breaks the bugfeature of not honoring it.  The
 bugfeature often prevents buildup of excessive pressure, but some cases
 arrive here and then it doesn't.  This is the only old change I have in
 cluster_write().
 
 % @@ -726,8 +890,13 @@
 %  		 * are operating sequentially, otherwise let the buf or
 %  		 * update daemon handle it.
 % +		 *
 % +		 * Algorithm changeback: ignore seqcount here, at least for
 % +		 * now, to work around readers breaking it for writers.  It
 % +		 * is too late to start ignoring after write pressure builds
 % +		 * up, since not writing out here is the main source of the
 % +		 * buildup.
 %  		 */
 %  		bdwrite(bp);
 % -		if (seqcount > 1)
 % -			cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
 % +		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
 %  		vp->v_clen = 0;
 %  		vp->v_cstart = lbn + 1;
 
 This works fairly well.  Changing the algorithm back in all cases reduces
 performance, but here we have usually finished with the buffer.  Even if
 seqcount is fixed so that it isn't clobbered by reads, it might be right
 to ignore it here, so as to reduce write pressure, until write pressure is
 handled better.
 
 I used the following program to generate i/o like bonnie's:
 
 %%%
 #include <err.h>
 #include <fcntl.h>
 #include <stdint.h>
 #include <stdlib.h>
 #include <stdio.h>
 #include <unistd.h>
 
 #define	IOSIZE	8192
 
 static char buf[IOSIZE] __aligned(4096);
 
 int
 main(int argc, char **argv)
 {
  	ssize_t n;
  	int ifd, ofd;
 
  	if (argc != 2 && argc != 3)
  		errx(1, "usage: ./prog file1 [file2]");
  	ifd = open(argv[1], O_RDWR);
  	if (ifd < 0)
  		err(1, "open");
  	if (argc == 3)
  		ofd = open(argv[2], O_RDWR);
  	else
  		ofd = ifd;
 	if (ofd < 0)
  		err(1, "open");
  	for (;;) {
  		n = read(ifd, buf, IOSIZE);
  		if (n == 0) {
  			printf("finished: ioffset %#jx; ooffset = %#jx\n",
  			    (intmax_t)lseek(ifd, 0, SEEK_CUR),
  			    (intmax_t)lseek(ofd, 0, SEEK_CUR));
  			exit(0);
  		}
  		if (n < 0)
  			err(1, "read error");
  		if (n != IOSIZE)
  			errx(1, "short read %d", (int)n);
  		if (ofd == ifd)
  			lseek(ofd, -IOSIZE, SEEK_CUR);
  		n = write(ofd, buf, IOSIZE);
  		if (n < 0)
  			err(1, "write error");
  		if (n != IOSIZE)
  			errx(1, "short write %d", (int)n);
  	}
 }
 %%%
 
 Usage: create a test file zz of a suitable size (the program normally
 doesn't extend it); then ./prog zz does read/write like bonnie;
 ./prog zz zz does the same with independent file positions.
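
 For example (assuming the source is saved as prog.c): create zz with
 "dd if=/dev/zero of=zz bs=1m count=256", build with "cc -o prog prog.c",
 and run "./prog zz".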
 
 Bruce

