kern/178997: Heavy disk I/O may hang system

Bruce Evans brde at optusnet.com.au
Mon Jun 3 21:10:11 UTC 2013


On Fri, 31 May 2013, Klaus Weber wrote:

> Sorry for the late reply, testing took longer than expected.

This is very interesting.  It will take me a long time to complete
testing and replying too.  Maybe more later.

> (I have combined your replies from separate mails into one, and
> reordered some of the text.)

Very readable now.

> ...
> I have now repeated the tests with several variations of block- and
> fragment sizes. In all cases, I did two tests:
>
> 1) dd if=/dev/zero of=/mnt/t1/100GB-1.bin bs=100m count=1000
> 2) bonnie++ -s 64g -n 0 -f -D -d /mnt/t1
>    bonnie++ -s 64g -n 0 -f -D -d /mnt/t2
>
> The dd is simply to give a rough idea of the performance impact of the
> fs parameters, with the two bonnie++ processes, I was mainly
> interested in performance and hangs when both bonnie++ processes are
> in their "Rewriting" phase. I have also tested variations where the
> block:fragment ratio does not follow the 8:1 recommendation.
>
> 64/8k, kernel unpatched:
> dd: 1218155712 bytes/sec
> bonnie++: around 300 MB/sec, then drops to 0 and system hangs

My hardware can barely do 40 MB/sec, but under FreeBSD-5.2 (modified),
there is at most a 50% drop with 2 bonnies and no tuning, and no hangs.
Except that it hung in debugging code I had added to understand a related
problem.  I had had that code enabled for a month or 2 and had forgotten
about it.  Only 2 bonnies caused enough load to reach it.

> 32/4k, kernel unpatched:
> dd: 844188881 bytes/sec
> bonnie++: jumps between 25 and 900 MB/sec, no hang
>
> 16/2k, kernel unpatched:
> dd: 517996142 bytes/sec
> bonnie++: mostly 20-50 MB/sec, with 3-10 second "bursts" of
>                  400-650 MB/sec, no hang

The non-hangs might be due to BKVASIZE not being a problem.

> So while a file system created with the current (32/4k) or old (16/2k)
> defaults  does prevent the hangs, it also reduces the sequential write
> performance to 70% and 43% of an 64/8k fs.

I think bufdaemon and other write-pressure handling code just don't work
as well as upper layers for clustering and ordering the writes.  They
are currently broken, as shown by the hangs.  Clustering shouldn't be
generating so much pressure.  I can explain why it does (see below).
But random i/o can easily generate the same pressure, so bufdaemon etc.
need to handle it.  Also, pressure should be the usual case for async
mounts, and bufdaemon etc. should be able to do clustering and ordering
_better_ than upper layers, since they have more blocks to work with.
cluster_write() intentionally leaves as much as possible to bufdaemon
in order to potentially optimize it.

> In all cases, the vfs.numdirtybuffers count remained fairly small as
> long as the bonnie++ processes were writing the testfiles. It rose to
> vfs.hidirtybuffers (slower with only one process in "Rewriting", much
> faster when both processes are rewriting).

The slow buildup is caused by fragmentation.  If the files happen to
be laid out contiguously, then cluster_write() acts like bawrite(),
but better.  With your 64K-blocks and FreeBSD's 128K max cluster size,
it normally combines just 2 64K-blocks to create a 128K-cluster, and
does 1 write instead of 2 (the arithmetic is sketched after the pattern
below).  After this write, both methods leave the same number of dirty
buffers (none for data, but a couple for metadata).
In more detail, the write pattern with bonnie's 8K-blocks is:

   write; bdwrite
   ...
   write; bdwrite  (7 bdwrites to the same buffer)
   write; cluster_write or bawrite (buffer is now full, and dirty iff clusterw)
   write; bdwrite
   ...
   write; bdwrite  (7 bdwrites to the next buffer)
   write; cluster_write or bawrite (2 buffers are now full and both are clean)
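
Here is that arithmetic as a throwaway userland sketch (mnt_iosize_max is
the usual 128K limit; the variable names only loosely follow vfs_cluster.c):

%%%
#include <stdio.h>

/*
 * Worked example of the cluster-size arithmetic: with 64K fs blocks and
 * a 128K i/o size limit, cluster_write() can combine only 2 buffers,
 * i.e. 16 of bonnie's 8K write()s end in one 128K cluster write.
 */
int
main(void)
{
	long iosize_max = 128 * 1024;	/* mnt_iosize_max */
	long lblocksize = 64 * 1024;	/* fs block size */
	long maxclen = iosize_max / lblocksize - 1;

	printf("buffers per cluster: %ld\n", maxclen + 1);
	printf("8K writes per cluster write: %ld\n",
	    (maxclen + 1) * lblocksize / 8192);
	return (0);
}
%%%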

However, with fragmentation, cluster_write() "leaks" a dirty buffer
although it is full, while bawrite() just writes all full buffers.  The
leakage is because cluster_write() wants physical contiguity.  This is
too much to ask for.  The leak may be fast if the file system is very
fragmented.
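
For reference, a simplified, userland-compilable sketch of the contiguity
test involved (the real test is visible in the diff context further below;
the struct and function names here are made up):

%%%
#include <stdbool.h>
#include <stdint.h>

#define	DEV_BSIZE	512	/* disk block size, as in the kernel */

struct vnclust {
	int64_t	v_lastw;	/* logical block number of the last write */
	int64_t	v_lasta;	/* physical block address of that buffer */
	int	v_clen;		/* buffers still expected in this cluster */
};

/*
 * True when the new buffer cannot extend the cluster being collected,
 * i.e. when it is not both logically and physically contiguous with the
 * previous one.  On a fragmented fs this fires even for perfectly
 * sequential writes, and full buffers can then be left dirty (the
 * "leak" above) instead of being pushed out as bawrite() would do.
 */
bool
breaks_cluster(const struct vnclust *vp, int64_t lbn, int64_t blkno,
    long lblocksize)
{
	return (vp->v_clen == 0 || lbn != vp->v_lastw + 1 ||
	    blkno != vp->v_lasta + lblocksize / DEV_BSIZE);
}
%%%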

I happened to test on a file system which was almost full and thus very
fragmented.  Two test files of size 256M each had 452 extents (counting
indirect blocks as an extent although they might be contiguous).  The
buildup with my test program (which tries to produce bonnie's i/o pattern)
was often to the same magic number (something like 768, a little higher
than the number of extents).

This much leakage was enough to make clustering a pessimization, although
it worked to combine 4 blocks in most cases (my fs block size is 16K and
my max cluster size is 64K).  The disks are old, so 16K is as good an i/o
size as 64K for them.  But smaller sizes wouldn't be.

>>> 00-04-57.log:vfs.numdirtybuffers: 52098
>>> 00-05-00.log:vfs.numdirtybuffers: 52096
>>> [ etc. ]
>>
>> This is a rather large buildup and may indicate a problem.  Try reducing
>> the dirty buffer watermarks.
>
> I have tried to tune system parameters as per your advice, in an attempt
> to get a 64/8k fs running stable and with reasonable write performance.
> (dd: results omitted for brevity, all normal for 64/8k fs)

I got mostly worse behaviour by tuning things related to dirtybuffers.

> [... You managed to fix the hangs, at a cost of too much performance. ]

> By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
> I found that
> (vfs.dirtybufthresh) * (number of bonnie++ process) must be slightly
> less than vfs.hidirtybuffers for reasonable performance without hangs.
>
> vfs.numdirtybuffers rises to
> (vfs.dirtybufthresh) * (number of bonnie++ process)
> and as long as this is below vfs.hidirtybuffers, the system will not
> hang.

I wonder why it is linear in the number of processes.  I saw indications
of similar behaviour (didn't test extensively).  1 bonnie created
about 2/3 of vfs.hidirtybuffers and 2 bonnies saturated at
vfs.hidirtybuffers.  This is with vfs.hidirtybuffers much smaller than
yours.
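
If anyone wants to experiment with that relation, here is a rough userland
helper (illustration only, not a recommendation): it reads
vfs.hidirtybuffers with sysctlbyname(3) and prints a vfs.dirtybufthresh
that keeps (number of writers) * dirtybufthresh a bit below it.

%%%
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	size_t len;
	int hidirty, nwriters;

	nwriters = argc > 1 ? atoi(argv[1]) : 2;
	if (nwriters < 1)
		nwriters = 1;
	len = sizeof(hidirty);
	if (sysctlbyname("vfs.hidirtybuffers", &hidirty, &len, NULL, 0) != 0) {
		perror("sysctlbyname");
		return (1);
	}
	/* Leave ~5% headroom below hidirtybuffers. */
	printf("for %d writers: vfs.dirtybufthresh <= %d\n",
	    nwriters, hidirty * 95 / 100 / nwriters);
	return (0);
}
%%%

With vfs.hidirtybuffers = 52139 (the default reported later in this mail)
and 2 writers, this prints a threshold near 24766.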


> ...
> I was not sure whether to disable the "bawrite(bp);" in the else part
> as well. Here is what I used for the next test (in ffs_write):
>
>  } else if (vm_page_count_severe() ||
>              buf_dirty_count_severe() ||
>              (ioflag & IO_ASYNC)) {
>          bp->b_flags |= B_CLUSTEROK;
>          bawrite(bp);

Note that this bawrite() is used when hidirtybuffers is reached.  This
is OK for this buffer if this buffer is large, but it doesn't stop
the flood -- the system stays saturated at hidirtybuffs until bonnie
stops writing.  This has a bad effect on other parts of the system,
e.g., if they are writing small buffers.

>          /* KWKWKW       } else if (xfersize + blkoffset == fs->fs_bsize) {
>          if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) {
>                  bp->b_flags |= B_CLUSTEROK;
>                  cluster_write(vp, bp, ip->i_size, seqcount);
>          } else {
>                  bawrite(bp);
>                  } KWKWKW */

OK.

I tried a bdwrite() for the async mount case.  Async mounts should give
delayed writes always, but it is a bugfeature that they give delayed
writes for critical metadata but normally async() writes for data (via
cluster_write()).  Using bdwrite here let me control the behaviour
using a combination of mount flags.  It became clearer that the
bugfeature is not just a bug.  Without cluster_write() forcing out
writes for async mounts, the write pressure is much the same as from
bonnie rewrite.

> During the dd tests, iostat shows a weird, sawtooth-like behavior:
> 64.00 26730 1670.61   0  0 12  3 85
> ...

Some bad interaction of bufdaemon with the writers.  I believe there are
locking problems, but didn't know anything about them until recently.
FreeBSD-5.2 is simpler and seems to have fewer of them.  I got a deadlock
in it by putting a sleep in the code above that waits for the write pressure
to clear: bufdaemon is apparently prevented from doing anything while we
hold a lock.

> ...
> After reverting the source change, I have decided to try mounting the
> file system with "-o noclusterr,noclusterw", and re-test. This is
> equivalent to disabling only the if-part of the expression in the
> source snippet above.

Good idea.

> dd: 1206199580 bytes/sec
> bonnie++: 550-700 MB/sec, no hang
>
> During the tests, vfs.numdirtybuffers remains low, lo/hidirtybuffers
> and dirtybufthresh are at their defaults:
> vfs.dirtybufthresh: 46925
> vfs.hidirtybuffers: 52139
> vfs.lodirtybuffers: 26069
> vfs.numdirtybuffers: 15
>
> So it looks like you were spot-on by suspecting cluster_write().

Not for the reasons that I suspected it!

> Further tests confirmed that "-o noclusterw" is sufficient to prevent
> the hangs and provide good performance. "-o noclusterr" on its own
> makes no difference; the system will hang.

I thought -o noclusterw would be worse, until now...

> I have also tested with BKVASIZE set to 65536. As you explained, this
> reduced the number of buffers:
> vfs.dirtybufthresh: 11753
> vfs.hidirtybuffers: 13059
> vfs.lodirtybuffers: 6529
>
> dd results remain unchanged from a BKVASIZE of 16k. bonnie++'s iostats
> with 2 processes in "Rewriting..." jump between 70 and 800 MB/sec and
> numdirtybuffers reaches max:
>
> vfs.numdirtybuffers: 13059
>
> Even though numdirtybuffers reaches hidirtybuffers, the system does
> not hang, but performance is not very good.

... I hoped BKVASIZE would be the problem, but it is at most a secondary
problem.

> Here is the relevant part from bonnie++'s source (in C++):
> ...
> So you are correct: bonnie++ re-reads the file that was created
> previously in the "Writing intelligently..." phase in blocks, modifies
> one byte in the block, and writes the block back.
>
> Something in this specific workload is triggering the huge buildup of
> numdirtybuffers when write-clustering is enabled.

I can explain this.  Readers and writers share the offset for the
sequential heuristic (it is per-open-file), so cluster_write() cannot
tell that bonnie's writes are sequential.  The sequential_heuristic()
function sees the offset moving back and forth, which normally means
random access, so it tells cluster_write() that the access is random,
although bonnie seeks back after each read() so that all the
write()s are actually sequential.  cluster_write() rarely if ever pushes out
random writes, so the write pressure builds up to a saturation value
almost instantly.  Then all subsequent writes work even more poorly
due to the pressure.  Truly random writes would give similar buildup
of pressure.
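
A simplified sketch of the shape of that heuristic (not the exact code in
vfs_vnops.c; the names here are made up): the state lives in the open file
and is shared by read() and write(), so the read advancing the shared
offset makes the following write look random.

%%%
#include <sys/types.h>

struct ofile {
	off_t	next_off;	/* offset expected for the next i/o */
	int	seqcount;	/* credit for apparently sequential i/o */
};

/*
 * Called for both reads and writes on the same open file.  bonnie's
 * read of each block moves next_off forward; the seek back and the
 * write of the same block then look non-sequential, so seqcount is
 * reset and cluster_write() is told the access is random.
 */
int
seq_heuristic(struct ofile *fp, off_t offset, ssize_t resid)
{
	if (offset == fp->next_off) {
		if (fp->seqcount < 127)	/* looks sequential: add credit */
			fp->seqcount++;
	} else
		fp->seqcount = 0;	/* offset moved: treat as random */
	fp->next_off = offset + resid;
	return (fp->seqcount);
}
%%%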

> I am now looking at vfs_cluster.c to see whether I can find which part
> is responsible for letting numdirtybuffers raise without bounds and
> why only *re* writing a file causes problems, not the initial
> writing. Any suggestions on where to start looking are very welcome.

It is very complicated, but it was easy to find its comments saying that
it tries not to force out the writes for non-sequential accesses.  I
am currently trying the following workarounds:

% diff -u2 vfs_cluster.c~ vfs_cluster.c
% --- vfs_cluster.c~	Tue Apr  6 20:11:55 2004
% +++ vfs_cluster.c	Tue Jun  4 04:53:52 2013
% @@ -593,4 +714,6 @@
%  }
% 
% +int	buf_dirty_count_large(void);
% +
%  /*
%   * Do clustered write for FFS.

This is like buf_dirty_count_severe() except it has a lower threshold.
It doesn't really help.
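
The diff doesn't show the body of buf_dirty_count_large(); presumably it
has the same shape as buf_dirty_count_severe() in vfs_bio.c, just with a
lower watermark (I use lodirtybuffers as a default; see below), so
something like this kernel fragment:

%%%
/*
 * Guess at the helper's body (kernel context; numdirtybuffers and
 * lodirtybuffers are the existing globals in vfs_bio.c):
 */
int
buf_dirty_count_large(void)
{

	return (numdirtybuffers >= lodirtybuffers);
}
%%%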

% @@ -630,4 +753,44 @@
%  		vp->v_lasta = vp->v_clen = vp->v_cstart = vp->v_lastw = 0;
% 
% +	/*
% +	 * XXX the 13 year old changes in the algorithm to use seqcount
% +	 * are too aggressive.  Avoid them by faking sequential access if
% +	 * we are under write pressure.  This causes us to write out
% +	 * clusters in more cases (in particular, when clusters are
% +	 * complete).

This doesn't really help.  It's too late to start faking after pressure
has built up.

% +	 *
% +	 * XXX we should probably use a more conservative condition for
% +	 * write pressure than is currently available.  Try not to leave
% +	 * anything to bufdaemon, since it can too easily fail to clear
% +	 * write pressure because the processes creating the write
% +	 * pressure have the vnode locked.  E.g., use lodirtybuffers
% +	 * instead of hidirtybuffers in buf_dirty_count_severe(), and
% +	 * also check dirtybufthresh.

I use lodirtybuffers as a default.  This is still too high, and almost
any value reduces performance.

% +	 *
% +	 * XXX seqcount itself is broken.  It gets cleared by reads of
% +	 * the same open file, even if the application seeks back after
% +	 * reading so that writes are perfectly sequential.  This case
% +	 * is probably unimportant, but bonnie does it.

As described differently above.

% +	 *
% +	 * XXX I think that if bonnie didn't seek back then the write
% +	 * pressure would be the same (provided the bonnie write size is
% +	 * smaller than the fs block size).  Then the writes wouldn't be
% +	 * sequential, so a fixed seqcount wouldn't help force them out.
% +	 * But they need to be forced out under write pressure, and I
% +	 * don't see anything to do it.  Check this.
% +	 *
% +	 * XXX the writing system shouldn't collapse when we get this
% +	 * slightly wrong.  Why doesn't bdwrite() force things when
% +	 * we don't?  It has some code to do so.  It is still better
% +	 * for us to force out full clusters than to leave it to
% +	 * the writing system to possibly force out sub-buffers.
% +	 *
% +	 * XXX try killing async under the same conditions.

This is too late to reduce pressure from async too.

% +	 */
% +	if (vm_page_count_severe() || buf_dirty_count_large()) {
% +		seqcount = 2;
% +		async = 0;
% +	}
% +
%  	if (vp->v_clen == 0 || lbn != vp->v_lastw + 1 ||
%  	    (bp->b_blkno != vp->v_lasta + btodb(lblocksize))) {
% @@ -719,4 +882,5 @@
%  		} else {	/* Wait for rest of cluster */
%  			vp->v_cstart = lbn;
% +			/* XXX problem if we're async and orig maxclen = 0. */
%  			bdwrite(bp);
%  		}

This case for async breaks the bugfeature of not honoring it.  The
bugfeature often prevents buildup of excessive pressure, but some cases
arrive here and then it doesn't.  This is the only old change I have in
cluster_write().

% @@ -726,8 +890,13 @@
%  		 * are operating sequentially, otherwise let the buf or
%  		 * update daemon handle it.
% +		 *
% +		 * Algorithm changeback: ignore seqcount here, at least for
% +		 * now, to work around readers breaking it for writers.  It
% +		 * is too late to start ignoring after write pressure builds
% +		 * up, since not writing out here is the main source of the
% +		 * buildup.
%  		 */
%  		bdwrite(bp);
% -		if (seqcount > 1)
% -			cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
% +		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
%  		vp->v_clen = 0;
%  		vp->v_cstart = lbn + 1;

This works fairly well.  Changing the algorithm back in all cases reduces
performance, but here we have usually finished with the buffer.  Even if
seqcount is fixed so that it isn't clobbered by reads, it might be right
to ignore it here, so as to reduce write pressure, until write pressure is
handled better.

I used the following program to generate i/o like bonnie's:

%%%
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

#define	IOSIZE	8192

static char buf[IOSIZE] __aligned(4096);

int
main(int argc, char **argv)
{
	ssize_t n;
	int ifd, ofd;

	if (argc != 2 && argc != 3)
		errx(1, "usage: ./prog file1 [file2]");
	ifd = open(argv[1], O_RDWR);
	if (ifd < 0)
		err(1, "open");
	if (argc == 3)
		ofd = open(argv[2], O_RDWR);
	else
		ofd = ifd;
	if (ofd < 0)
		err(1, "open");
	for (;;) {
		n = read(ifd, buf, IOSIZE);
		if (n == 0) {
			printf("finished: ioffset %#jx; ooffset = %#jx\n",
			    (uintmax_t)lseek(ifd, 0, SEEK_CUR),
			    (uintmax_t)lseek(ofd, 0, SEEK_CUR));
			exit(0);
		}
		if (n < 0)
			err(1, "read error");
		if (n != IOSIZE)
			errx(1, "short read %d", (int)n);
		if (ofd == ifd)
			lseek(ofd, -IOSIZE, SEEK_CUR);
		n = write(ofd, buf, IOSIZE);
		if (n < 0)
			err(1, "write error");
		if (n != IOSIZE)
			errx(1, "short write %d", (int)n);
	}
}
%%%

Usage: create a test file zz of a suitable size (the program normally
doesn't extend it); then ./prog zz does read/write like bonnie;
./prog zz zz does the same with independent file positions.

Bruce

