kern/178997: Heavy disk I/O may hang system

Klaus Weber fbsd-bugs-2013-1 at unix-admin.de
Fri May 31 16:40:04 UTC 2013


The following reply was made to PR kern/178997; it has been noted by GNATS.

From: Klaus Weber <fbsd-bugs-2013-1 at unix-admin.de>
To: Bruce Evans <brde at optusnet.com.au>
Cc: Klaus Weber <fbsd-bugs-2013-1 at unix-admin.de>,
	freebsd-gnats-submit at FreeBSD.org, freebsd-bugs at FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Fri, 31 May 2013 18:31:50 +0200

 Sorry for the late reply; testing took longer than expected.
 
 (I have combined your replies from separate mails into one, and
 reordered some of the text.)
 
 On Tue, May 28, 2013 at 10:03:10PM +1000, Bruce Evans wrote:
 > On Mon, 27 May 2013, Klaus Weber wrote:
 > >On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
 > >>On Sun, 26 May 2013, Klaus Weber wrote:
 
 > However, I have never been able to reproduce serious fragmentation problems
 > from using too-large-block sizes, or demonstrate significant improvements
 > from avoiding the known fragmentation problem by increasing BKVASIZE.
 > Perhaps my systems are too small, or have tuning or local changes that
 > accidentally avoid the problem.
 > 
 > Apparently you found a way to reproduce the serious fragmentaion
 > problems.  Try using a block size that doesn't ask for the problem.
 > (...)
 > The reduced fsck time and perhaps the reduced number of cylinder groups
 > are the main advantages of large clusters.  vfs-level clustering turns
 > most physical i/o's into 128K-blocks (especially for large files) so
 > there is little difference between the i/o speed for all fs block sizes
 > unless the fs block size is very small.
 
 I have now repeated the tests with several variations of block and
 fragment sizes. In all cases, I ran two tests:
 
 1) dd if=/dev/zero of=/mnt/t1/100GB-1.bin bs=100m count=1000
 2) bonnie++ -s 64g -n 0 -f -D -d /mnt/t1
     bonnie++ -s 64g -n 0 -f -D -d /mnt/t2
 
 The dd is simply to give a rough idea of the performance impact of the
 fs parameters; with the two bonnie++ processes, I was mainly
 interested in performance and hangs when both processes are in their
 "Rewriting" phase. I have also tested variations where the
 block:fragment ratio does not follow the 8:1 recommendation.
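 
 (For reference: the 64/8k and 32/4k file systems correspond to newfs
 invocations along the lines of "newfs -b 65536 -f 8192 <device>" and
 "newfs -b 32768 -f 4096 <device>"; the actual device names are
 omitted here.)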
 
 64/8k, kernel unpatched:
 dd: 1218155712 bytes/sec
 bonnie++: around 300 MB/sec, then drops to 0 and system hangs
 
 32/4k, kernel unpatched:
 dd: 844188881 bytes/sec
 bonnie++: jumps between 25 and 900 MB/sec, no hang
 
 16/2k, kernel unpatched:
 dd: 517996142 bytes/sec
 bonnie++: mostly 20-50 MB/sec, with 3-10 second "bursts" of 
                   400-650 MB/sec, no hang
 
 64/4k, kernel unpatched:
 dd: 1156041159 bytes/sec
 bonnie++: hangs the system quickly once both processes are rewriting
 
 32/8k, kernel unpatched:
 dd: 938072430 bytes/sec
 bonnie++: 29-50 MB/sec, with 3-10 second "bursts" of 
                   up to 650 MB/sec, no hang (but I canceled the test
 		  after an hour or so).
 
 So while a file system created with the current (32/4k) or old (16/2k)
 defaults does prevent the hangs, it also reduces the sequential write
 performance to 70% and 43%, respectively, of a 64/8k fs.
 
 The problem seems to be the 64k block size, not the 8k fragment size.
 
 In all cases, the vfs.numdirtybuffers count remained fairly small as
 long as the bonnie++ processes were writing the test files. Once
 rewriting started, it rose to vfs.hidirtybuffers (more slowly with
 only one process in "Rewriting", much faster when both processes were
 rewriting).
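 
 (For anyone reproducing this: vfs.numdirtybuffers can be watched
 during the runs with a trivial loop, e.g.
 "while :; do sysctl vfs.numdirtybuffers; sleep 3; done".)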
 
 > >00-04-57.log:vfs.numdirtybuffers: 52098
 > >00-05-00.log:vfs.numdirtybuffers: 52096
 > >[ etc. ]
 > 
 > This is a rather large buildup and may indicate a problem.  Try reducing
 > the dirty buffer watermarks. 
 
 I have tried to tune system parameters as per your advice, in an attempt
 to get a 64/8k fs running stably and with reasonable write performance.
 (dd results omitted for brevity; all normal for a 64/8k fs.)
 
 all with 64/8k, kernel unpatched:
 vfs.lodirtybuffers=250
 vfs.hidirtybuffers=1000
 vfs.dirtybufthresh=800
 
 bonnie++: 40-150 MB/sec, no hang
 
 vfs.numdirtybuffers rises to 1000 when both processes are rewriting.
 
 
 vfs.lodirtybuffers=1000
 vfs.hidirtybuffers=4000
 vfs.dirtybufthresh=3000
 
 bonnie++: 50-380 MB/sec, no hang
 
 For the next tests, I kept lo/hidirtybuffers at 1000/4000, and only
 varied dirtybufthresh:
 1200: bonnie++: 80-370 MB/sec
 1750: bonnie++: around 600 MB/sec
 1900: bonnie++: around 580 MB/sec. vfs.numdirtybuffers=3800 (i.e. it
           no longer reaches vfs.hidirtybuffers!)
 (no hangs in any of the tests).
 
 I then re-tested with lo/hidirtybuffers at their defaults, and only
 dirtybufthresh set to slightly less than half of hidirtybuffers:
 
 vfs.lodirtybuffers=26069
 vfs.hidirtybuffers=52139
 vfs.dirtybufthresh=26000
 
 dd: 1199121549 bytes/sec
 bonnie++: 180-650 MB/sec, mostly around 500, no hang
 
 
 By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
 I found that (vfs.dirtybufthresh) * (number of bonnie++ processes)
 must be slightly less than vfs.hidirtybuffers for reasonable
 performance without hangs.
 
 vfs.numdirtybuffers rises to
 (vfs.dirtybufthresh) * (number of bonnie++ processes),
 and as long as this stays below vfs.hidirtybuffers, the system will
 not hang.
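 
 (As a concrete example with the default vfs.hidirtybuffers of 52139:
 for two bonnie++ processes, vfs.dirtybufthresh has to stay a bit
 below half of 52139 (about 26000), which matches the 26000 setting
 above; for four processes it would have to be below roughly 13000.)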
 
 
 > >>I found that
 > >>the problem could be fixed by killing cluster_write() by turning it into
 > >>bdwrite() (by editing the running kernel using ddb, since this is easier
 > >>than rebuilding the kernel).  I was trying many similar things since I
 > >>had a theory that cluster_write() is useless.  [...]
 > >
 > >If that would provide a useful datapoint, I could try if that make a
 > >difference on my system. What changes would be required to test this?
 > >
 > >Surely its not as easy as replacing the function body of
 > >cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?
 > 
 > That should work for testing, but it is safer to edit ffs_write()
 > and remove the block where it calls cluster_write() (or bawrite()),
 > so that it falls through to call bdwrite() in most cases.
 
 I was not sure whether to disable the "bawrite(bp);" in the else part
 as well. Here is what I used for the next test (in ffs_write):
 
   } else if (vm_page_count_severe() ||
               buf_dirty_count_severe() ||
               (ioflag & IO_ASYNC)) {
           bp->b_flags |= B_CLUSTEROK;
           bawrite(bp);
           /* KWKWKW       } else if (xfersize + blkoffset == fs->fs_bsize) {
           if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) {
                   bp->b_flags |= B_CLUSTEROK;
                   cluster_write(vp, bp, ip->i_size, seqcount);
           } else {
                   bawrite(bp);
                   } KWKWKW */
   } else if (ioflag & IO_DIRECT) {
           bp->b_flags |= B_CLUSTEROK;
           bawrite(bp);
   } else {
           bp->b_flags |= B_CLUSTEROK;
           bdwrite(bp);
   }
 
 dd: 746804775 bytes/sec
 
 During the dd tests, iostat showed a weird, sawtooth-like behavior
 (columns: KB/t, tps and MB/s for the disk, followed by CPU
 us/ni/sy/in/id):
  64.00 26730 1670.61   0  0 12  3 85
  64.00 13308 831.73   0  0  4  1 95
  64.00 5534 345.85   0  0 10  1 89
  64.00  12  0.75   0  0 16  0 84
  64.00 26544 1658.99   0  0 10  2 87
  64.00 12172 760.74   0  0  3  1 95
  64.00 8190 511.87   0  0  8  1 91
  64.00  10  0.62   0  0 14  0 86
  64.00 22578 1411.11   0  0 14  3 83
  64.00 12634 789.63   0  0  3  1 95
  64.00 11695 730.96   0  0  6  2 92
  48.00   7  0.33   0  0 13  0 87
  64.00 11801 737.58   0  0 17  1 82
  64.00 19113 1194.59   0  0  6  2 92
  64.00 15996 999.77   0  0  4  2 94
  64.00   3  0.19   0  0 13  0 87
  64.00 10202 637.63   0  0 16  1 83
  64.00 20443 1277.71   0  0  8  2 90
  64.00 15586 974.10   0  0  4  1 95
  64.00 682 42.64   0  0 13  0 87
 
 With two bonnie++ processes in the "Writing intelligently" phase,
 iostat jumped between 9 and 350 MB/sec. I cancelled the test before
 the first bonnie++ process reached the "Rewriting..." phase, due to
 the dismal performance.
 
 Already during the "Writing intelligently" phase, vfs.numdirtybuffers
 reached vfs.hidirtybuffers (in previous tests, vfs.numdirtybuffers
 only rose to high numbers in the "Rewriting..." phase).
 
 
 After reverting the source change, I decided to try mounting the
 file system with "-o noclusterr,noclusterw" and re-test. This is
 equivalent to disabling only the cluster_write() branch in the
 source snippet above.
 
 dd: 1206199580 bytes/sec
 bonnie++: 550-700 MB/sec, no hang
 
 During the tests, vfs.numdirtybuffers remained low; lo/hidirtybuffers
 and dirtybufthresh were at their defaults:
 vfs.dirtybufthresh: 46925
 vfs.hidirtybuffers: 52139
 vfs.lodirtybuffers: 26069
 vfs.numdirtybuffers: 15
 
 So it looks like you were spot-on in suspecting cluster_write().
 Further tests confirmed that "-o noclusterw" is sufficient to prevent
 the hangs and provide good performance. "-o noclusterr" on its own
 makes no difference; the system will hang.
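 
 (The mount command for this is simply "mount -o noclusterw <device>
 <mountpoint>", or noclusterw added to the options field of the
 corresponding /etc/fstab entry.)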
 
 
 I have also tested with write-clustering enabled, but with
 vfs.write_behind=0 and vfs.write_behind=2, respectively. In both
 cases, the system hung with two bonnie++ processes in "Rewriting...".
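 
 (For reference: vfs.write_behind defaults to 1; 0 disables the
 write-behind heuristic.)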
 
 
 I have also tested with BKVASIZE set to 65536. As you explained, this
 reduced the number of buffers:
 vfs.dirtybufthresh: 11753
 vfs.hidirtybuffers: 13059
 vfs.lodirtybuffers: 6529
 
 dd results remain unchanged from a BKVASIZE of 16k. bonnie++'s iostat
 output with 2 processes in "Rewriting..." jumps between 70 and
 800 MB/sec, and numdirtybuffers reaches its maximum:
 
 vfs.numdirtybuffers: 13059
 
 Even though numdirtybuffers reaches hidirtybuffers, the system does
 not hang, but performance is not very good.
 
 With BKVASIZE set to 65536 _and_ the fs mounted "-o noclusterw",
 the performance is the same as with BKVASIZE of 16k, and the system
 does not hang.
 
 I have now reverted BKVASIZE to its default, as the main factor for a
 stable and fast system seems to be the noclusterw mount option.
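 
 (For reference: BKVASIZE is a compile-time constant in sys/param.h,
 16384 by default, so testing 65536 amounts to changing that define
 and rebuilding the kernel.)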
 
 
 > >>Apparently you found a way to reproduce the serious fragmentaion
 > >>problems.
 > >
 > >A key factor seems to be the "Rewriting" operation. I see no problem
 > >during the "normal" writing, nor could I reproduce it with concurrent
 > >dd runs.
 > 
 > I don't know exactly what bonnie rewrite bmode does.  Is it just read/
 > [modify]/write of sequential blocks with a fairly small block size?
 > Old bonnie docs say that the block size is always 8K.  One reason I
 > don't like bonnie.  Clustering should work fairly normally with that.
 > Anything with random seeks would break clustering.
 
 Here is the relevant part from bonnie++'s source (in C++):
 ---------------
 bufindex = 0;
 for(words = 0; words < num_chunks; words++)
 { // for each chunk in the file
   dur.start();
   if (file.read_block(PVOID(buf)) == -1)
     return 1;
   bufindex = bufindex % globals.io_chunk_size();
   buf[bufindex]++;
   bufindex++;
   if (file.seek(-1, SEEK_CUR) == -1)
     return 1;
   if (file.write_block(PVOID(buf)) == -1)
     return io_error("re write(2)");
 -----------
 globals.io_chunk_size() is 8k (by default and in all of my tests), and
 bonnie++ makes sure that buf is page-aligned. 
 
 So you are correct: bonnie++ re-reads the file that was created
 previously in the "Writing intelligently..." phase in blocks, modifies
 one byte in the block, and writes the block back.
 
 Something in this specific workload is triggering the huge buildup of
 numdirtybuffers when write-clustering is enabled.
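 
 To make this easier to reproduce without bonnie++, the rewrite loop
 boils down to something like the following rough sketch (the file
 name is just a placeholder for a large pre-existing file, and
 bonnie++'s timing and rotating byte index are omitted):
 -----------
 #include <sys/types.h>
 #include <err.h>
 #include <fcntl.h>
 #include <unistd.h>
 
 #define CHUNK 8192
 
 int
 main(void)
 {
     static char buf[CHUNK];
     int fd;
 
     /* placeholder path; assumed to be a large, previously written file */
     fd = open("/mnt/t1/rewrite-test.bin", O_RDWR);
     if (fd == -1)
         err(1, "open");
 
     /* read a chunk, modify one byte, seek back, rewrite it in place */
     while (read(fd, buf, CHUNK) == CHUNK) {
         buf[0]++;
         if (lseek(fd, -(off_t)CHUNK, SEEK_CUR) == -1)
             err(1, "lseek");
         if (write(fd, buf, CHUNK) != CHUNK)
             err(1, "write");
     }
     close(fd);
     return (0);
 }
 -----------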
 
 
 I am now looking at vfs_cluster.c to see whether I can find which part
 is responsible for letting numdirtybuffers rise without bounds, and
 why only *re*writing a file causes problems, not the initial
 writing. Any suggestions on where to start looking are very welcome.
 
 Klaus

