kern/178997: Heavy disk I/O may hang system

Klaus Weber fbsd-bugs-2013-1 at unix-admin.de
Sun May 26 20:00:00 UTC 2013


>Number:         178997
>Category:       kern
>Synopsis:       Heavy disk I/O may hang system
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun May 26 20:00:00 UTC 2013
>Closed-Date:
>Last-Modified:
>Originator:     Klaus Weber
>Release:        Tested with 9/STABLE, verified in 10-CURRENT
>Organization:
-
>Environment:
FreeBSD filepile 9.1-STABLE FreeBSD 9.1-STABLE #5 r250475: Sun May 12 19:14:21 CEST 2013     root at filepile:/usr/obj/usr/src/sys/FILEPILE  amd64
(Kernel config has some drivers removed and debugging options added. Tested with GENERIC as well)
>Description:
Heavy disk I/O (two bonnie++ processes working on the same disk simultaneously) causes extreme degradation in disk throughput (combined throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows when both bonnie++ processes are in the "Rewriting..." phase. 

Sometimes, the system "hangs" (throughput reaches 0 MB/sec); in that case, the system does not recover from this state. "Hang" means that the system still responds to ping etc. and processes that are already running continue, but every command that requires disk access (even on disks other than the disk being tested) just hangs. If shutdown can be started without disk access (i.e. the binaries are still in the cache), the system begins the shutdown, but fails to sync all buffers.

Most testing was done on 9/STABLE and an Areca ARC-1882 controller (arcmsr), but I have test-booted a 10.0-CURRENT kernel from early May ("kernel-10.0-CURRENT-5.5.2013") and verified that the problem still exists there. I have also tested with an on-board SATA port ("Intel Cougar Point AHCI SATA controller"); with this setup, I can only observe extreme throughput degradation, but no hangs (see below for details).

Test-setup:
da0 is a 600 GB volume on the Areca controller:
da0 at arcmsr0 bus 0 scbus0 target 0 lun 0
da0: <Areca ARC-1882-VOL#000 R001> Fixed Direct Access SCSI-5 device 
da0: 600.000MB/s transfers
da0: Command Queueing enabled
da0: 572202MB (1171870720 512 byte sectors: 255H 63S/T 72945C)

gpart create -s gpt da0
gpart add -s 500G -t freebsd-ufs -l test0 -a 1M da0
newfs -b 64k -f 8k /dev/da0p1
mount /dev/da0p1 /mnt && mkdir /mnt/t{1,2} && chmod g+w /mnt/t{1,2}

To observe the problem, I start two bonnie++ instances (installed from ports) like this:
1) bonnie++ -s 64g -n 0 -d /mnt/t1
2) bonnie++ -s 64g -n 0 -d /mnt/t2

(In order to better see the various stages, I start the second bonnie++ about a minute after the first; a combined sketch follows below. There is nothing magic about 64 GB; the size just needs to be large enough to keep the disk busy for a few minutes.)
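
(For convenience, both instances can also be launched staggered from one shell; this is just a sketch equivalent to the two commands above, with the 60-second delay standing in for the "about a minute" gap:)

# start the first bonnie++ in the background, wait ~1 minute, then start the second
bonnie++ -s 64g -n 0 -d /mnt/t1 &
sleep 60
bonnie++ -s 64g -n 0 -d /mnt/t2 &
wait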

I then observe disk throughput for da0 with 

iostat -dC -w 1 da0

             da0             cpu
  KB/t tps  MB/s  us ni sy in id
[... first bonnie++ starts, "writing 1 byte at a time" phase ...]
 64.00   6  0.37   0  0 12  0 88
 64.00   4  0.25   0  0 13  0 87
 64.00   6  0.37   0  0 12  0 88
[first bonnie++ enters "writing intelligently" phase]
 63.97 3847 240.33   0  0 12  0 87
 64.00 20824 1301.51   0  0 14  0 85
 64.00 20449 1278.04   0  0 14  1 85
 64.00 17840 1115.01   0  0 14  1 85
 64.00 16595 1037.21   0  0 13  1 85
[this is my baseline for upper limit on throughput: ~1 GB/sec]
[...second bonnie++ starts, no big changes]
 64.00 12869 804.30   0  0 15  1 84
[second bonnie++ gets to "Writing intelligently" phase]
 64.00 7773 485.78   0  0  9  0 90
 64.00 7310 456.86   0  0  7  1 92
 64.00 7430 464.40   0  0 10  0 89
 64.00 7151 446.93   0  0  6  1 93
 64.00 7410 463.10   0  0  8  0 91
[...]
 64.00 7349 459.29   0  0 10  0 90
 64.00 7478 467.35   0  0  7  0 93
 64.00 7153 447.06   0  0  9  0 90
 64.00 7335 458.41   0  0  7  0 93
 64.00 7918 494.88   0  0 10  0 90
[first bonnie++ gets to "Rewriting" phase]
 64.00 7426 464.10   0  0 12  1 87
 64.00 7251 453.17   0  0  9  0 90
 64.00 7036 439.75   0  0  6  1 93
 64.00 7296 455.98   0  0  8  1 90
 64.00 6808 425.51   0  0  6  1 93
 64.00 7531 470.66   0  0  8  1 91
 64.00 7323 457.67   0  0  8  0 91
 64.00 7666 479.14   0  0  6  0 94
 64.00 11921 745.07   0  0 12  1 87
 64.00 12043 752.69   0  0 12  1 87
 64.00 10814 675.89   0  0 11  1 89
 64.00 10925 682.82   0  0 12  0 88
[...]
[second bonnie++ starts "Rewriting" phase]
 64.00 10873 679.56   0  0 10  1 89
 64.00 12242 765.12   0  0 13  1 86
 64.00 10739 671.19   0  0 11  0 88
 64.00 3486 217.91   0  0  8  0 91
 64.00 6327 395.42   0  0 14  0 86
 64.00 6637 414.84   0  0  9  0 90
 64.00 4931 308.19   0  0 10  0 89
 64.00 4874 304.63   0  0 13  0 87
 64.00 4322 270.11   0  0 18  0 82
 64.00 4616 288.52   0  0 19  0 80
 64.00 3567 222.96   0  0 20  0 80
 64.00 3395 212.16   0  0 30  0 70
 64.00 1694 105.89   0  0 33  0 67
 64.00 1898 118.63   0  0 36  0 64
  0.00   0  0.00   0  0 12  0 88
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  4  0 96
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
[etc. The system does not recover from this state. All attempts to do disk I/O hang.]

When both bonnie++ processes are in their "Rewriting" phase, the system hangs within a few seconds. Both bonnie++ processes are in state "nbufkv". bufdaemon takes about 40% CPU time and is in state "qsleep" when not active.
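
(The states can be inspected with standard base-system tools while the hang is in progress; a minimal sketch, with <pid> as a placeholder for a bonnie++ or bufdaemon PID:)

# STATE column shows the wait channel (e.g. nbufkv, qsleep) per thread
top -SH
# wait channel and state for a specific process
ps -o pid,state,wchan,command -p <pid>
# kernel stack of a hung process
procstat -kk <pid>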

I have also briefly tested a single disk instead of a RAID volume. In that case, I could not get the system to hang, but performance drops from ~150 MB/sec to ~1-3 MB/sec. I do not know whether that indicates a problem with the arcmsr driver, or whether I simply cannot create "enough" I/O load with a single disk:

[same setup as above, but on single disk connected to onboard SATA port.]
Baseline: 150 MB/sec (first bonnie++ in "Writing intelligently" phase):
            ada2             cpu
  KB/t tps  MB/s  us ni sy in id
 128.00 1221 152.60   0  0  2  0 97
 128.00 1173 146.60   0  0  2  0 98
 128.00 1171 146.35   0  0  2  0 98
[... first bonnie++ in "Rewriting", second in "Writing intelligently" phase:]
 127.83 1140 142.30   0  0  2  0 97
 128.00 1121 140.11   0  0  2  0 97
 128.00 1114 139.24   0  0  1  0 98
 127.29 904 112.39   0  0  5  0 95
[second bonnie++ starts rewriting]
 120.24  66  7.74   0  0  0  0 100
 128.00  59  7.37   0  0  0  0 100
 122.43  69  8.24   0  0  1  0 99
 128.00 222 27.72   0  0  0  0 100
 128.00 1122 140.23   0  0  3  0 97
 128.00 1080 134.99   0  0  3  0 97
 128.00 1157 144.61   0  0  3  0 97
 127.75 504 62.94   0  0  2  0 98
 128.00  59  7.37   0  0  1  0 99
 128.00 708 88.54   0  0  2  0 98
 128.00 969 121.13   0  0 11  0 89
 127.89 1161 144.99   0  0  3  0 97
 127.60 479 59.63   0  0  1  0 99
 128.00 596 74.55   0  0  2  0 98
 128.00 1169 146.10   0  0  3  0 97
 127.88 1076 134.37   0  0  2  0 97
 127.89 1202 150.10   0  0  3  0 97
 128.00 1074 134.24   0  0  4  0 96
 128.00 1039 129.87   0  0  8  0 92
 128.00 913 114.14   0  0 10  0 90
 127.73 896 111.78   0  0 11  0 89
 128.00 930 116.26   0  0 10  0 90
 127.93 909 113.57   0  0 11  0 89
 128.00 887 110.89   0  0 13  0 87
 128.00 743 92.91   0  0 18  0 81
 128.00 548 68.56   0  0 23  0 76
 128.00 523 65.43   0  0 24  0 76
 127.89 587 73.36   0  0 21  0 79
 127.89 586 73.24   0  0 22  0 78
 128.00 465 58.07   0  0 28  0 72
 127.89 558 69.74   0  0 25  0 75
 127.79 595 74.30   0  0 21  0 79
 128.00 593 74.18   0  0 23  0 77
 128.00 546 68.31   0  0 22  0 78
 127.70 428 53.32   0  0 32  0 68
 128.00 561 70.18   0  0 25  0 75
 128.00 645 80.67   0  0 22  0 78
 127.78 582 72.68   0  0 23  0 77
 114.68 518 58.07   0  0 27  0 73
 103.33 497 50.20   0  0 29  0 71
 110.00 128 13.74   0  0 35  0 65
 65.94  33  2.12   0  0 37  0 63
 74.97 286 20.92   0  0 36  0 64
 105.83 306 31.59   0  0 34  0 66
 95.12 399 37.03   0  0 30  0 70
 108.57 336 35.59   0  0 29  0 71
 127.22 246 30.53   0  0 41  0 59
 104.16 255 25.91   0  0 32  0 68
 103.68 528 53.51   0  0 23  0 77
 127.14 373 46.27   0  0 29  0 71
 90.65 341 30.16   0  0 35  0 65
 127.30 368 45.70   0  0 30  0 70
 66.56  25  1.62   0  0 37  0 63
 66.00  32  2.06   0  0 38  0 62
 94.99  95  8.80   0  0 38  0 62
 94.00 433 39.71   0  0 35  0 65
 107.66 666 70.06   0  0 26  0 74
 119.43 478 55.69   0  0 27  0 73
 95.64 441 41.15   0  0 28  0 72
 116.45 449 51.01   0  0 29  0 71
 102.22 503 50.26   0  0 28  0 72
 127.62 509 63.50   0  0 29  0 71
 111.84 388 42.33   0  0 29  0 71

(i.e. disk throughput jumps wildly between full speed and ~1% of maximum, despite continuous I/O pressure)


Unfortunately, I cannot use kdb on this machine - "ALT-CTRL-ESC" does nothing and "sysctl debug.kdb.enter=1" just says "debug.kdb.enter: 0 -> 0". "sysctl debug.kdb.panic=1" does panic the machine, but the keyboard is not working.
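
(Whether a kdb backend is compiled into the running kernel can be checked as sketched below; this is only a guess at why debug.kdb.enter stays at 0, not something verified on this kernel:)

# empty output would mean no debugger backend is available in this kernel
sysctl debug.kdb.available
# the kernel config would need at least:
#   options KDB
#   options DDB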

I have, however, logged the sysctl vfs subtree every second while the problem occurs:
As long as both bonnie++ processes are in the "Writing intelligently..." phase, vfs.numdirtybuffers stays mostly between 80 and 100. When the first bonnie++ enters "Rewriting..." (at 3:40 below), vfs.numdirtybuffers starts to increase (about 150-350 new dirty buffers per second). When the second bonnie++ enters "Rewriting..." as well (at 4:24), the increase speeds up considerably (about 3000-5000 new dirty buffers/sec). A few seconds later, vfs.numdirtybuffers reaches its limit and remains there until the machine is rebooted:
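
(A per-second logging loop of this kind can be as simple as the sketch below; the file naming is an assumption - the names in the listing suggest elapsed time since the start of the run, while the sketch just uses wall-clock time:)

# dump the vfs sysctl subtree once per second into timestamped files
while true; do
    sysctl vfs > "$(date +%H-%M-%S).log"
    sleep 1
done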

00-03-28.log:vfs.numdirtybuffers: 107
00-03-29.log:vfs.numdirtybuffers: 106
00-03-30.log:vfs.numdirtybuffers: 105
00-03-31.log:vfs.numdirtybuffers: 105
00-03-32.log:vfs.numdirtybuffers: 110
00-03-33.log:vfs.numdirtybuffers: 112
00-03-34.log:vfs.numdirtybuffers: 110
00-03-35.log:vfs.numdirtybuffers: 115
00-03-36.log:vfs.numdirtybuffers: 117
00-03-37.log:vfs.numdirtybuffers: 118
00-03-38.log:vfs.numdirtybuffers: 118
00-03-39.log:vfs.numdirtybuffers: 118
00-03-40.log:vfs.numdirtybuffers: 118
[first bonnie++ starts Rewriting]
00-03-41.log:vfs.numdirtybuffers: 294
00-03-42.log:vfs.numdirtybuffers: 404
00-03-43.log:vfs.numdirtybuffers: 959
00-03-44.log:vfs.numdirtybuffers: 1071
00-03-45.log:vfs.numdirtybuffers: 1155
00-03-46.log:vfs.numdirtybuffers: 1253
00-03-47.log:vfs.numdirtybuffers: 1366
00-03-48.log:vfs.numdirtybuffers: 1457
00-03-49.log:vfs.numdirtybuffers: 1549
00-03-50.log:vfs.numdirtybuffers: 1680
00-03-51.log:vfs.numdirtybuffers: 1904
00-03-52.log:vfs.numdirtybuffers: 2136
00-03-53.log:vfs.numdirtybuffers: 2363
00-03-54.log:vfs.numdirtybuffers: 2503
00-03-55.log:vfs.numdirtybuffers: 3023
00-03-56.log:vfs.numdirtybuffers: 3268
00-03-57.log:vfs.numdirtybuffers: 3601
00-03-58.log:vfs.numdirtybuffers: 3743
00-03-59.log:vfs.numdirtybuffers: 3904
00-04-00.log:vfs.numdirtybuffers: 4106
00-04-01.log:vfs.numdirtybuffers: 4325
00-04-02.log:vfs.numdirtybuffers: 4461
00-04-03.log:vfs.numdirtybuffers: 4643
00-04-04.log:vfs.numdirtybuffers: 5063
00-04-05.log:vfs.numdirtybuffers: 5202
00-04-06.log:vfs.numdirtybuffers: 5456
00-04-07.log:vfs.numdirtybuffers: 5593
00-04-08.log:vfs.numdirtybuffers: 5819
00-04-09.log:vfs.numdirtybuffers: 6092
00-04-10.log:vfs.numdirtybuffers: 6249
00-04-11.log:vfs.numdirtybuffers: 6412
00-04-12.log:vfs.numdirtybuffers: 6591
00-04-13.log:vfs.numdirtybuffers: 6698
00-04-14.log:vfs.numdirtybuffers: 6853
00-04-15.log:vfs.numdirtybuffers: 7031
00-04-16.log:vfs.numdirtybuffers: 7192
00-04-17.log:vfs.numdirtybuffers: 7387
00-04-18.log:vfs.numdirtybuffers: 7609
00-04-19.log:vfs.numdirtybuffers: 7757
00-04-20.log:vfs.numdirtybuffers: 7888
00-04-21.log:vfs.numdirtybuffers: 8156
00-04-22.log:vfs.numdirtybuffers: 8362
00-04-23.log:vfs.numdirtybuffers: 8558
[second bonnie goes Rewriting as well]
00-04-24.log:vfs.numdirtybuffers: 11586
00-04-25.log:vfs.numdirtybuffers: 16325
00-04-26.log:vfs.numdirtybuffers: 24333
00-04-27.log:vfs.numdirtybuffers: 29058
00-04-28.log:vfs.numdirtybuffers: 33752
00-04-29.log:vfs.numdirtybuffers: 38792
00-04-30.log:vfs.numdirtybuffers: 42663
00-04-31.log:vfs.numdirtybuffers: 47305
00-04-33.log:vfs.numdirtybuffers: 52070
00-04-34.log:vfs.numdirtybuffers: 52066
00-04-35.log:vfs.numdirtybuffers: 52063
00-04-37.log:vfs.numdirtybuffers: 52098
00-04-39.log:vfs.numdirtybuffers: 52098
00-04-41.log:vfs.numdirtybuffers: 52097
00-04-44.log:vfs.numdirtybuffers: 52097
00-04-46.log:vfs.numdirtybuffers: 52098
00-04-49.log:vfs.numdirtybuffers: 52096
00-04-52.log:vfs.numdirtybuffers: 52098
00-04-54.log:vfs.numdirtybuffers: 52096
00-04-57.log:vfs.numdirtybuffers: 52098
00-05-00.log:vfs.numdirtybuffers: 52096
[ etc. ]

vfs.hidirtybuffers: 52139
vfs.lodirtybuffers: 26069
(the machine has 32 GB RAM)

(I could upload the archive of the vfs logs somewhere, so you can check other values over time, if that would be helpful)


I have found http://freebsd.1045724.n5.nabble.com/7-1-RELEASE-I-O-hang-td3966945.html , which seems to describe the same issue (also in combination with a RAID controller, albeit a different one).

The symptoms described in r189878 (issues in bufdaemon, causing processes to hang in "nbufkv") sound similar. 

Please let me know if there is any other info that I can provide.


>How-To-Repeat:
gpart create -s gpt da0
gpart add -s 500G -t freebsd-ufs -l test0 -a 1M da0
newfs -b 64k -f 8k /dev/da0p1
mount /dev/da0p1 /mnt && mkdir /mnt/t{1,2} && chmod g+w /mnt/t{1,2}

Start two bonnie++ instances (installed from ports) like this:
1) bonnie++ -s 64g -n 0 -d /mnt/t1
2) bonnie++ -s 64g -n 0 -d /mnt/t2

Observe disk throughput for da0 with 

iostat -dC -w 1 da0

in particular when both bonnie++ processes are in their "Rewriting..." phase.
>Fix:
No fix known.

>Release-Note:
>Audit-Trail:
>Unformatted:

