kern/178997: Heavy disk I/O may hang system
Klaus Weber
fbsd-bugs-2013-1 at unix-admin.de
Mon Jun 17 00:20:05 UTC 2013
The following reply was made to PR kern/178997; it has been noted by GNATS.
From: Klaus Weber <fbsd-bugs-2013-1 at unix-admin.de>
To: Bruce Evans <brde at optusnet.com.au>
Cc: Klaus Weber <fbsd-bugs-2013-1 at unix-admin.de>,
freebsd-gnats-submit at FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Mon, 17 Jun 2013 02:17:16 +0200
On Mon, Jun 10, 2013 at 11:00:29AM +1000, Bruce Evans wrote:
> On Mon, 10 Jun 2013, Klaus Weber wrote:
> >On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
> >>On Fri, 31 May 2013, Klaus Weber wrote:
>
> This thread is getting very long, and I will only summarize a couple
> of things that I found last week here. Maybe more later.
>
> o Everything seems to be working as well as intended (not very well)
> except in bufdaemon and friends. Perhaps it is already fixed there.
> I forgot to check which version of FreeBSD you are using. You may
> be missing some important fixes. There were some by kib@ a few
> months ago, and some by jeff@ after this thread started.
I have now tested with both an up-to-date 9-STABLE and a
10-CURRENT kernel from ftp.freebsd.org (8.6.2013):
FreeBSD filepile 9.1-STABLE FreeBSD 9.1-STABLE #27 r251798M: Sun Jun
16 16:19:18 CEST 2013 root at filepile:/usr/obj/usr/src/sys/FILEPILE
amd64
FreeBSD filepile 10.0-CURRENT FreeBSD 10.0-CURRENT #0: Sat Jun 8
22:10:23 UTC 2013 root at snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC
amd64
The bad news is that I can still reproduce the hangs reliably, with
the same methods as before. I do have the feeling that the machine
survives the re-write load a bit longer, but it's just a gut feeling -
I have no means to quantify this.
The "dirtybuf" sysctls have changed their default values; I
believe you were involved in the discussion about how they are
computed that led to changing them. They are now:
vfs.lodirtybuffers: 13251
vfs.hidirtybuffers: 26502
vfs.dirtybufthresh: 23851
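(For reference, these can be inspected, and with root adjusted, at
runtime on a FreeBSD system; the value 20000 below is just an
illustrative example, not a recommendation:)

```shell
# Read the current dirty-buffer watermarks (values are machine-dependent)
sysctl vfs.lodirtybuffers vfs.hidirtybuffers vfs.dirtybufthresh
# Temporarily lower the threshold at which writing processes start
# flushing dirty buffers themselves (requires root; 20000 is arbitrary)
sysctl vfs.dirtybufthresh=20000
```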
The 10-CURRENT kernel shows a message on the console when
the hang occurs (but see below for a twist):
g_vfs_done():da0p1[WRITE(offset=<x>, length=<y>)]error = 11
The message is printed to the console about once a second when the
hang occurs, until I hard-reset the server. <x> keeps changing,
<y> is mostly 65536, sometimes 131072 (like, one screenful of 64k's
then one 128k line, etc.)
Error=11 seems to be EDEADLK, so it looks like a -CURRENT kernel does
detect a deadlock situation whereas the -STABLE kernel doesn't. (I have
also re-checked the -CURRENT from 5.5.2013 that I briefly tested for
my initial report; it also shows these messages. My apologies for not
noticing this earlier.)
And now there is an additional point where things get really weird:
The "g_vfs_done()..." messages _only_ appear when the system hangs as
a result of bonnie++'s rewriting load. If I repeat the same test with
the test program you provided earlier, the system will still hang, but
without the console output.
(I have repeated the tests 10 times to make sure I wasn't just
imagining this; 5 times with bonnie++ (message appears every time),
5 times with your test program (message did not appear a single
time)).
I'm not really sure what difference between these two programs causes
this. Bonnie++ writes the file immediately before starting to rewrite
it, while your program works on pre-fabricated files - maybe that is it.
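For what it's worth, bonnie++'s rewrite pass is essentially a
block-wise read-modify-write over the file. A minimal sketch of that
access pattern using dd (file path, block size and chunk count are
arbitrary choices for illustration, not bonnie++'s actual parameters):

```shell
# Create a 1 MiB test file, then rewrite it in place chunk by chunk,
# mimicking a read-modify-write ("rewrite") pass over an existing file.
dd if=/dev/zero of=/tmp/rw-test bs=64k count=16 2>/dev/null
i=0
while [ "$i" -lt 16 ]; do
    # read one 64k chunk and write it straight back to the same offset
    dd if=/tmp/rw-test of=/tmp/rw-test bs=64k count=1 \
        skip="$i" seek="$i" conv=notrunc 2>/dev/null
    i=$((i + 1))
done
wc -c < /tmp/rw-test    # size is unchanged after the rewrite pass
```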
Anyway: Is there a way to find out which resource(s) are involved in
the EDEADLK-situation? Please keep in mind that I cannot enter the
debugger via the console, and panicing the machine leaves me with a
dead keyboard, which seems to make any useful debugging pretty hard.
(If you believe that a working debugger is required to make progress
on this, let me know and I will try to get a serial console working
somehow.)
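In case a serial console does become available, a sketch of FreeBSD
commands that might help narrow down the blocked resource while the
hang is in progress (the debug.deadlkres sysctls exist only if the
kernel was built with options DEADLKRES):

```shell
procstat -kk -a         # kernel stack traces of all threads;
                        # shows which function/lock each one sleeps in
ps -axl                 # the MWCHAN column names each process's wait channel
sysctl debug.deadlkres  # deadlock-resolver thresholds, if compiled in
```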
>>[observations regarding seek behavior]
I did not get to do much other testing this weekend (hanging the
system is pretty time-consuming, as after the hard-reset I have to
wait for the root partition to re-mirror).
I have thought about how I could observe the seek behavior of my disks
when testing, but I have come to the conclusion that I cannot really
do this. My RAID controller presents the array as a single disk to the
OS, and its on-board RAM allows it to cache writes and perform
read-aheads without the kernel knowing, so neither reading from nor
writing to two blocks far apart will necessarily result in a physical
head move.
I think I'll concentrate next on finding out where/why the hangs are
occurring. Getting rid of the hangs and hard resets will hopefully
speed up testing cycles.
Klaus