deadlock or bad disk ? RELENG_8
Jeremy Chadwick
freebsd at jdc.parodius.com
Mon Jul 19 03:58:47 UTC 2010
On Sun, Jul 18, 2010 at 08:34:24PM -0700, Jeremy Chadwick wrote:
> On Sun, Jul 18, 2010 at 11:01:03PM -0400, Mike Tancsa wrote:
> > I do track some basic mem stats via rrd. Looking at the graphs upto
> > that period, nothing unusual was happening
>
> sysctl vm.stats.vm | grep swap
>
> Here's another post basically reiterating the same thing: that the
> controller the swap slice is on (in your case a 4-disk RAID array) is
> basically taking too long to respond.
>
> http://groups.google.com/group/mailing.freebsd.stable/browse_thread/thread/2e7faeeaca719c52/cdcd4601ce1b90c5
>
> I have no idea where the timeout values are in the kernel. I do see
> these two entries in sysctl that look to be of interest though. You
> might try adjusting these (not sure if they're sysctls or loader.conf
> tunables only):
>
> vm.swap_idle_threshold2: 10
> vm.swap_idle_threshold1: 2
>
> Descriptions:
>
> vm.swap_idle_threshold2: Time before a process will be swapped out
> vm.swap_idle_threshold1: Guaranteed swapped in time for a process
>
> I want to point out that the actual amount of data being swapped out is
> fairly small -- note the "size" fields the swap_pager kernel messages.
> There doesn't necessarily have to be a shortage of memory to cause a
> swapout (case in point, see above).

I took a look at the RELENG_8 code responsible for printing this
message, in src/sys/vm/swap_pager.c:

/*
 * SWAP_PAGER_GETPAGES() - bring pages in from swap
 *
 * Attempt to retrieve (m, count) pages from backing store, but make
 * sure we retrieve at least m[reqpage]. We try to load in as large
 * a chunk surrounding m[reqpage] as is contiguous in swap and which
 * belongs to the same object.
 *
 * The code is designed for asynchronous operation and
 * immediate-notification of 'reqpage' but tends not to be
 * used that way. Please do not optimize-out this algorithmic
 * feature, I intend to improve on it in the future.
 *
 * The parent has a single vm_object_pip_add() reference prior to
 * calling us and we should return with the same.
 *
 * The parent has BUSY'd the pages. We should return with 'm'
 * left busy, but the others adjusted.
 */
static int
swap_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage)
{
....
	/*
	 * wait for the page we want to complete. VPO_SWAPINPROG is always
	 * cleared on completion. If an I/O error occurs, SWAPBLK_NONE
	 * is set in the meta-data.
	 */
	VM_OBJECT_LOCK(object);
	while ((mreq->oflags & VPO_SWAPINPROG) != 0) {
		mreq->oflags |= VPO_WANTED;
		PCPU_INC(cnt.v_intrans);
		if (msleep(mreq, VM_OBJECT_MTX(object), PSWP, "swread", hz*20)) {
			printf(
"swap_pager: indefinite wait buffer: bufobj: %p, blkno: %jd, size: %ld\n",
			    bp->b_bufobj, (intmax_t)bp->b_blkno, bp->b_bcount);
		}
	}

So I believe this indicates the message only gets printed during swapin,
not swapout, meaning it's happening during an I/O read from da0.
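
To make the shape of that loop easier to see, here is a rough userland
analogy using POSIX threads. It is only a sketch, not kernel code: the
struct, the function names and the millisecond timeout are all made up
for illustration, with pthread_cond_timedwait() standing in for
msleep() and a plain boolean standing in for VPO_SWAPINPROG.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for a page with a read in flight. */
struct page_wait {
	pthread_mutex_t lock;
	pthread_cond_t  done;
	bool            in_progress;	/* plays the role of VPO_SWAPINPROG */
};

/* Build an absolute CLOCK_REALTIME deadline timeout_ms from now. */
struct timespec
deadline_after_ms(long timeout_ms)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	ts.tv_sec += timeout_ms / 1000;
	ts.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (ts.tv_nsec >= 1000000000L) {
		ts.tv_sec++;
		ts.tv_nsec -= 1000000000L;
	}
	return (ts);
}

/*
 * Sleep until in_progress clears.  Each time timeout_ms passes without
 * completion, print a warning and go back to sleep, which is the same
 * "complain but keep waiting" behaviour as the kernel loop.
 * Returns the number of warnings printed.
 */
int
wait_for_page(struct page_wait *pw, long timeout_ms)
{
	int warnings = 0;

	assert(timeout_ms > 0);
	pthread_mutex_lock(&pw->lock);
	while (pw->in_progress) {
		struct timespec deadline = deadline_after_ms(timeout_ms);

		if (pthread_cond_timedwait(&pw->done, &pw->lock,
		    &deadline) == ETIMEDOUT) {
			fprintf(stderr,
			    "wait_for_page: still waiting after %ld ms\n",
			    timeout_ms);
			warnings++;
		}
	}
	pthread_mutex_unlock(&pw->lock);
	return (warnings);
}

/* Thread body: pretend the read takes 250 ms, then mark it complete. */
void *
complete_after_delay(void *arg)
{
	struct page_wait *pw = arg;
	struct timespec delay = { 0, 250 * 1000000L };

	nanosleep(&delay, NULL);
	pthread_mutex_lock(&pw->lock);
	pw->in_progress = false;
	pthread_cond_broadcast(&pw->done);
	pthread_mutex_unlock(&pw->lock);
	return (NULL);
}
```

If complete_after_delay() runs in a second thread while the first calls
wait_for_page() with a 100 ms timeout, the waiter prints at least one
warning before the flag clears. The point is that the warning does not
abort the wait; the loop only exits once the read is marked complete.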

Reading msleep(9) gives us some details about what "swread"
correlates with (now I know where that column in ps/top comes from), and
the timeout value (hz*20):

    The parameter wmesg is a string describing the sleep condition for tools
    like ps(1). Due to the limited space of those programs to display
    arbitrary strings, this message should not be longer than 6 characters.

    The parameter timo specifies a timeout for the sleep. If timo is not 0,
    then the thread will sleep for at most timo / hz seconds. If the timeout
    expires, then the sleep function will return EWOULDBLOCK.

So what's hz? Well, I want to assume it's kern.hz, which defaults to
1000. That makes timo = hz*20 = 1000*20 = 20000 ticks, so the timeout
would be 20000/1000 = 20 seconds. That's a pretty long time to be
waiting for an I/O read to return.
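
The arithmetic above can be checked with a few lines of C. This is just
a sketch of the "timo / hz" rule quoted from msleep(9); the function
names are invented here, and any hz value other than 1000 is only an
illustrative input, not something taken from the system in question.

```c
#include <assert.h>
#include <stdio.h>

/* msleep(9): a thread sleeps for at most timo / hz seconds. */
int
timo_to_seconds(int timo, int hz)
{
	assert(hz > 0);
	return (timo / hz);
}

/* Print the swap-read timeout (hz * 20 ticks) for a given tick rate. */
void
print_swread_timeout(int hz)
{
	int timo = hz * 20;

	printf("hz=%d: timo=%d ticks, %d seconds\n",
	    hz, timo, timo_to_seconds(timo, hz));
}
```

Calling print_swread_timeout(1000) reports 20000 ticks and 20 seconds.
Because the timeout is written as a multiple of hz, the tick rate
cancels out, so a different hz (say 100) still works out to 20 seconds.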

So does vm.swap_idle_threshold1 play a role? I doubt it. The code is
in src/sys/vm/vm_glue.c, but I don't understand it (especially since
it's used in a function called swapout_procs()). I just wish I knew why
the description is "Guaranteed swapped in time for a process" when it
looks more like a guaranteed swapped-out time.

--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |