deadlock or bad disk ? RELENG_8

Mon Jul 19 03:58:47 UTC 2010

On Sun, Jul 18, 2010 at 08:34:24PM -0700, Jeremy Chadwick wrote:
> On Sun, Jul 18, 2010 at 11:01:03PM -0400, Mike Tancsa wrote:
> > I do track some basic mem stats via rrd.  Looking at the graphs up to
> > that period, nothing unusual was happening
> 
> sysctl vm.stats.vm | grep swap
> 
> Here's another post basically reiterating the same thing: that the
> controller the swap slice is on (in your case a 4-disk RAID array) is
> basically taking too long to respond.
> 
> http://groups.google.com/group/mailing.freebsd.stable/browse_thread/thread/2e7faeeaca719c52/cdcd4601ce1b90c5
> 
> I have no idea where the timeout values are in the kernel.  I do see
> these two entries in sysctl that look to be of interest though.  You
> might try adjusting these (not sure if they're sysctls or loader.conf
> tunables only):
> 
> vm.swap_idle_threshold2: 10
> vm.swap_idle_threshold1: 2
> 
> Descriptions:
> 
> vm.swap_idle_threshold2: Time before a process will be swapped out
> vm.swap_idle_threshold1: Guaranteed swapped in time for a process
> 
> I want to point out that the actual amount of data being swapped out is
> fairly small -- note the "size" fields in the swap_pager kernel messages.
> There doesn't necessarily have to be a shortage of memory to cause a
> swapout (case in point, see above).

I took a look at the RELENG_8 code responsible for printing this
message: src/sys/vm/swap_pager.c

/*
 * SWAP_PAGER_GETPAGES() - bring pages in from swap
 *
 *      Attempt to retrieve (m, count) pages from backing store, but make
 *      sure we retrieve at least m[reqpage].  We try to load in as large
 *      a chunk surrounding m[reqpage] as is contiguous in swap and which
 *      belongs to the same object.
 *
 *      The code is designed for asynchronous operation and
 *      immediate-notification of 'reqpage' but tends not to be
 *      used that way.  Please do not optimize-out this algorithmic
 *      feature, I intend to improve on it in the future.
 *
 *      The parent has a single vm_object_pip_add() reference prior to
 *      calling us and we should return with the same.
 *
 *      The parent has BUSY'd the pages.  We should return with 'm'
 *      left busy, but the others adjusted.
 */
static int
swap_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage)
{
....
        /*
         * wait for the page we want to complete.  VPO_SWAPINPROG is always
         * cleared on completion.  If an I/O error occurs, SWAPBLK_NONE
         * is set in the meta-data.
         */
        VM_OBJECT_LOCK(object);
        while ((mreq->oflags & VPO_SWAPINPROG) != 0) {
                mreq->oflags |= VPO_WANTED;
                PCPU_INC(cnt.v_intrans);
                if (msleep(mreq, VM_OBJECT_MTX(object), PSWP, "swread", hz*20)) {
                        printf(
"swap_pager: indefinite wait buffer: bufobj: %p, blkno: %jd, size: %ld\n",
                            bp->b_bufobj, (intmax_t)bp->b_blkno, bp->b_bcount);
                }
        }

So I believe this indicates the message only gets printed during swapin,
not swapout, which means it's happening during an I/O read from da0.
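
To make the control flow concrete, here is a minimal userland sketch of
that loop.  It is not the kernel code: struct fake_page, fake_msleep()
and wait_for_swapin() are names I made up for illustration, and the stub
only imitates what msleep(9) documents (EWOULDBLOCK on timeout, 0 on a
normal wakeup).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the page state; not a kernel structure. */
struct fake_page {
	int swapin_in_progress;		/* plays the role of VPO_SWAPINPROG */
	int timeouts_until_done;	/* simulated timeouts before the read completes */
};

/*
 * Hypothetical stand-in for msleep(9): returns EWOULDBLOCK when the
 * simulated timeout expires, or 0 when the simulated I/O completes
 * and wakes the sleeper.
 */
int
fake_msleep(struct fake_page *p)
{
	if (p->timeouts_until_done > 0) {
		p->timeouts_until_done--;
		return (EWOULDBLOCK);
	}
	p->swapin_in_progress = 0;	/* completion clears the flag */
	return (0);
}

/* Same shape as the loop above; returns how many warnings were printed. */
int
wait_for_swapin(struct fake_page *p)
{
	int warnings = 0;

	assert(p != NULL);
	while (p->swapin_in_progress != 0) {
		if (fake_msleep(p) != 0) {
			printf("swap_pager: indefinite wait buffer (simulated)\n");
			warnings++;
		}
	}
	return (warnings);
}
```

The point of the sketch: the warning is printed once per expired timeout
and the loop then goes straight back to sleep, so a read that stays
outstanding across several timeouts would log the message several times
before it finally completes.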

Reading msleep(9) provides us some details about what "swread"
correlates with (now I know where that column in ps/top comes from), and
the timeout value (hz*20):

  The parameter wmesg is a string describing the sleep condition for tools
  like ps(1).  Due to the limited space of those programs to display
  arbitrary strings, this message should not be longer than 6 characters.

  The parameter timo specifies a timeout for the sleep.  If timo is not 0,
  then the thread will sleep for at most timo / hz seconds.  If the timeout
  expires, then the sleep function will return EWOULDBLOCK.

So what's hz?  Well, I want to assume it's kern.hz, which defaults to
1000.  1000*20 = 20000, so the timeout would be 20000/1000 = 20 seconds.
That's a pretty long time to be waiting for an I/O read to return.
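
As a sanity check on that arithmetic, here is a small sketch that follows
the msleep(9) wording ("at most timo / hz seconds").  The helper name
timo_to_seconds() is my own, not a kernel function, and hz = 1000 is the
assumed default rather than a value read from a running system.

```c
#include <assert.h>

/*
 * Convert an msleep(9) timeout in ticks to whole seconds, per the man
 * page's "timo / hz seconds".  Hypothetical helper, not a kernel function.
 */
int
timo_to_seconds(int timo, int hz)
{
	assert(hz > 0);		/* hz is the tick rate; zero would divide by zero */
	return (timo / hz);
}
```

timo_to_seconds(1000 * 20, 1000) gives 20.  Because the timeout is
written as hz*20, the hz cancels out, so the result should be 20 seconds
whatever kern.hz is set to (for example, 100 * 20 ticks at hz = 100 is
also 20 seconds).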

So does vm.swap_idle_threshold1 play a role?  I doubt it.  The code is
in src/sys/vm/vm_glue.c, but I don't understand it (especially since
it's used in a function called swapout_procs()).  I just wish I knew why
the description is "Guaranteed swapped in time for a process" when it
looks more like a guaranteed swapped-out time.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |