deadlock or bad disk ? RELENG_8

Mon Jul 19 20:17:01 UTC 2010

On Mon, Jul 19, 2010 at 08:41:40AM -0400, Mike Tancsa wrote:
> At 11:58 PM 7/18/2010, Jeremy Chadwick wrote:
> 
> >So I believe this indicates the message only gets printed during swapin,
> >not swapout.  Meaning it's happening during an I/O read from da0.
> 
> Yes, and from my existing ssh sessions, it would _seem_ no disk IO
> was completing, i.e. I tried a killall -9 watchdogd, which would need
> to load killall from the disk and read whatever it's linked against.
> However, after hitting enter it just blocked trying to read.
> So I would describe it as if the entire system was waiting for that
> "swapper Indefinite wait" to finish, or I could not read anything
> from drives associated with that controller.

Hmm, okay, so it sounds like the controller wedged or arcmsr(4) started
acting oddly.  I would open up a case with Areca on the problem,
*especially* if it happens again.

> >So what's hz?  Well, I want to assume it's kern.hz, which defaults to
> >1000.  1000*20 = 20000, so the timeout would be 20000/1000 = 20 seconds.
> >That's a pretty long time to be waiting for an I/O read to return.
> 
> I think the messages were printing to the serial console faster than
> that, but I could be wrong. If it happens again, I will time it.

Come to think of it, I'm betting you'd get large batches of these
messages if/when it happens.  That VM code isn't something I'm familiar
with (nor is msleep(9)); I just dig around and find what I can.
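
For what it's worth, here is a rough sketch of the hz arithmetic quoted
above.  It assumes the timeout really is expressed as hz*20 ticks, as I
guessed earlier; the function names are made up for illustration and are
not taken from the kernel source.

```python
# Sketch only: assumes the sleep timeout is hz * 20 clock ticks.
# timeout_ticks() and ticks_to_seconds() are hypothetical helper names.

def timeout_ticks(hz: int, multiplier: int = 20) -> int:
    """Timeout in clock ticks, assuming it is written as hz * multiplier."""
    return hz * multiplier


def ticks_to_seconds(ticks: int, hz: int) -> float:
    """Convert a tick count to seconds for a given tick rate (ticks per second)."""
    return ticks / hz


if __name__ == "__main__":
    # 1000 is the kern.hz default mentioned above; 100 is shown for comparison.
    for hz in (100, 1000):
        ticks = timeout_ticks(hz)
        seconds = ticks_to_seconds(ticks, hz)
        print(f"kern.hz={hz}: {ticks} ticks = {seconds:.0f} seconds")
```

If that assumption holds, the wait works out to 20 seconds no matter what
kern.hz is set to, since hz cancels out.  "sysctl kern.hz" will show the
value actually in use on the box.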

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |