Read / write timeouts on SATA disks connected to ICH9

Sat May 15 07:04:17 UTC 2010

Hi Jeremy,

> Lots to say about all of this.

Thanks for your elaborate reply, it was very useful to see smartctl 
output explained a bit :) I still think there's something else in play 
besides disk failure. I've checked one of the drives I replaced earlier,
but that one doesn't show any of the errors you described in its SMART
output, although it did drop out of the mirror multiple times during
its lifetime.

> The WD Caviar Black drives have a useful feature called TLER -- it's
> disabled by default, for reasons which I don't want to get into here --
> which can force the drive to internally give up after X seconds (it's
> user-selectable) when dealing with such remapping/errors.  The idea is
> to keep the drive from being deemed dead from the OS/controller's point
> of view.  I believe Seagate, Hitachi, or Samsung (I forget which) have
> this feature as well, but it's not called TLER.
I've read about this feature, but didn't have the time to try to get it 
turned on (iirc you'd need a specific Western Digital DOS-based util or 
something).

> If you want to find out the exact LBA that has the problem (there may be
> more than one), I can step you through performing a selective LBA scan
> using SMART, since this model of disk does support such.  It's easy to
> do, easy to understand the results, and can be done while the drive is
> in operation (though I would recommend trying to keep disk I/O to a
> minimum during this test).  Let me know.
At one point I had read errors from specific LBAs on ad4.
Using dd I was able to pinpoint those to single sectors. Overwriting
those sectors with what was on ad6 made them readable again. What is odd
is that the 'remapped sector' count of ad4 is 0.
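
For reference, the dd approach can be sketched roughly as below. This is a
minimal sketch and not the exact commands I ran: scan_lbas is a made-up
helper name, and it assumes 512-byte sectors. It only reads from the device.

```shell
#!/bin/sh
# scan_lbas DEVICE FIRST LAST
# Read every sector in the range FIRST..LAST one at a time and print the
# LBA of each sector that dd fails to read.
# Assumption: 512-byte sectors (hypothetical helper, read-only).
scan_lbas() {
    dev=$1
    lba=$2
    last=$3
    while [ "$lba" -le "$last" ]; do
        # One sector per dd call, so a failure maps to exactly one LBA.
        if ! dd if="$dev" of=/dev/null bs=512 skip="$lba" count=1 2>/dev/null; then
            echo "$lba"
        fi
        lba=$((lba + 1))
    done
}
```

Something like "scan_lbas /dev/ad4 FIRST LAST", with the range taken from
around the LBA in the kernel's error message, then lists the unreadable
sectors. Reading one sector per call is slow, so it only makes sense for a
small range.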

Still, I'd like to know how to perform such a scan.

> Finally, your vmstat -i output:
> 
>> # vmstat -i
>> interrupt                          total       rate
>> irq23: atapci0                 371021299      10423
> 
> Good to know there's no IRQ sharing going on, but what does worry me is
> the interrupt rate (10K interrupts/second).  That seems *extremely*
> high, but it also depends on what kind of disk I/O is happening on this
> system -- especially since you have 2 disks attached to the same
> controller.
The rate is above 10000 even at idle. During a gmirror sync from
ad6 to ad4, it's about 10670.
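
One caveat on these numbers: as far as I know, the rate column of
vmstat -i is the total divided by the uptime, so it is an average since
boot and changes only slowly. A minimal sketch for getting the current
rate instead, by sampling the total twice. The helper names irq_total and
irq_rate are made up for illustration; "irq23:" is the source name from
the output above.

```shell
#!/bin/sh
# irq_total NAME: read "vmstat -i" output on stdin and print the total
# column for the interrupt source NAME (e.g. "irq23:").
irq_total() {
    awk -v name="$1" '$1 == name { print $(NF - 1) }'
}

# irq_rate NAME SECONDS: sample the total twice, SECONDS apart, and print
# the average number of interrupts per second over that interval only.
irq_rate() {
    first=$(vmstat -i | irq_total "$1")
    sleep "$2"
    second=$(vmstat -i | irq_total "$1")
    echo $(( (second - first) / $2 ))
}
```

Running "irq_rate irq23: 10" once at idle and once during a sync should
show whether the rate really stays that high without any disk I/O.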

> "iostat 1", "iostat -x 1", or "gstat" might come in handy to tell you
> what kind of disk I/O is going on.  If actual I/O is very little, then
> something weird is going on with regards to the number of interrupts
> being seen on IRQ 23.  mav@ might have some ideas, otherwise I'd
> recommend rebooting the machine and seeing if the number drops.  If so,
> it may be that the OS has some sort of bug where a disk timing out or
> falling off the bus causes interrupt problems.  (It's too bad you don't
> have AHCI on this system.  It handles stuff like this much more
> elegantly...)
If mav@ or anyone else doesn't have further insight into the interrupt
rate, I guess a reboot will at least show whether it's persistent or
related to the errors. I'll try to do a reboot when convenient (probably
Sunday morning or something).

Thanks,
Pieter