Read / write timeouts on SATA disks connected to ICH9
Pieter de Boer
pieter at os3.nl
Sat May 15 07:04:17 UTC 2010
Hi Jeremy,
> Lots to say about all of this.
Thanks for your elaborate reply, it was very useful to see smartctl
output explained a bit :) I still think there's something else at play
besides disk failure. I've checked one of the drives I replaced earlier,
and its SMART output shows none of the errors you described, although it
did drop out of the mirror multiple times during its lifetime.
> The WD Caviar Black drives have a useful feature called TLER -- it's
> disabled by default, for reasons which I don't want to get into here --
> which can force the drive to internally give up after X seconds (it's
> user-selectable) when dealing with such remapping/errors. The idea is
> to keep the drive from being deemed dead from the OS/controller's point
> of view. I believe Seagate, Hitachi, or Samsung (I forget which) have
> this feature as well, but it's not called TLER.
I've read about this feature, but didn't have the time to try to get it
turned on (iirc you'd need a specific Western Digital DOS-based util or
something).
> If you want to find out the exact LBA that has the problem (there may be
> more than one), I can step you through performing a selective LBA scan
> using SMART, since this model of disk does support such. It's easy to
> do, easy to understand the results, and can be done while the drive is
> in operation (though I would recommend trying to keep disk I/O to a
> minimum during this test). Let me know.
At one point I had read errors from specific LBAs on ad4. Using dd I was
able to pinpoint those to single sectors. Overwriting those sectors with
the data from ad6 made them readable again. What is odd is that the
'remapped sector' count of ad4 is 0.
Still, I'd like to know how to perform such a scan.
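For reference, a minimal sketch of the two checks discussed here: reading
a single sector with dd, and a SMART selective self-test over an LBA
range. Jeremy's exact procedure is not given in this message, so this
assumes smartmontools' "-t select,FIRST-LAST" test type and 512-byte
logical sectors. The helper names and the LBA values are made up for
illustration, and the helpers only print the commands rather than
running them.

```shell
#!/bin/sh
# Hypothetical helpers: print (do not run) the commands for checking
# a sector or an LBA range. Assumes 512-byte logical sectors.

SECTOR_SIZE=512

# Print a dd command that reads exactly one sector at the given LBA.
read_sector_cmd() {
    dev=$1; lba=$2
    echo "dd if=/dev/$dev of=/dev/null bs=$SECTOR_SIZE skip=$lba count=1"
}

# Print a smartctl command that starts a selective self-test
# covering the LBA range first..last (inclusive).
selective_test_cmd() {
    dev=$1; first=$2; last=$3
    echo "smartctl -t select,$first-$last /dev/$dev"
}

# Example with made-up LBAs (not taken from the actual error logs):
read_sector_cmd ad4 123456
selective_test_cmd ad4 123000 124000
# The selective self-test log can then be read back with:
echo "smartctl -l selective /dev/ad4"
```

If dd returns an I/O error for that one sector, the LBA is unreadable.
The selective test reports its progress and result in the selective
self-test log.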
> Finally, your vmstat -i output:
>
>> # vmstat -i
>> interrupt total rate
>> irq23: atapci0 371021299 10423
>
> Good to know there's no IRQ sharing going on, but what does worry me is
> the interrupt rate (10K interrupts/second). That seems *extremely*
> high, but it also depends on what kind of disk I/O is happening on this
> system -- especially since you have 2 disks attached to the same
> controller.
The rate is above 10000 even at idle. During a gmirror sync from ad6 to
ad4, it's about 10670.
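Since the "rate" column of vmstat -i is an average over the whole
uptime, one way to see the current rate is to take two samples of the
"total" column a known number of seconds apart and divide the
difference by the interval. A minimal sketch follows. The helper names
are hypothetical, and the second sample is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: interrupts per second between two samples of the
# "total" column of vmstat -i, taken a known number of seconds apart.
irq_rate() {
    first=$1; second=$2; seconds=$3
    echo $(( (second - first) / seconds ))
}

# Extract the total for one interrupt source from vmstat -i style
# output on stdin. The source name is matched as a substring, and the
# total is assumed to be the second-to-last field of the line.
irq_total() {
    awk -v src="$1" 'index($0, src) { print $(NF-1) }'
}

# Example using the line quoted above as the first sample; the second
# sample (10 seconds later) is invented.
t1=$(echo "irq23: atapci0 371021299 10423" | irq_total atapci0)
t2=371121299
irq_rate "$t1" "$t2" 10   # prints 10000
```

On a live system the two samples would come from running
"vmstat -i | irq_total atapci0" before and after a "sleep 10".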
> "iostat 1", "iostat -x 1", or "gstat" might come in handy to tell you
> what kind of disk I/O is going on. If actual I/O is very little, then
> something weird is going on with regards to the number of interrupts
> being seen on IRQ 23. mav@ might have some ideas, otherwise I'd
> recommend rebooting the machine and seeing if the number drops. If so,
> it may be that the OS has some sort of bug where a disk timing out or
> falling off the bus causes interrupt problems. (It's too bad you don't
> have AHCI on this system. It handles stuff like this much more
> elegantly...)
If neither mav@ nor anyone else has further insight into the interrupt
rate, I guess a reboot will at least show whether it's persistent or
related to the errors. I'll try to do a reboot when convenient (probably
Sunday morning or something).
Thanks,
Pieter