Read / write timeouts on SATA disks connected to ICH9
Pieter de Boer
pieter at os3.nl
Sat May 15 20:39:18 UTC 2010
<SNIP: disk without errors timing out>
> That could be caused by a multitude of other known things. For
> example, some Western Digital "Green" drives (including the
> Enterprise class ones) are known to perform head parking/offloading
> excessively, which could result in the drive spending more time doing
> that than actually serving overall I/O requests. There are some
> other reports of Samsung Spinpoint drives experiencing other issues
> (I've since forgotten and would have to dig up the threads).
> If you could provide full SMART stats for that drive, it might help.
Attached is the SMART output of both disks I replaced about a month ago.
It appears I replaced perfectly fine drives with the current disks, which
do have errors ;( One of the old disks is in a USB enclosure now, so it
shows up as 'da0'.
<SNIP: enabling TLER>
> Yes, it's a DOS-based utility (like most firmware upgraders these
> days). I can provide it if you'd like. I've been meaning to spend
> some time trying to reverse-engineer the binary to figure out what
> ATA commands it sends to the disk to toggle/adjust the feature (so
> that one could do it in real-time rather than have to boot into DOS).
I'd like to try that tool. Since the old WD disks are now lying around
at home, I have some time to get a DOS boot working to try it out. A
FreeBSD-implementation of the WD tool and possibly other brands would be
really useful indeed.
>> At a certain point in time I had read errors from specific LBAs on
>> ad4. Using dd I was able to pinpoint those to single sectors.
> This isn't very effective (dd will read large chunks/amounts of data
> (read: multiple LBAs) from the underlying disk at once, rather than
> the disk itself performing a per-LBA test). My opinion is that the
> "dd method" should only be used on drives which don't support
> selective LBA scanning via SMART.
Will dd read multiple LBAs even when using 'bs=512'? The process I used
was to read with bs=8192 first, then zoom in on the LBAs mentioned in
the errors in dmesg with bs=512 to find the actual LBA.
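
A minimal sketch of the zoom-in step described above, assuming 512-byte
logical sectors and a device node (or image file) that can be read at
sector-aligned offsets. The function name and its parameters are made up
for illustration; it issues one 512-byte read per LBA and collects the
LBAs whose read fails. Whether the drive or driver reads more than one
sector internally is outside what this code controls.

```python
import os

SECTOR_SIZE = 512  # assumption: logical sector size, matching bs=512


def find_unreadable_sectors(path, first_lba, count, sector_size=SECTOR_SIZE):
    """Read `count` sectors one at a time, starting at `first_lba`.

    Returns the list of LBAs for which the read raised an I/O error.
    """
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for lba in range(first_lba, first_lba + count):
            try:
                # One sector-sized, sector-aligned read per LBA.
                data = os.pread(fd, sector_size, lba * sector_size)
            except OSError:
                bad.append(lba)
                continue
            if len(data) < sector_size:
                # Short read: we ran past the end of the device or file.
                break
    finally:
        os.close(fd)
    return bad
```

For example, find_unreadable_sectors("/dev/ad4", 123456, 16) would probe
16 sectors starting at the (made-up) LBA 123456; it needs read permission
on the device.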
A selective scan on ad4 did not reveal any errors today: it 'completed
without error'. On ad6 it's a whole lot slower; at the time of writing
it's at 2/3.
> All HD vendors have their own quirks/ordeals right now. You
> basically just have to go with one who works well for you, then if
> things start going downhill, switch to another. None of them are
I figured as much. What irritates though is that I've had consistent
problems with 4 disks in this specific system, but not (such) issues
with any other disk in other systems I've had. I generally replace disks
when I grow out of them, not because they break down.
> What this indicates to me is that if a disk falls off the bus on an
> ICH9 controller in Enhanced (non-AHCI) mode, FreeBSD starts seeing an
> absurd number of interrupts generated from the ICH9. My guess is
> FreeBSD isn't doing something correctly with the controller when this
> happens; maybe certain commands aren't being sent back to the
> controller or handling of certain events is being done improperly
> when it comes to ICH9 (or possibly earlier ICH revisions too). This
> should be *very* easy to reproduce.
Unfortunately I'm not really in a position to help reproduce this or
test possible fixes; downtime is currently very unwelcome. Although
one of the previous disks indeed fell off the bus entirely (couldn't get
it back with atacontrol either), that hasn't happened again so far. I
only see timeouts (and a few days ago read errors on ad4) which gmirror
doesn't like. I guess those aren't that simple to reproduce (apart from
on my system ;).
> If you see any of your disks on the ICH9 controller fall off the bus
> or report ATA errors (doesn't matter what kind), please make note of
> the timestamp (should be in the kernel log), and ASAP run "smartctl
> -a" on the disk. You should compare attributes before and after the
> You might also want to consider using smartd, which can log SMART
> attribute changes on its own. Note that you might have to tune the
> arguments in smartd.conf to ignore some attributes which fluctuate
> naturally (such as drive temperature and seek error rate).
I've configured smartd to poll both disks every 5 minutes. I -think- the
issues happen specifically under load: the periodic scripts of the host
and its 4 jails appear to trigger it sometimes. At that time I'm
normally trying to get some sleep, so smartd will have to do for now.
I'll run "smartctl -a" ASAP anyway, though.
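
The before/after comparison of SMART attributes suggested above could be
sketched roughly like this. It assumes the usual ten-column attribute
table that smartctl prints (header line starting with "ID#", raw value
in the last column); both function names are hypothetical, and the
ignore set is only an example of skipping attributes that fluctuate
naturally.

```python
def parse_smart_attributes(text):
    """Parse a smartctl-style attribute table into {name: raw_value}."""
    attrs = {}
    in_table = False
    for line in text.splitlines():
        if line.startswith("ID#"):
            # Header row: attribute rows follow.
            in_table = True
            continue
        fields = line.split()
        if in_table and not fields:
            # A blank line ends the attribute table.
            in_table = False
        if not in_table:
            continue
        if len(fields) >= 10 and fields[0].isdigit():
            # Column 2 is the attribute name; columns 10+ are the raw value.
            attrs[fields[1]] = " ".join(fields[9:])
    return attrs


def diff_attributes(before, after, ignore=()):
    """Return {name: (old_raw, new_raw)} for attributes whose raw value changed."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        if name in ignore:
            continue
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes
```

Saving the smartctl output to a file before and after an event, parsing
both, and calling diff_attributes(old, new, ignore={"Temperature_Celsius"})
would then list only the attributes that moved.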
More information about the freebsd-stable mailing list