Read / write timeouts on SATA disks connected to ICH9

Fri May 14 22:42:38 UTC 2010

On Fri, May 14, 2010 at 11:09:28PM +0200, Pieter de Boer wrote:
> The ad4 SMART output is showing errors, as this disk is indeed
> broken now. It wasn't before and it is a replacement of another disk
> that wasn't broken either. Grmbl, I now see reallocated sectors on
> ad6 as well, in the smartctl output. So both disks look wonky;
> although afaik that's not the main issue here.

Lots to say about all of this.

Focusing on drive ad4 (Western Digital):

The disk has 1 uncorrected sector (Attribute 198).  This means the drive
tried to remap it and was not successful.  This could have happened any
time during the lifetime of the drive.  There are no pending sector
reallocations (Attribute 197) (meaning there aren't others which are bad
which the drive is waiting to attempt remapping for), and there are no
remapped sectors (Attribute 5).  There have been no successful
reallocation attempts during the drive's lifetime (Attribute 196).

In general, I would say this is acceptable.  If Attribute 198 was
higher, or you had other pending sectors which needed to be remapped,
I'd say replace the disk.

UDMA/CRC error count (Attribute 199) is zero.  That's good -- it means
that most likely cabling issues can be ruled out, since the attribute
tracks the number of communication errors between the controller and the
disk PCB.

Drive temperature looks good, so nothing to worry about there.

The drive itself has detected numerous error conditions in the SMART
error log during its lifetime -- a total of 48, but SMART only lists the
most recent 5.  The drive has been online for a total of 827 hours
(Attribute 9), which we can use to determine how recent the drive
experienced said errors.  Let's examine the first 3:

> Error 48 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
>   40 51 00 9d 84 0e e0  Error: UNC at LBA = 0x000e849d = 951453
>   c8 00 20 00 84 0e 00 00      00:45:18.204  READ DMA
> 
> Error 47 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
>   40 51 00 0c 9d 0e e0  Error: UNC at LBA = 0x000e9d0c = 957708
>   c8 00 80 00 9b 0e 00 00      00:03:08.605  READ DMA
> 
> Error 46 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
>   40 51 00 9d 84 0e e0  Error: UNC at LBA = 0x000e849d = 951453
>   c8 00 80 80 82 0e 00 00      00:03:05.176  READ DMA

Okay, it's probably safe to assume these are all signs of the
uncorrected sector.  When a drive attempts a LBA remap -- which in this
case it did, but failed -- it can spend quite a bit of time doing that;
in some cases minutes, not seconds.

The drive essentially "locks up" during this time (from the perspective
of the SATA controller) -- it's literally spending all of its time
trying to read and re-read the LBA/sector in different ways, hoping to
get the data out of it (and/or correct it with ECC) so that it can be
written to a spare block and then internally the bad LBA won't ever be
used again.  What the OS ends up seeing in this situation is disk
timeouts.  This is completely normal.

The WD Caviar Black drives have a useful feature called TLER -- it's
disabled by default, for reasons which I don't want to get into here --
which can force the drive to internally give up after X seconds (it's
user-selectable) when dealing with such remapping/errors.  The idea is
to keep the drive from being deemed dead from the OS/controller's point
of view.  I believe Seagate, Hitachi, or Samsung (I forget which) have
this feature as well, but it's not called TLER.

Anyway, so this is probably the cause of one detachment/timeout you've
seen FreeBSD report.  Let's move on to the 2 remaining errors:

> Error 45 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
>   40 51 08 20 47 6c e0  Error: UNC at LBA = 0x006c4720 = 7096096
>   c4 ff 08 ff 46 6c 00 00      00:01:09.459  READ MULTIPLE
> 
> Error 44 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
>   40 51 08 21 8e 67 e0  Error: UNC at LBA = 0x00678e21 = 6786593
>   c4 ff 04 3f 2f 00 00 00      00:01:00.724  READ MULTIPLE

These two happened around the same time (10 seconds within one another).
I'm under the impression that these are *probably* the result of the
above uncorrected sector issue, but I'm not 100% certain.  Here's why I
think that:

- The errors occurred within the same hour mark (817) as the previous 3
  errors,
- The errors happened only 2 minutes prior to the preceding 3,
- The drive was in the process of executing READ MULTIPLE (cmd 0xc4),
  which tells the disk to read multiple logical sectors within 1 pass.

The ATA-8 specification states that READ MULTIPLE is a PIO command.  I'm
not sure how/why FreeBSD would be submitting this to a disk unless the
communication protocol had been downgraded from DMA to PIO.

mav@ might have some insights on this, as well as how to decode some of
the SMART error data shown.  It looks like the 48-bit read input block
is written in reverse order (word 5 to word 0).

If you want to find out the exact LBA that has the problem (there may be
more than one), I can step you through performing a selective LBA scan
using SMART, since this model of disk does support such.  It's easy to
do, easy to understand the results, and can be done while the drive is
in operation (though I would recommend trying to keep disk I/O to a
minimum during this test).  Let me know.

Focusing on drive ad6 (Seagate):

This drive has 0 uncorrected sectors (Attribute 198); however, I don't
know what Attribute 187 represents on this model of drive.  The disk has
successfully remapped 50 sectors (Attribute 5) during its lifetime.
There are no pending sector reallocations (Attribute 197).

Drive temperature looks good.

I would recommend you consider replacing this drive "down the road"
sometime, since it does indicate 50 reallocated sectors.  One or two is
fine, but 50 is a bit much.  At least the drive was able to re-read them
and reallocate the data successfully.

It's very possible that said remappings caused the drive to time out
during some I/O operations (from the perspective of the controller and
OS); I went over this with regards to ad4.

Finally, your vmstat -i output:

> # vmstat -i
> interrupt                          total       rate
> irq23: atapci0                 371021299      10423

Good to know there's no IRQ sharing going on, but what does worry me is
the interrupt rate (10K interrupts/second).  That seems *extremely*
high, but it also depends on what kind of disk I/O is happening on this
system -- especially since you have 2 disks attached to the same
controller.

"iostat 1", "iostat -x 1", or "gstat" might come in handy to tell you
what kind of disk I/O is going on.  If actual I/O is very little, then
something weird is going on with regards to the number of interrupts
being seen on IRQ 23.  mav@ might have some ideas, otherwise I'd
recommend rebooting the machine and seeing if the number drops.  If so,
it may be that the OS has some sort of bug where a disk timing out or
falling off the bus causes interrupt problems.  (It's too bad you don't
have AHCI on this system.  It handles stuff like this much more
elegantly...)

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |