SATA time outs

Tue Jun 15 18:53:09 UTC 2010

Casey Scott wrote:

> Since upgrading to 8.0 RELEASE, I continually get these errors:
> 
> ...
> Jun 11 15:24:08 xxxx kernel: ad6: 953869MB <Seagate ST31000340AS SD1A> at ata3-master SATA150
> Jun 11 15:24:08 xxxx kernel: (probe6:ahc0:0:6:0): TEST UNIT READY. CDB: 0 0 0 0 0 0
> Jun 11 15:24:08 xxxx kernel: (probe6:ahc0:0:6:0): CAM Status: SCSI Status Error
> Jun 11 15:24:08 xxxx kernel: (probe6:ahc0:0:6:0): SCSI Status: Check Condition
> Jun 11 15:24:08 xxxx kernel: (probe6:ahc0:0:6:0): UNIT ATTENTION asc:29,2
> Jun 11 15:24:08 xxxx kernel: (probe6:ahc0:0:6:0): SCSI bus reset occurred
> Jun 11 15:24:08 xxxx kernel: (probe6:ahc0:0:6:0): Retrying Command (per Sense Data)
> ...
> 
> 
> I've tried 3 different drives w/ 2 different disk controllers. Anything I
> use as the second drive generates this message on boot, and will
> eventually fail with timeout errors after a couple hours.  The other drive
> on the system, ad4, never displays these symptoms. This isn't new
> hardware, and worked flawlessly until now.
> 
> Any suggestions? Has a bug been introduced into the ata driver?
> 

These drives are known to be failing in large numbers, with various forms of 
defective firmware. The worst is the so-called "self-bricking" feature. Rather 
than replacing them with more of the same, try a different kind of drive. 
A firmware flash might help in cases other than the "self-bricking" 
scenario; once that happens, the drive is done.

Also, I'm very leery of putting "Green" drives in any kind of server 
environment. They spend way too much time parking heads and spinning down. 
Another thing to watch for is using desktop drives with RAID controllers. 
Enterprise drives have a very short error-recovery timeout, designed to keep 
them from being dropped by the RAID controller:

http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?p_faqid=1397

If it is a slightly older motherboard/BIOS, look and see whether these are 
set to "1" in sysctl -a, and maybe try turning them off in /boot/loader.conf 
like the following:

hw.pci.enable_msi="0"
hw.pci.enable_msix="0"
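
If you want to script that check, here is a rough Python sketch. It assumes the 
usual "name: value" output of sysctl for the two tunables above; the function 
names are made up for illustration, and it only prints suggested loader.conf 
lines rather than editing anything.

```python
import subprocess

# The two tunables named above.
TUNABLES = ("hw.pci.enable_msi", "hw.pci.enable_msix")


def parse_sysctl(text):
    """Parse 'name: value' lines, as printed by sysctl, into a dict."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            values[name.strip()] = value.strip()
    return values


def loader_conf_overrides(values):
    """Return a loader.conf line for each tunable currently set to 1."""
    return ['%s="0"' % name for name in TUNABLES if values.get(name) == "1"]


def read_live_sysctl():
    """Query the running system; only meaningful on a FreeBSD box."""
    out = subprocess.run(["sysctl", *TUNABLES],
                         capture_output=True, text=True, check=True)
    return parse_sysctl(out.stdout)
```

On the affected machine, printing each line of 
loader_conf_overrides(read_live_sysctl()) shows what to try adding to 
/boot/loader.conf. Reboot afterwards, since loader tunables are read at boot.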

Run vmstat -i and look for a really outlandish interrupt storm. It can be hard 
to tell, as disk controllers are usually pretty busy here. Newer equipment is 
supposed to be able to operate in a shared interrupt environment, but you can 
try to manually sort things out so that the controller's IRQs aren't shared.
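
To make the vmstat -i check less of an eyeball exercise, something like the 
following Python sketch could help. It assumes the common three-column layout 
(source, total, rate) with shared IRQs listing several devices after the 
"irqN:" label; the function names and the rate limit are hypothetical, so pick 
a limit that makes sense for your own hardware.

```python
def parse_vmstat_i(text):
    """Turn 'vmstat -i' text into a list of dicts, skipping the header."""
    rows = []
    for line in text.splitlines():
        parts = line.rsplit(None, 2)  # last two columns are total and rate
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue  # header, blank line, or unexpected layout
        source, _, devices = parts[0].partition(":")
        rows.append({
            "source": source.strip(),
            "devices": devices.split(),
            "total": int(parts[1]),
            "rate": int(parts[2]),
        })
    return rows


def shared_irqs(rows):
    """IRQ lines with more than one device attached."""
    return [r for r in rows
            if r["source"].startswith("irq") and len(r["devices"]) > 1]


def busy_irqs(rows, rate_limit):
    """IRQ lines whose rate exceeds rate_limit (an arbitrary cutoff)."""
    return [r for r in rows
            if r["source"].startswith("irq") and r["rate"] > rate_limit]
```

Feed it the saved output of vmstat -i. A disk controller that shows up in both 
lists is the first candidate for moving to its own IRQ.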

As far as the ATA driver code goes, if you have recently changed from 7.x to 
8.x, that might be worth considering. If there has been a regression, I'm sure 
a PR would be in order. Just a few random thoughts off the top of my head. 
Personally, the first thing I'd do is dump the Seagates.

-Mike