[long] ATA timeout problems on -STABLE

Tue Sep 21 08:10:30 PDT 2004

Hi,

thanks for your response.

On Mon, Sep 20, 2004 at 11:35:52AM -0400, Paul Mather wrote:
> FWIW, when I would get those errors on my 4-STABLE system (fallback to
> PIO mode; hard error reading fbsn) it did turn out to be a drive problem
> (and with a Maxtor drive, too).  I was none the wiser until I happened
> to reboot the machine after a security advisory upgrade and was
> surprised to see the boot halted because the S.M.A.R.T. status of the
> drive indicated it was failing!  (Prior to that I'd just been assuming
> it was some kind of OS/peak load problem and had been using atacontrol
> to change the mode back to UDMA100 when it fell back to PIO.)

Interesting. I did the same in the rare cases where the drive would actually
read/write the block in PIO mode. Most of the time, though, the ata subsystem
would just give up on the drive.

> So, I would suggest running smartctl from the sysutils/smartmontools
> port to see what the SMART status of the drives looks like; in
> particular, whether any of the "worst" values have dropped anywhere
> close to the failure threshold value.  (I have noticed with smartctl
> that some attributes go down and then back up.  I have a system, in
> particular, where the Raw_Read_Error_Rate attribute sometimes drops down
> a few points under heavy disk load [e.g., during the nightly backup or
> cvsup], but increases again after the load has lifted.)
> 
> Unfortunately, you're running 4.x, so you might have to make a 5.x
> FreeSBIE CD with the smartmontools port included because it requires
> ATAng from 5.x to run.

That's a great suggestion that hadn't crossed my mind.

As the box had another error just this morning, I had to take it offline to
rebuild the RAID array. I used that downtime to put the four 120G disks (which
definitely generate the most errors) in a 5.x system with the smartmontools
port installed.

Logs of smartctl -a are up at

http://sandcat.nl/~stijn/freebsd/ataproblem/

I don't have a clue how to interpret all these numbers, though. A little
googling turns up posts saying that UNC errors are Bad(TM); however, that
would indicate that I do indeed have 3(!) failing drives on my hands...
Although that is certainly possible (they are about 1-2 years old and in
continuous use), it does sound improbable.
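
For what it's worth, the check suggested above (seeing whether any "worst"
value has dropped close to its failure threshold) could be sketched roughly as
follows. This is only a minimal sketch under assumptions: the sample table is
made up for illustration and is NOT taken from my logs, the column layout is
assumed to match what 'smartctl -A' prints, the function names are my own, and
the margin of 10 points is an arbitrary choice rather than anything
smartmontools defines.

```python
import re

# Hypothetical sample in the layout printed by 'smartctl -A'. The numbers
# are invented for illustration and are not from any real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   060   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   063    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       15800
194 Temperature_Celsius     0x0022   040   030   042    Old_age   Always   In_the_past 38
"""

# An attribute row starts with: ID, name, hex flag, VALUE, WORST, THRESH.
# The header line does not match because it does not start with a number.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s")


def parse_attributes(text):
    """Return a list of (id, name, value, worst, thresh) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attr_id, name, value, worst, thresh = m.groups()
            rows.append((int(attr_id), name, int(value), int(worst), int(thresh)))
    return rows


def near_threshold(rows, margin=10):
    """Return names of attributes whose WORST is within `margin` of THRESH.

    Attributes with a threshold of 0 are skipped, on the assumption that
    such attributes are informational and never trigger a failure.
    """
    flagged = []
    for _attr_id, name, _value, worst, thresh in rows:
        if thresh == 0:
            continue
        if worst - thresh <= margin:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    for name in near_threshold(parse_attributes(SAMPLE)):
        print("worth a closer look:", name)
```

In practice the text would come from a saved 'smartctl -A' log instead of
SAMPLE. Anything this flags is only a hint that an attribute deserves a closer
look, not proof that the drive is failing.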

> You can also use smartctl to run online and offline self-tests.

I didn't have time to run the long tests, but all 4 drives indicated a
'passed' status for the online 'smartctl -t short' test. I take it the
long tests give better results? If so I'll take the time to run them
on the next rebuild downtime.

But anyway, if the drives are dying, I'll accept that. I just don't know how
to determine that for sure. Do you have any pointers to further reading on
SMART statistics?

--Stijn

-- 
An Orb is for life, not just for Christmas.