7.2-RELEASE-p4, IO errors & RAID1 failure

Jeremy Chadwick freebsd at jdc.parodius.com
Fri Jun 25 07:16:48 UTC 2010

On Thu, Jun 24, 2010 at 05:22:41PM -0500, Adam Vande More wrote:
> Haven't followed the entire thread, but wanted to point out something
> important to remember. SMART is not a reliable indicator of failure.
> It's certainly better than listening to it but it picks up less than
> 1/2 of drive failures. Google released a study of their disks in data
> centers a few years ago that was fairly in depth look into drive
> failure rate. You might find it interesting.

Anyone who relies on "overall SMART health" to determine the status of a
drive will be disappointed when they see what thresholds vendors are
choosing for most attributes.  But that's as far as I'll go when it
comes to agreeing with the "SMART is not a reliable indicator of X"
argument.  Because of the thresholds vendors choose, it's best to use
SMART as an indicator of overall drive health *at that moment* and not
as a predictive tool (though I have seen it work predictively,
especially on SCSI disks; I'd be more than happy to provide some
examples if need be).

Google's study was half-assed in some regards (I remember reading it and
being left with more questions than answers), and I'm also aware of
folks like Scott Moulton who insist SMART is an unreliable method of
analysis.  I like Scott's work in general, but I disagree with his view
of SMART.  You can see some of his presentations on Youtube; look up
"Shmoocon 2010 DIY Hard Drive Diagnostics".

We've already done the SMART analysis for this issue -- the disk isn't
showing any signs of problems from a SMART perspective.  Meaning,
there's no indication of bad or reallocated sectors, or any other signs
of internal drive failure.
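
For anyone who wants to script this kind of check, here is a minimal
Python sketch. It assumes the attribute table has the column layout
printed by smartmontools' `smartctl -A`. The sample text and the names
parse_attributes and sector_problems are hypothetical, and the values
are made up for illustration, not taken from the disk in this thread.

```python
import re

# Hypothetical sample in the `smartctl -A` table layout; values are made up.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       52
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       50
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

# Attributes that point at bad or reallocated sectors.
SECTOR_ATTRS = (5, 197, 198)


def parse_attributes(text):
    """Return {attribute_id: (name, raw_value)} for each table row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # Some raw values carry trailing text; keep the leading integer.
        match = re.match(r"\d+", fields[9])
        if match:
            attrs[int(fields[0])] = (fields[1], int(match.group()))
    return attrs


def sector_problems(attrs):
    """Return the sector-related attributes whose raw value is non-zero."""
    return {i: attrs[i] for i in SECTOR_ATTRS if i in attrs and attrs[i][1] > 0}


if __name__ == "__main__":
    problems = sector_problems(parse_attributes(SAMPLE))
    print(problems or "no sector-related attributes are non-zero")
```

An empty result only says these particular counters are zero; it says
nothing about the failure modes SMART can't see.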

There are a lot of things SMART can't catch -- drive PCB flakiness
(appears as literally anything, take your pick), drive cache going bad
(usually shows itself as abysmal performance), or power-related problems
(though SMART can help catch these if you watch Attributes 4 and 12,
assuming the drive is losing power entirely; if there's dirty power or
excessive ripple, or internal drive power circuitry problems, these can
appear as practically anything).
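
Watching Attributes 4 and 12 (Start_Stop_Count and Power_Cycle_Count in
smartmontools' naming) amounts to comparing their raw values at two
points in time. A rough Python sketch follows; the function names, the
snapshot format, and the planned_power_cycles parameter are all
hypothetical, and it assumes the drive is not configured to spin down
when idle (which would also bump Attribute 4).

```python
def power_event_deltas(before, after):
    """Return how much Attributes 4 and 12 grew between two snapshots.

    before/after: {attribute_id: raw_value} taken at two different times.
    """
    return {attr: after[attr] - before[attr] for attr in (4, 12)}


def looks_like_power_loss(before, after, planned_power_cycles=0):
    """True if the drive power-cycled more often than the admin intended."""
    deltas = power_event_deltas(before, after)
    return deltas[12] > planned_power_cycles


if __name__ == "__main__":
    # Made-up snapshots: the counters grew although nobody rebooted the box.
    before = {4: 52, 12: 50}
    after = {4: 55, 12: 53}
    print(power_event_deltas(before, after))
    print(looks_like_power_loss(before, after, planned_power_cycles=0))
```

This only catches the case where the drive loses power entirely; dirty
power or ripple will not show up in these counters.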

All in all, replacing a drive is a completely reasonable action when
there's evidence confirming the need for its replacement.  I don't like
replacing hardware when there's no indication replacing it will
necessarily fix the problem; I'd rather understand the problem.

Matthew, if you're able to take the system down for 2-3 hours, I would
recommend downloading Western Digital's Data Lifeguard Diagnostics
software (for DOS; you'll need a CD burner to burn the ISO) and running
that on your drive.  If that fails on a Long/Extended test, yep, replace
the disk.  Said utility tests a lot more than just SMART.

If it passes the test, then we're back at square one, and you can try
replacing the disk if you'd like (then boot from the 2nd disk in the
RAID-1 array).  My concern is that replacing it isn't going to fix
anything (meaning you might have a SATA port that's going bad or the
controller itself is broken).

| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
