dealing with a failing drive

Wed Nov 14 17:26:21 PST 2007

From: "Jerry McAllister" <jerrymc at msu.edu>
Sent: Monday, November 12, 2007 12:53

> On Mon, Nov 12, 2007 at 09:26:38AM -0800, David Newman wrote:
> 
>> On 11/12/07 8:14 AM, Jerry McAllister wrote:
>> 
>> > An update: After doing what you suggest (leaving in the "good" disk,
>> > adding a new disk, RAID rebuilding) I still got soft write errors --
>> > with *either one* of the disks I tried.
>> > 
>> > Then I tried putting both disks in an identical server and they came up
>> > fine, no read or write errors.
>> > 
>> > Ergo, the RAID controller is bad and the disks may be OK.
>> > 
>> >> Probably not.
>> >> Generally, if the RAID controller is bad, you will see errors
>> >> all over and not in just one place, though I suppose it is possible.
>> >> Check and see what it reports as error locations and see if they
>> >> move around any.
>> 
>> Jerry, thanks for your response.
>> 
>> After 36 hours of running the same disks in a different, identical
>> machine there hasn't been a single read or write error. I'm hardly a
>> storage expert but from the evidence I have I'm inclined to believe the
>> root cause was a bad RAID controller and not failed disks.
> 
> That is not much proof. 
> The different machine would probably be accessing the disks in
> a different way, either slightly different positioning or using
> different space. Also, 36 hours is not really much time.

Dn, I have had a Promise controller that was bad. I kept getting errors
at one specific location on two disks out of three on a RAID 5. The
system continued to operate. When I finally spent the time to nail it
down to the controller I found the Promise people more than anxious to
get the beast for a postmortem. It had been bad for me from day one. It
would take about a week to a month for the problem to appear. After the
6th disk showing the problem at the same block number the coin dropped
in my sometimes overly slow mind.
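
The check that finally gave it away, the same block number turning up
on several different disks, can be roughly sketched in Python. This is
only an illustration: the (disk, block) pairs, the device names and the
function name are made up, and you would have to pull the real pairs
out of whatever your controller or kernel actually logs.

```python
from collections import defaultdict


def blocks_shared_across_disks(errors, min_disks=2):
    """Return {block: sorted disk names} for every block number that
    reported an error on at least `min_disks` distinct disks."""
    disks_by_block = defaultdict(set)
    for disk, block in errors:
        disks_by_block[block].add(disk)
    return {
        block: sorted(disks)
        for block, disks in disks_by_block.items()
        if len(disks) >= min_disks
    }


# Hypothetical error records as (disk name, block number) pairs.
# These values are invented for illustration, not taken from a real log.
sample_errors = [
    ("ad4", 1048576),
    ("ad6", 1048576),
    ("ad8", 1048576),
    ("ad4", 52310),
]

shared = blocks_shared_across_disks(sample_errors)
print(shared)
```

A block number that keeps showing up on different physical disks is a
hint that something upstream of the disks may be at fault. Errors that
stay with one disk, wherever it is plugged in, point more toward that
disk.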

{^_-}    Joanne