7.2-RELEASE-p4, IO errors & RAID1 failure

Matthew Lear matt at bubblegen.co.uk
Sat Jun 26 15:57:52 UTC 2010


On Fri, 2010-06-25 at 00:16 -0700, Jeremy Chadwick wrote:
> 
> All in all, replacing a drive is a completely reasonable action when
> there's evidence confirming the need for its replacement.  I don't like
> replacing hardware when there's no indication replacing it will
> necessarily fix the problem; I'd rather understand the problem.
> 
> Matthew, if you're able to take the system down for 2-3 hours, I would
> recommend downloading Western Digital's Data Lifeguard Diagnostics
> software (for DOS; you'll need a CD burner to burn the ISO) and running
> that on your drive.  If that fails on a Long/Extended test, yep, replace
> the disk.  Said utility tests a lot more than just SMART.

Ok. I've tried this but I think there are some BIOS settings that mean
that the WD DOS env can't find the license file (I've read several
postings about this). I'd rather not mess around with BIOS settings on
the machine I'm trying to restore so I'll remove the drive and plug it
into another machine and attempt to run the WD diagnostics on it. I'll
post the results here if anything interesting crops up.

> If it passes the test, then we're back at square one, and you can try
> replacing the disk if you'd like (then boot from the 2nd disk in the
> RAID-1 array).  My concern is that replacing it isn't going to fix
> anything (meaning you might have a SATA port that's going bad or the
> controller itself is broken).
> 

Meanwhile, I powered off the RAID 1 machine, removed the [apparently]
faulty drive (ad0), also removed the 160G drive that was a slave on ATA
channel 0 (just to simplify things since it wasn't part of the array),
replaced ad0 with a brand spanking new one (same make/model), switched
the BIOS to boot from the 2nd disk (i.e. ad2) and booted the machine.
Bootmgr started fine, booted the kernel and the machine booted normally.
atacontrol status on ar0 gives:

ar0: ATA RAID1 status: DEGRADED
 subdisks:
   0 ---- MISSING
   1 ---- ONLINE
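
For scripted monitoring, the status output above can be checked for the
DEGRADED state. This is a minimal sketch: the function name array_degraded
is hypothetical, and it assumes the "status: DEGRADED" wording shown in this
post. The sample text is copied from the output above so the check can be
tried without touching the array.

```shell
#!/bin/sh
# Reads `atacontrol status` output on stdin.
# Exits 0 if the output reports a DEGRADED array, non-zero otherwise.
array_degraded() {
    grep -q 'status: DEGRADED'
}

# Sample copied from the status output quoted in this post.
sample='ar0: ATA RAID1 status: DEGRADED
 subdisks:
   0 ---- MISSING
   1 ---- ONLINE'

if printf '%s\n' "$sample" | array_degraded; then
    echo "array is degraded"
fi
```

On the real machine the same check would read from the command itself,
e.g. atacontrol status ar0 | array_degraded.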

Importantly, atacontrol did detect that the RAID was degraded at boot
time:

ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode
ar0: 305245MB <Intel MatrixRAID RAID1> status: DEGRADED
ar0: disk0 DOWN no device found for this subdisk
ar0: disk1 READY (mirror) using ad2 at ata1-master

Just to clarify, the array was created using atacontrol so why it's
reporting Intel MatrixRAID I have no idea.

Trying to rebuild the array with atacontrol rebuild ar0 gives:

atacontrol: ioctl(IOCATARAIDREBUILD): Input/output error

So I tried to detach channel ata0 and reattach it. This appeared to go
ok. Trying to rebuild the array again gave the same error as above.
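
The detach, reattach and rebuild attempt described above corresponds roughly
to the sequence sketched below. The addspare step is an assumption: it is a
documented atacontrol(8) subcommand for adding a spare disk to an existing
array, but it is not mentioned among the steps above, and whether it helps in
this case is not confirmed. The script only prints the commands unless RUN=1
is set, and the run and rebuild_steps helpers are hypothetical names.

```shell
#!/bin/sh
# Dry-run sketch: prints each command unless RUN=1 is set in the environment.
# ARRAY, CHANNEL and NEW_DISK mirror the device names used in this post.
ARRAY=ar0
CHANNEL=ata0
NEW_DISK=ad0

# Execute the command only when RUN=1; otherwise just print it.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

rebuild_steps() {
    run atacontrol detach "$CHANNEL"              # release the channel with the replaced disk
    run atacontrol attach "$CHANNEL"              # re-probe it so the new disk is seen
    run atacontrol addspare "$ARRAY" "$NEW_DISK"  # assumption: not a step taken in the post
    run atacontrol rebuild "$ARRAY"               # start the rebuild
    run atacontrol status "$ARRAY"                # check the array state afterwards
}

rebuild_steps
```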

I found a post on nabble (can't find it now!) where a chap was having
the same problem rebuilding his RAID1 array using atacontrol rebuild.
Turns out that because it's a software RAID array, atacontrol rebuild
won't work. The only recommended way to get the array back on track was
to dd the contents of the healthy drive onto the new drive. I tried this
just to see what would happen:

dd if=/dev/ad2 of=/dev/ad0 bs=1024k

Seemed to work just fine, as expected. I was hoping that after another
reboot, atacontrol would have seen ad0 as the missing array device on
channel 0, done anything required and hey presto, I'd have a healthy RAID
1 array again.

Sadly, not. atacontrol still insists that the array is DEGRADED despite
having manually mirrored the contents of ad2 to ad0.
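
To rule out the copy itself as the problem, the dd step can be followed by a
byte-for-byte comparison. This is a minimal sketch under a few assumptions:
copy_and_verify is a hypothetical helper, the demonstration runs on scratch
files rather than real disks, and on a running system the source disk may
change during the copy, so a mismatch there would not by itself prove a bad
copy.

```shell
#!/bin/sh
# copy_and_verify SRC DST
# Raw-copies SRC to DST with dd, then compares the two byte for byte.
# Exits 0 only if the copy succeeded and the contents are identical.
copy_and_verify() {
    dd if="$1" of="$2" bs=1024k 2>/dev/null || return 1
    cmp -s "$1" "$2"
}

# Demonstration on scratch files, NOT on real disks.
src=$(mktemp) && dst=$(mktemp) || exit 1
printf 'pretend disk contents\n' > "$src"
if copy_and_verify "$src" "$dst"; then
    echo "copy verified"
fi
rm -f "$src" "$dst"
```

With the devices from this post the call would be
copy_and_verify /dev/ad2 /dev/ad0, which overwrites ad0.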

Is this a case of RTFM some more or have I missed something? It should
surely be possible to restore the array?!

--  Matt



More information about the freebsd-stable mailing list