7.2-RELEASE-p4, IO errors & RAID1 failure

Jeremy Chadwick freebsd at jdc.parodius.com
Sat Jun 26 17:12:54 UTC 2010


On Sat, Jun 26, 2010 at 04:57:48PM +0100, Matthew Lear wrote:
> On Fri, 2010-06-25 at 00:16 -0700, Jeremy Chadwick wrote:
> > 
> > All in all, replacing a drive is a completely reasonable action when
> > there's evidence confirming the need for its replacement.  I don't like
> > replacing hardware when there's no indication replacing it will
> > necessarily fix the problem; I'd rather understand the problem.
> > 
> > Matthew, if you're able to take the system down for 2-3 hours, I would
> > recommend downloading Western Digital's Data Lifeguard Diagnostics
> > software (for DOS; you'll need a CD burner to burn the ISO) and running
> > that on your drive.  If that fails on a Long/Extended test, yep, replace
> > the disk.  Said utility tests a lot more than just SMART.
> 
> Ok. I've tried this but I think there are some BIOS settings that mean
> that the WD DOS env can't find the license file (I've read several
> postings about this). I'd rather not mess around with BIOS settings on
> the machine I'm trying to restore so I'll remove the drive and plug it
> into another machine and attempt to run the WD's diagnostics on it. I'll
> post the results here if anything interesting crops up.
> 
> > If it passes the test, then we're back at square one, and you can try
> > replacing the disk if you'd like (then boot from the 2nd disk in the
> > RAID-1 array).  My concern is that replacing it isn't going to fix
> > anything (meaning you might have a SATA port that's going bad or the
> > controller itself is broken).
> > 
> 
> Meanwhile, I powered off the RAID 1 machine, removed the [apparently]
> faulty drive (ad0), also removed the 160G drive that was a slave on ATA
> channel 0 (just to simplify things since it wasn't part of the array),
> replaced ad0 with a brand spanking new one (same make/model), switched
> the BIOS to boot from the 2nd disk (ie ad2) and booted the machine.
> Bootmgr started fine, booted the kernel and the machine booted normally.
> atacontrol status on ar0 gives:
> 
> ar0: ATA RAID1 status: DEGRADED
>  subdisks:
>    0 ---- MISSING
>    1 ---- ONLINE
> 
> Importantly, atacontrol did detect that the RAID was degraded at boot
> time:
> 
> ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode
> ar0: 305245MB <Intel MatrixRAID RAID1> status: DEGRADED
> ar0: disk0 DOWN no device found for this subdisk
> ar0: disk1 READY (mirror) using ad2 at ata1-master

Does "atacontrol list" show the existence of disks ad0 and ad2?  If so,
then the message probably indicates "ad0 exists but there's missing
metadata, so I'm ignoring it".  If not, then I have no real explanation
other than it sounds like the SATA controller is broken.
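
If you'd rather script that check than eyeball it, something like the
following might do.  This is only a rough sketch: has_disk is a helper
name made up here, and it simply looks for the device name as a whole
word in whatever "atacontrol list" prints, without assuming anything
else about the output layout.

```shell
#!/bin/sh
# has_disk NAME: succeed if NAME appears as a whole word in the listing
# read from stdin.  (Hypothetical helper, not part of atacontrol.)
has_disk() {
    grep -qw -- "$1"
}

# Only query the controller if atacontrol is actually installed.
if command -v atacontrol >/dev/null 2>&1; then
    listing=$(atacontrol list)
    for d in ad0 ad2; do
        if printf '%s\n' "$listing" | has_disk "$d"; then
            echo "$d: listed"
        else
            echo "$d: not listed"
        fi
    done
fi
```

If ad0 comes back as "not listed" with the new drive plugged in, that
points at the port or controller rather than at the array metadata.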

> Just to clarify, the array was created using atacontrol so why it's
> reporting Intel MatrixRAID I have no idea.

Are you absolutely 100% positively certain that your system/motherboard
does not have "SATA RAID" enabled in the system BIOS?  The ar0 "Intel
MatrixRAID" line really has me concerned.  If MatrixRAID is indeed
enabled in the BIOS, then almost all these problems can be explained.

> Trying to rebuild the array with atacontrol rebuild ar0 gives:
> 
> atacontrol: ioctl(IOCATARAIDREBUILD): Input/output error
>
> So I tried to detach channel ata0 and reattach it. This appeared to go
> ok. Trying to rebuild the array again gave the same error as above.

More on this later.

> I found a post on nabble (can't find it now!) where a chap was having
> the same problem rebuilding his RAID1 array using atacontrol rebuild.
> Turns out that because it's a software RAID array, atacontrol rebuild
> won't work. The only recommended way to get the array back on track was
> to dd the contents of the healthy drive onto the new drive. I tried this
> just to see what would happen:
> 
> dd if=/dev/ad2 of=/dev/ad0 bs=1024k
> 
> Seemed to work just fine as expected. I was hoping that after another
> reboot, atacontrol would have seen ad0 as the missing array device on
> channel 0, done anything required and hey presto, I'd have a healthy RAID
> 1 array again.
> 
> Sadly, not. atacontrol still insists that the array is DEGRADED despite
> having manually mirrored the contents of ad2 to ad0.

This probably has to do with corrupt/missing/incorrect metadata.  The dd
method (to copy disk X to disk Y) isn't sufficient.

The atacontrol man page states the following for your situation:

   If the system has a pure software array and is not using a "real" ATA
   RAID controller, then shut the system down, make sure that the disk that
   was still working is moved to the bootable position (channel 0 or what‐
   ever the BIOS allows the system to boot from) and the blank disk is
   placed in the secondary position, then boot the system into single-user
   mode and issue the command:

           atacontrol addspare ar0 ad6
           atacontrol rebuild ar0

So I believe what the man page is telling you to do is:

1) Power down the system
2) Physically connect the ad2 (working/has-data) disk to SATA channel 0
3) Physically connect the ad0 (brand-new) disk to SATA channel 1
4) Make mental note that the disk names will now be swapped: ad0 will
   now be the working/has-data disk, and ad2 will be the brand-new disk
5) Power up the system and make sure you're booting from SATA channel 0
6) Go into single-user mode
7) Execute:
   atacontrol addspare ar0 ad2
   atacontrol rebuild ar0

I have no idea if this will work or not.
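
For the final "Execute" step, a cautious way to run it might be a small
wrapper that only prints the commands unless told otherwise.  Treat
this as a sketch under assumptions: rebuild_cmds and the
ATA_REBUILD_RUN variable are names invented here, and ar0/ad2 assume
the disks were swapped as described in the list above.

```shell
#!/bin/sh
# rebuild_cmds ARRAY SPARE: print the two commands from the atacontrol
# man page, one per line.  (Hypothetical helper.)
rebuild_cmds() {
    printf 'atacontrol addspare %s %s\n' "$1" "$2"
    printf 'atacontrol rebuild %s\n' "$1"
}

# Dry run by default.  With ATA_REBUILD_RUN=1 the commands are fed to
# "sh -ex", which echoes each one and stops at the first failure.
if [ "${ATA_REBUILD_RUN:-0}" = 1 ]; then
    rebuild_cmds ar0 ad2 | sh -ex
    atacontrol status ar0
else
    rebuild_cmds ar0 ad2
fi
```

Run it once without the variable set to confirm the device names are
what you expect after the swap, then again with ATA_REBUILD_RUN=1 from
single-user mode.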

If this doesn't work, I'm out of ideas other than restoring from backups
or running in degraded mode to back up your data, then afterward,
rebuild the system using something like gmirror.
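
If you do end up going the gmirror route, the core of it is roughly
the following.  This is a minimal sketch, not a tested procedure: it
assumes two blank (or disposable) disks named ad0 and ad2, a mirror
called gm0, and the round-robin balance algorithm, all of which are
choices made here for illustration.  gmirror_plan is a made-up helper
that only prints the commands so you can review them first.

```shell
#!/bin/sh
# gmirror_plan NAME DISK1 DISK2: print a possible command sequence for
# creating a two-disk GEOM mirror.  (Hypothetical helper; prints only.)
gmirror_plan() {
    # Load the geom_mirror kernel module for the current boot.
    printf 'gmirror load\n'
    # Create the mirror across both providers.
    printf 'gmirror label -v -b round-robin %s /dev/%s /dev/%s\n' "$1" "$2" "$3"
    # Show mirror and component state.
    printf 'gmirror status\n'
}

gmirror_plan gm0 ad0 ad2
```

To have the module loaded at every boot you would also want
geom_mirror_load="YES" in /boot/loader.conf; the mirror then shows up
as /dev/mirror/gm0.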

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |


