7.2-RELEASE-p4, IO errors & RAID1 failure

Fri Jun 18 07:37:19 UTC 2010

Hi there,

I'm running 7.2-RELEASE-p4 on an i386 HP server (ML G5) in RAID1
configuration. Very recently, I've seen IO errors such as:

ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=20472527

reported, and the RAID mirror is now offline:

ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=395032335
ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=395032335
ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode

Strangely, I've run some SMART tests on the device and no error has been
recorded. Health checks pass. Running a long test on the device doesn't
show any problem. While SMART can be manufacturer-specific, I at least
expected to see something that looked suspicious.

The drives in the RAID exist on two separate ATA channels:
[root@meshuga /home/matt]# atacontrol list
ATA channel 0:
    Master:  ad0 <WDC WD3200AAKS-00VYA0/12.01B02> SATA revision 2.x
    Slave:   ad1 <FB160C4081/HPF0> SATA revision 1.x
ATA channel 1:
    Master:  ad2 <WDC WD3200AAKS-00VYA0/12.01B02> SATA revision 2.x
    Slave:       no device present
ATA channel 2:
    Master: acd0 <HL-DT-ST DVDRAM GH22NS40/NL01> SATA revision 1.x
    Slave:       no device present
ATA channel 3:
    Master:      no device present
    Slave:       no device present

ad1 is a third 160G drive that I periodically back up to using cron.

I've seen the thread below but I'm not using ZFS. This seems similar to
what I'm experiencing.
http://freebsd.monkey.org/freebsd-stable/200801/msg00617.html

I'm using software RAID with atacontrol but the drives are not hot-swap.
Therefore I expect that I need to detach ad0 from the RAID, power down
the unit, replace the drive, power on the unit and rebuild the array in
order to fix things. Trouble is, I'm struggling to find out if this can
be done safely with atacontrol and the hw configuration I have, and if
so, how best to do it?
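
To make the question concrete, here is a rough, untested sketch of the
sequence I have in mind. It assumes the array is ar0 and that the
replacement disk shows up as ad0 again after the swap. The addspare,
rebuild and status subcommands are from atacontrol(8); the helper names
(run, before_swap, after_swap) and the DRY_RUN variable are just made up
for this sketch. I've left out an explicit "atacontrol detach" because
the box gets powered down anyway, and because detach works on a whole
channel and channel 0 also carries ad1.

```sh
#!/bin/sh
# Sketch only. Defaults to a dry run that just prints each command;
# set DRY_RUN=0 to really execute them (as root).
ARRAY=ar0   # array name, taken from the kernel messages
DISK=ad0    # failing member, taken from the kernel messages

# Hypothetical helper: echo the command, run it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Phase 1, before powering down: confirm what the array and channels look like.
before_swap() {
    run atacontrol status "$ARRAY"
    run atacontrol list
}

# (Power down, replace the physical disk, power back on.)

# Phase 2, after the new disk is installed.
after_swap() {
    run atacontrol addspare "$ARRAY" "$DISK"   # hand the new disk to the array
    run atacontrol rebuild "$ARRAY"            # copy from the good member; may block until done
    run atacontrol status "$ARRAY"             # check the resulting array state
}

case "${1:-}" in
    before) before_swap ;;
    after)  after_swap ;;
esac
```

Saved as, say, swap.sh, "sh swap.sh after" would only print the three
commands unless DRY_RUN=0 is set, so the plan can be reviewed before
anything touches the array.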

It may well be a case of RTFM (again) but I just wanted to run this by
the community to get some feedback. Losing data is not an option here
so hopefully I can get the machine back up on its feet soon.

Any help or feedback much appreciated.
Thanks,
--  Matt