gmirror refused to connect second disk after a reboot

Sun Jun 6 19:45:18 UTC 2010

On Sun, Jun 06, 2010 at 01:55:51PM -0500, Scott Lambert wrote:
> I have one dual PIII machine doing the same to me.  I've been assuming
> my issue is with the ATA controller.  ...
> 
> Dec 11 02:01:48 netmon kernel: ad2: TIMEOUT - READ_DMA retrying (1 retry left) LBA=232068607
> Dec 11 02:02:00 netmon kernel: ad2: setting PIO4 on ROSB4 chip
> Dec 11 02:02:00 netmon kernel: ad2: setting UDMA33 on ROSB4 chip
> Dec 11 02:02:00 netmon kernel: ad2: TIMEOUT - READ_DMA retrying (1 retry left) LBA=232766751
> Dec 11 02:02:10 netmon kernel: ad0: setting PIO4 on ROSB4 chip
> Dec 11 02:02:10 netmon kernel: ad0: setting UDMA33 on ROSB4 chip
> Dec 11 02:02:10 netmon kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=232006207
> Dec 11 02:02:36 netmon kernel: ad0: setting PIO4 on ROSB4 chip
> Dec 11 02:02:36 netmon kernel: ad0: setting UDMA33 on ROSB4 chip
> Dec 11 02:02:36 netmon kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=242232479
> Dec 11 02:02:37 netmon kernel: ad2: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=242234911
> Dec 11 02:02:37 netmon kernel: ad0: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=242235039
> Dec 11 02:02:37 netmon kernel: ad2: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=242234911
> Dec 11 02:02:37 netmon kernel: ad0: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=242235039
> Dec 11 02:02:37 netmon kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=242234911
> Dec 11 02:02:37 netmon kernel: ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=242235039
> Dec 11 02:02:37 netmon kernel: GEOM_MIRROR: Request failed (error=5). ad2[READ(offset=124024274432, length=65536)]
> Dec 11 02:02:37 netmon kernel: GEOM_MIRROR: Device gm0: provider ad2 disconnected.
> Dec 11 02:02:37 netmon kernel: GEOM_MIRROR: Request failed (error=5). ad0[READ(offset=124024339968, length=65536)]
> Dec 11 02:02:37 netmon kernel: g_vfs_done():mirror/gm0s1e[READ(offset=112213082112, length=131072)]error = 5
> Dec 11 02:02:47 netmon kernel: ad0: setting PIO4 on ROSB4 chip
> Dec 11 02:02:47 netmon kernel: ad0: setting UDMA33 on ROSB4 chip
> Dec 11 02:02:47 netmon kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=242234911
> Dec 11 02:02:47 netmon kernel: ad0: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=242235039
> Dec 11 02:02:47 netmon kernel: ad0: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=242235039
> Dec 11 02:02:47 netmon kernel: ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=242235039
> Dec 11 02:02:47 netmon kernel: g_vfs_done():mirror/gm0s1e[READ(offset=112213082112, length=131072)]error = 5
> Dec 11 02:02:50 netmon kernel: ad0: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=232478271
> Dec 11 02:02:50 netmon kernel: ad0: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=232478271
> Dec 11 02:02:50 netmon kernel: ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=232478271
> Dec 11 02:02:50 netmon kernel: g_vfs_done():mirror/gm0s1e[READ(offset=107217682432, length=131072)]error = 5

I agree -- it looks like you have either a bad PATA cable, a PATA
controller port which has gone bad, or a PATA controller which is
behaving *very* badly (internal IC problems).  ICRC errors indicate data
transmission failures between the controller and the disk.
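
To see at a glance how the errors are spread across the two disks, the
kernel lines can be tallied per device and severity.  What follows is
only a rough sketch in Python, not anything from the original report:
the regular expression and the tally() helper are my own illustration,
and LOG holds just a handful of the lines quoted above rather than the
full log.

```python
import re
from collections import Counter

# A few kernel lines copied verbatim from the quoted log (not the full log).
LOG = """\
Dec 11 02:02:36 netmon kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=242232479
Dec 11 02:02:37 netmon kernel: ad2: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=242234911
Dec 11 02:02:37 netmon kernel: ad0: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=242235039
Dec 11 02:02:37 netmon kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=242234911
Dec 11 02:02:37 netmon kernel: ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=84<ICRC,ABORTED> LBA=242235039
"""

# Illustrative pattern: device name, then the severity word the ata driver prints.
PATTERN = re.compile(r"kernel: (ad\d+): (TIMEOUT|WARNING|FAILURE) - ")


def tally(log_text):
    """Count (device, severity) pairs found in the given log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (device, severity), n in sorted(tally(LOG).items()):
        print(device, severity, n)
```

If both devices show the same mix of timeouts and ICRC errors, that
points more towards something they share than towards one bad disk.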

Since these are classic PATA disks, ad0 is probably the master and ad2
is the slave -- but both are probably on the same physical cable.

The LBAs for both ad0 and ad2 are very close (ad0=242235039,
ad2=242234911), which makes sense since they're in a mirror config.  But
two disks going kaput at the same time, around the same LBA?  I have my
doubts.
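
The numbers in the log can be cross-checked with a little arithmetic.
The sketch below assumes 512-byte sectors (my assumption; the log does
not state the sector size) and compares the failing LBAs against the
byte offsets GEOM_MIRROR reported for the same requests.  The
lba_to_offset() helper is purely illustrative.

```python
SECTOR_BYTES = 512  # assumption: classic 512-byte sectors

# Values copied from the FAILURE and GEOM_MIRROR lines in the quoted log.
AD2_LBA, AD2_OFFSET = 242234911, 124024274432
AD0_LBA, AD0_OFFSET = 242235039, 124024339968
REQUEST_LENGTH = 65536  # length= reported by GEOM_MIRROR for each read


def lba_to_offset(lba, sector_bytes=SECTOR_BYTES):
    """Convert a sector number to a byte offset on the disk."""
    return lba * sector_bytes


if __name__ == "__main__":
    # Each failing LBA lines up with the offset GEOM_MIRROR logged.
    print(lba_to_offset(AD2_LBA) == AD2_OFFSET)
    print(lba_to_offset(AD0_LBA) == AD0_OFFSET)
    # Gap between the two requests, in sectors and in bytes.
    gap_sectors = AD0_LBA - AD2_LBA
    print(gap_sectors, gap_sectors * SECTOR_BYTES)
```

Under that assumption the two LBAs are 128 sectors apart, which is
exactly one 65536-byte request.  That would be consistent with the two
failed reads being adjacent 64 KB pieces, one sent to each disk, rather
than two disks independently developing a defect at the same spot.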

SMART statistics for both disks would help determine whether the disks
themselves are failing or whether they, too, are seeing problems
communicating with the PATA controller.  (This depends on the age of the
disks, though; some older PATA disks don't have the SMART attribute that
describes this.)

What you should be worried about -- FreeBSD sees problems on both ad0
and ad2.  ad2 is offline because of the problem, but ad0 isn't.  Chances
are ad0 is going to fall off the bus eventually because of this problem.
I really hope you do backups regularly (daily) if you plan on just
ignoring this problem.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |