ZFS + replacing failing hard-drive.

Wed Apr 18 16:18:38 UTC 2007

On Wed, Apr 18, 2007 at 04:41:03PM +0200, Ståle Kristoffersen wrote:
> > 
> > I don't think you do.  This appears to be a bug in the ata driver
> > which ZFS is particularly good at triggering.
> 
> I first noticed the problems running UFS on the first partition, and I have
> tried the drive on all of the following controllers:
> atapci0: <SiI 3132 SATA300 controller> port 0xcf00-0xcf7f mem 0xfddff000-0xfddff07f,0xfddf8000-0xfddfbfff irq 19 at device 0.0 on pci4
> atapci1: <JMicron JMB363 SATA300 controller> port 0xaf00-0xaf07,0xae00-0xae03,0xad00-0xad07,0xac00-0xac03,0xab00-0xab0f mem 0xfd9fe000-0xfd9fffff irq 17 at device 0.0 on pci6
> atapci2: <Intel ICH8 SATA300 controller> port 0xfa00-0xfa07,0xf900-0xf903,0xf800-0xf807,0xf700-0xf703,0xf600-0xf60f,0xf500-0xf50f irq 19 at device 31.2 on pci0
> atapci3: <Intel ICH8 SATA300 controller> port 0xf300-0xf307,0xf200-0xf203,0xf100-0xf107,0xf000-0xf003,0xef00-0xef0f,0xee00-0xee0f irq 19 at device 31.5 on pci0
> 
> Same problem on all. And to support my theory that the disk was bad the new
> disk does not behave badly, even after a zpool scrub.

That doesn't prove the disk was/is "bad".  Here I'm using the word "bad" to
mean the disk has had at least 1 non-recoverable failure (i.e. a bad area
on the platter surface was discovered and the drive was unable to remap
it).  As new as SATA300 is, it is doubtful (although possible) that the
drive is "bad"/defective.

> > BTW, the message you show is harmless: see where it says "retrying"?
> > No need to worry until it says "FAILURE - WRITE_DMA timed out".
> 
> Just had a quick peek in the logs and did not find any of them the last
> time, but I do get them:
> Apr 13 21:17:14 fs kernel: ad14: FAILURE - WRITE_DMA48 timed out LBA=719378349
> Apr 13 21:22:23 fs kernel: ad14: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=719341415

I've occasionally noticed that DMA timeouts aren't always reported before
a drive is dropped, and often DMA timeouts *don't* drop the drive at all.
The latter case is good, because I can stop the disk activity and tell
gvinum to start the disk again.  The former confounds me: it has never
been reproducible, so I couldn't track it down.  It could also just be a
syslog issue.
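
To see whether FAILURE lines ever appear for a drive without any retry
messages before them, it can help to tally the ata messages per device.
What follows is only a rough Python sketch, not anything taken from the
ata driver: the names `classify` and `summarize` and the regular
expression are my own assumptions, modelled on the two log lines quoted
above and on the word "retrying" mentioned earlier.  Any other message
shape is simply counted as "other".

```python
import re
from collections import Counter

# Assumed message shape, based on the quoted log lines:
#   "... kernel: ad14: FAILURE - WRITE_DMA48 timed out LBA=719378349"
# Groups: device, severity word, ATA command, remainder of the line.
ATA_MSG = re.compile(r"\b(ad\d+): (\w+) - (\w+)(.*)")


def classify(line):
    """Return (device, kind) for an ata message line, or None if no match."""
    m = ATA_MSG.search(line)
    if m is None:
        return None
    device, severity, _command, rest = m.groups()
    if severity == "FAILURE":
        return device, "failure"
    if "retrying" in rest:
        return device, "retry"
    return device, "other"


def summarize(lines):
    """Count messages per (device, kind) pair."""
    counts = Counter()
    for line in lines:
        result = classify(line)
        if result is not None:
            counts[result] += 1
    return counts


# The two lines quoted in this thread, used as sample input.
sample = [
    "Apr 13 21:17:14 fs kernel: ad14: FAILURE - WRITE_DMA48 timed out "
    "LBA=719378349",
    "Apr 13 21:22:23 fs kernel: ad14: FAILURE - WRITE_DMA48 "
    "status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=719341415",
]

for (device, kind), n in sorted(summarize(sample).items()):
    print(device, kind, n)
```

Feeding it the whole of the messages file instead of `sample` would show
whether a drive has failures logged with no retries at all, which would at
least hint at whether the missing timeouts are a syslog issue.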

> Another issue is that even if all the drives support SATA300, and all the
> controllers do so as well, they still come up as SATA150 (except one).
> (And yeah, I have removed that jumper)
> ad8: 305245MB <Seagate ST3320620AS 3.AAC> at ata4-master SATA300
> ad10: 381554MB <Seagate ST3400620AS 3.AAK> at ata5-master SATA150
> ad14: 305245MB <Seagate ST3320620AS 3.AAC> at ata7-master SATA150
> ad15: 305245MB <Seagate ST3320620AS 3.AAC> at ata7-slave SATA150
> ad16: 305245MB <Seagate ST3320620AS 3.AAE> at ata8-master SATA150

I've noticed this behavior on certain controllers (Intel in particular).
Which drives correspond to which controller cards?
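
As a first step towards answering that, the probe lines can at least be
grouped by ata channel.  The Python below is a hedged sketch under my own
assumptions: `parse_probe` and `by_channel` are made-up names, and the
pattern is fitted only to the five probe lines quoted above.  It does not
map channels to the atapci controllers, since that needs the rest of the
dmesg output, which isn't shown in this thread.

```python
import re
from collections import defaultdict

# Assumed probe-line shape, taken from the lines quoted above:
#   "ad8: 305245MB <Seagate ST3320620AS 3.AAC> at ata4-master SATA300"
PROBE = re.compile(
    r"(ad\d+): (\d+)MB <([^>]+)> at (ata\d+)-(master|slave) (\S+)"
)


def parse_probe(line):
    """Parse one disk probe line into a dict, or return None."""
    m = PROBE.search(line)
    if m is None:
        return None
    dev, size_mb, model, channel, position, mode = m.groups()
    return {
        "dev": dev,
        "size_mb": int(size_mb),
        "model": model,
        "channel": channel,
        "position": position,
        "mode": mode,
    }


def by_channel(lines):
    """Group parsed drives by the ata channel they attach to."""
    channels = defaultdict(list)
    for line in lines:
        drive = parse_probe(line)
        if drive is not None:
            channels[drive["channel"]].append(drive)
    return dict(channels)


# The probe lines quoted in this thread.
sample = [
    "ad8: 305245MB <Seagate ST3320620AS 3.AAC> at ata4-master SATA300",
    "ad10: 381554MB <Seagate ST3400620AS 3.AAK> at ata5-master SATA150",
    "ad14: 305245MB <Seagate ST3320620AS 3.AAC> at ata7-master SATA150",
    "ad15: 305245MB <Seagate ST3320620AS 3.AAC> at ata7-slave SATA150",
    "ad16: 305245MB <Seagate ST3320620AS 3.AAE> at ata8-master SATA150",
]

for channel, drives in by_channel(sample).items():
    for d in drives:
        print(channel, d["position"], d["dev"], d["mode"])
```

With the channel grouping in hand, the remaining step is to match each
ataN channel against the controller it hangs off.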

-- Rick C. Petty