Too many uncorrectable read errors with ATAng

Fri Nov 7 10:10:10 PST 2003

Since upgrading the bento package machines to -current, I have been
getting a lot of the following errors:

ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>

For example:

ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
ad0: TIMEOUT - READ_DMA retrying (2 retries left)
ata0: resetting devices ..
ad0: FAILURE - already active DMA on this device
ad0: setting up DMA failed
panic: initiate_write_inodeblock_ufs2: already started
Debugger("panic")
Stopped at      Debugger+0x54:  xchgl   %ebx,in_Debugger.0
db> trace
Debugger(c0739e72,c07ac4a0,c074d9d0,d897b7a4,100) at Debugger+0x54
panic(c074d9d0,c058d793,d897b7cc,c058d72b,c07af7e0) at panic+0xd5
initiate_write_inodeblock_ufs2(c54c8780,cec0f1e8,1,c5a88400,c46f2b40) at initiate_write_inodeblock_ufs2+0x6e6
softdep_disk_io_initiation(cec0f1e8,c073916a,167,1,fcf58000) at softdep_disk_io_initiation+0x8d
spec_xstrategy(c4ed3b68,cec0f1e8,c13e6720,c4e791bc,200200a0) at spec_xstrategy+0x117
spec_specstrategy(d897b8ec,d897b914,c05adbf4,d897b8ec,1) at spec_specstrategy+0x72
spec_vnoperate(d897b8ec,1,c073ff9e,360,0) at spec_vnoperate+0x18
bwrite(cec0f1e8,cec0f1e8,1,8000,0) at bwrite+0x424
ffs_update(c5aab490,1,d897b9b0,c058d72b,c07af880) at ffs_update+0x31b
ffs_truncate(c5aab490,0,0,c00,0) at ffs_truncate+0x8d8
ufs_inactive(d897bbfc,d897bc18,c05c1a13,d897bbfc,0) at ufs_inactive+0x10c
ufs_vnoperate(d897bbfc,0,c074185c,8d3,c07953a0) at ufs_vnoperate+0x18
vput(c5aab490,825d2,0,d897bc38,c074185c) at vput+0x143
handle_workitem_remove(c5b40a20,0,2,c07afa88,c4e63800) at handle_workitem_remove+0x1d1
process_worklist_item(0,0,3faba10a,0,d897bcf0) at process_worklist_item+0x19e
softdep_process_worklist(0,0,c074185c,6e0,0) at softdep_process_worklist+0xe0
sched_sync(0,d897bd48,c0737724,311,aaf2e368) at sched_sync+0x384
fork_exit(c05c0770,0,d897bd48) at fork_exit+0xb4
fork_trampoline() at fork_trampoline+0x8
--- trap 0x1, eip = 0, esp = 0xd897bd7c, ebp = 0 ---
db>
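
For anyone reading the FAILURE lines above, here is a minimal sketch of
how the status=51 and error=40 values break down into the flags shown in
angle brackets.  The bit positions follow the standard ATA status and
error registers; READY, DSC, ERROR and UNCORRECTABLE are the labels from
the log, while the other names and the decode() helper are my own
illustrative choices, not the driver's code.

```python
# ATA status register bits. READY, DSC and ERROR are the labels printed
# in the log; the remaining names are illustrative labels.
STATUS_BITS = {
    0x80: "BUSY",
    0x40: "READY",
    0x20: "DEVICE_FAULT",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORRECTED",
    0x02: "INDEX",
    0x01: "ERROR",
}

# ATA error register bits. UNCORRECTABLE is the label printed in the log;
# the remaining names are illustrative labels.
ERROR_BITS = {
    0x80: "ICRC",
    0x40: "UNCORRECTABLE",
    0x20: "MEDIA_CHANGED",
    0x10: "ID_NOT_FOUND",
    0x08: "MEDIA_CHANGE_REQUEST",
    0x04: "ABORTED",
    0x02: "NO_MEDIA",
    0x01: "ILLEGAL_LENGTH",
}


def decode(value, table):
    """Return the names of the bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if value & bit]


# The log prints the register values in hex without a 0x prefix.
status_flags = decode(0x51, STATUS_BITS)
error_flags = decode(0x40, ERROR_BITS)
print("status=51<%s> error=40<%s>"
      % (",".join(status_flags), ",".join(error_flags)))
```

In other words, the drive reports itself ready with the seek complete,
and flags an error whose only cause bit is an uncorrectable data error on
the medium.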

So far this has happened on 5 separate machines (the panic above is
new), all of which were working on older -current.  These are all
IBM DeathStar drives, but previously I was only experiencing ata
errors every month or two, and overwriting the drive from /dev/zero
would clear them for another month or two.

To suddenly start receiving errors on 5 out of 7 drives in the past
few weeks is a significant anomaly.  Perhaps one of the following is
happening:

1) All my drives have performed mass suicide at once

2) ATAng is detecting errors that ATAog did not

3) ATAng is not trying as hard as ATAog to recover from the errors
from the crappy drives

4) ATAng has a bug on this hardware.

Furthermore, I'd like to know why the panic above occurred.

Following is an excerpt from boot -v:

atapci0: <Intel ICH UDMA66 controller> port 0xffa0-0xffaf at device 31.1 on pci0
ata0: reset tp1 mask=03 ostat0=50 ostat1=00
ata0-master: stat=0x50 err=0x01 lsb=0x00 msb=0x00
ata0-slave:  stat=0x00 err=0x01 lsb=0x00 msb=0x00
ata0: reset tp2 mask=03 stat0=50 stat1=00 devices=0x1<ATA_MASTER>
ata0: at 0x1f0 irq 14 on atapci0
ata0: [MPSAFE]
ata1: at 0x170 irq 15 on atapci0
ata1: [MPSAFE]
[...]
ata0-master: pio=0x0c wdma=0x22 udma=0x45 cable=80pin
ad0: setting UDMA66 on Intel ICH chip
GEOM: create disk ad0 dp=0xc47a4070
ad0: <IBM-DTLA-307030/TX4OA50C> ATA-5 disk at ata0-master
ad0: 29314MB (60036480 sectors), 59560 C, 16 H, 63 S, 512 B
ad0: 16 secs/int, 1 depth queue, UDMA66
GEOM: new disk ad0
GEOM: Configure ad0b, start 0 length 1073741824 end 1073741823
GEOM: Configure ad0c, start 0 length 30738677760 end 30738677759
GEOM: Configure ad0e, start 1073741824 length 2147483648 end 3221225471
GEOM: Configure ad0f, start 3221225472 length 27517452288 end 30738677759
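
As a sanity check that the disk geometry and label in the excerpt above
are self-consistent (so a bad label is unlikely to be the cause), the
figures can be cross-checked with a short sketch.  All numbers are copied
from the boot -v output; the variable names and the choice of which
checks to run are my own.

```python
SECTOR_BYTES = 512

# From the "ad0: 29314MB (60036480 sectors), 59560 C, 16 H, 63 S" line.
sectors = 60036480
cylinders, heads, sectors_per_track = 59560, 16, 63

# (start, length) in bytes, from the "GEOM: Configure" lines.
partitions = {
    "ad0b": (0, 1073741824),
    "ad0c": (0, 30738677760),
    "ad0e": (1073741824, 2147483648),
    "ad0f": (3221225472, 27517452288),
}

disk_bytes = sectors * SECTOR_BYTES
chs_sectors = cylinders * heads * sectors_per_track
size_mb = disk_bytes // (1024 * 1024)

# The b, e and f partitions should tile the disk with no gaps or overlap.
offset = 0
contiguous = True
for name in ("ad0b", "ad0e", "ad0f"):
    start, length = partitions[name]
    if start != offset:
        contiguous = False
    offset = start + length

print("C*H*S matches sector count:", chs_sectors == sectors)
print("size in MB:", size_mb)
print("ad0c spans whole disk:", partitions["ad0c"] == (0, disk_bytes))
print("b/e/f contiguous and end at disk end:",
      contiguous and offset == disk_bytes)
```

Every check agrees with the probed size, so the label itself looks sane.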

Kris
