ATA_DMA errors

Johny Mattsson lonewolf-freebsd at earthmagic.org
Fri Jun 24 18:39:19 GMT 2005


twesky wrote:
> I am having ATA_DMA errors on 5.4R and 5 STABLE up to June 16 (haven't
> done a cvsup again).  It doesn't happen on 5.3R or lower.

I've just upgraded my fileserver from 5.1-R to 5.4-R, and I'm seeing 
this problem too now on 3 out of 4 drives.


> The exact error message is below:
> 
> It happens within a few hours of use.  The laptop will then reboot,
> and fsck must be run.  After fsck the timeouts happen within a few
> seconds of booting.

My system uses a "SiI 0680 UDMA133 controller" in addition to the old 
built-in "Intel PIIX4 UDMA33 controller". My system drive hangs off the 
PIIX4 controller and I see no issues with it; only the drives on the SiI 
are affected:

ad0: 8207MB <ST38641A/3.29> [16676/16/63] at ata0-master UDMA33
ad4: 57241MB <ST360021A/3.05> [116301/16/63] at ata2-master UDMA100
ad6: 76319MB <ST380021A/3.19> [155061/16/63] at ata3-master UDMA100
ad7: 152627MB <WDC WD1600JB-00DUA3/75.13B75> [310101/16/63] at ata3-slave UDMA100
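
For anyone wanting to check the same controller split on their own box, 
here is a rough Python sketch that groups drives by controller from dmesg 
text. The function name drives_by_controller is made up for illustration, 
and the patterns are modelled only on the 5.4-R lines quoted in this 
message, so other releases may need different ones.

```python
import re

# Patterns follow the FreeBSD 5.4 dmesg lines quoted in this message.
# Assumption: other releases may print these lines differently.
CONTROLLER = re.compile(r"\b(atapci\d+): <([^>]*)>")
CHANNEL = re.compile(r"\b(ata\d+): channel #\d+ on (atapci\d+)")
DRIVE = re.compile(
    r"\b(ad\d+): \d+MB\s+<[^>]*>\s+\[[^\]]*\]\s+at\s+(ata\d+)-(?:master|slave)"
)

def drives_by_controller(dmesg_text):
    """Map each controller description to the drives attached to it.

    Drives are listed in order of first appearance; a drive that shows up
    more than once in the text is only recorded once.
    """
    names = dict(CONTROLLER.findall(dmesg_text))
    channel_to_pci = dict(CHANNEL.findall(dmesg_text))
    result = {}
    for drive, channel in DRIVE.findall(dmesg_text):
        pci = channel_to_pci.get(channel, "unknown")
        label = names.get(pci, pci)  # fall back to the raw device name
        drives = result.setdefault(label, [])
        if drive not in drives:
            drives.append(drive)
    return result
```

The whitespace matching is deliberately loose (\s+) so that probe lines 
wrapped by a mail client still match when the text is passed in as one 
string.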


Right after the upgrade things worked well for a couple of hours, and 
then I got a reboot all of a sudden. Upon inspection I found tons of 
both "READ_DMA timed out" and "WRITE_DMA UDMA ICRC error" messages in 
the log prior to the reboot. After the reboot it went to do the fsck and 
made it perhaps halfway through before it started churning out READ_DMA 
timed out messages again, followed by the "ad7: warning - removed from 
configuration" message.
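
To put numbers on "tons of", something like the following Python sketch 
could tally the DMA messages per drive from the system log. The name 
tally_dma_errors is hypothetical, and since the exact wording of the 
kernel messages may differ between releases, it only assumes that a line 
names the device ("ad7:") and mentions READ_DMA or WRITE_DMA somewhere 
after it.

```python
import re
from collections import Counter

# Assumption: a relevant log line contains a device name such as "ad7:"
# followed, somewhere later on the same line, by READ_DMA or WRITE_DMA.
# Nothing else about the message wording is relied upon.
ATA_DMA_LINE = re.compile(r"\b(ad\d+):.*?\b(READ_DMA|WRITE_DMA)\b")

def tally_dma_errors(lines):
    """Count matching log lines, keyed by (device, command)."""
    counts = Counter()
    for line in lines:
        match = ATA_DMA_LINE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts
```

Feeding it an open log file (for example /var/log/messages, if that is 
where the kernel messages end up on your system) and printing the 
resulting counter gives a per-drive breakdown of read versus write 
failures.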

Things did not get better from there; with each successive reboot more 
and more started going wrong. In the end, to get the system to boot at 
all, I had to physically disconnect the ad7 drive, but even so I'm 
still getting READ_DMA timed out messages for ad4 and ad6.

Since I'm getting WRITE_DMA errors on both ad6 and ad7 now (I haven't 
written anything to ad4 yet, so I don't know if I'll get errors on that 
one too), and I wasn't a few hours ago when I was running 5.1-R, I 
refuse to believe that two disks have gone bad in that timespan!

I'm not sure what I should do at this point - theoretically I could 
proceed to roll back to 5.1 to prevent further data loss, but I'm 
guessing it'd be good if I kept it for a little while so that I could 
run tests for patches :-/


Seeing the comments about possible failing controller hardware, I might 
see if I can find a replacement controller tomorrow... any ideas in the 
meantime will be appreciated though!

Still feels very iffy that this started happening right after the 
upgrade... I was expecting to get rid of some of the quirks from the 
early preview, not get far worse ones! :-(


Oh, btw, using smartmontools' smartctl, I've gotten the information that 
ad4 has had 32 write errors in total, ad6 has had 0 (despite seeing the 
WRITE_DMA errors in the system log), and ad7 refuses to even talk SMART.


###

Here are the contents of the dmesg from before I pulled ad7 out:

Jun 24 18:22:19 kernel: FreeBSD 5.4-RELEASE #0: Sun May 8 10:21:06 UTC 2005
Jun 24 18:22:19 kernel:
root at harlow.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC
Jun 24 18:22:19 kernel: Timecounter "i8254" frequency 1193182 Hz quality 0
Jun 24 18:22:19 kernel: CPU: Pentium II/Pentium II Xeon/Celeron
(467.73-MHz 686-class CPU)
Jun 24 18:22:19 kernel: Origin = "GenuineIntel" Id = 0x665 Stepping = 5
Jun 24 18:22:19 kernel:
Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
Jun 24 18:22:19 kernel: real memory = 805240832 (767 MB)
Jun 24 18:22:19 kernel: avail memory = 778231808 (742 MB)
Jun 24 18:22:19 kernel: npx0: <math processor> on motherboard
Jun 24 18:22:19 kernel: npx0: INT 16 interface
Jun 24 18:22:19 kernel: acpi0: <AWARD AWRDACPI> on motherboard
Jun 24 18:22:19 kernel: acpi0: Power Button (fixed)
Jun 24 18:22:19 kernel: Timecounter "ACPI-safe" frequency 3579545 Hz
quality 1000
Jun 24 18:22:19 kernel: acpi_timer0: <24-bit timer at 3.579545MHz> port
0x4008-0x400b on acpi0
Jun 24 18:22:19 kernel: cpu0: <ACPI CPU (3 Cx states)> on acpi0
Jun 24 18:22:19 kernel: acpi_throttle0: <ACPI CPU Throttling> on cpu0
Jun 24 18:22:19 kernel: acpi_button0: <Power Button> on acpi0
Jun 24 18:22:19 kernel: pcib0: <ACPI Host-PCI bridge> port
0x5000-0x500f,0x4000-0x4041,0xcf8-0xcff on acpi0
Jun 24 18:22:19 kernel: pci0: <ACPI PCI bus> on pcib0
Jun 24 18:22:19 kernel: agp0: <Intel 82443BX (440 BX) host to PCI
bridge> mem 0xe0000000-0xe3ffffff at device 0.0 on pci0
Jun 24 18:22:19 kernel: pcib1: <PCI-PCI bridge> at device 1.0 on pci0
Jun 24 18:22:19 kernel: pci1: <PCI bus> on pcib1
Jun 24 18:22:19 kernel: isab0: <PCI-ISA bridge> at device 7.0 on pci0
Jun 24 18:22:19 kernel: isa0: <ISA bus> on isab0
Jun 24 18:22:19 kernel: atapci0: <Intel PIIX4 UDMA33 controller> port
0xf000-0xf00f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 7.1 on pci0
Jun 24 18:22:19 kernel: ata0: channel #0 on atapci0
Jun 24 18:22:19 kernel: ata1: channel #1 on atapci0
Jun 24 18:22:19 kernel: uhci0: <Intel 82371AB/EB (PIIX4) USB controller>
port 0x9000-0x901f irq 11 at device 7.2 on pci0
Jun 24 18:22:19 kernel: usb0: <Intel 82371AB/EB (PIIX4) USB controller>
on uhci0
Jun 24 18:22:19 kernel: usb0: USB revision 1.0
Jun 24 18:22:19 kernel: uhub0: Intel UHCI root hub, class 9/0, rev
1.00/1.00, addr 1
Jun 24 18:22:19 kernel: uhub0: 2 ports with 2 removable, self powered
Jun 24 18:22:19 kernel: pci0: <bridge> at device 7.3 (no driver attached)
Jun 24 18:22:19 kernel: atapci1: <SiI 0680 UDMA133 controller> port
0xa400-0xa40f,0xa000-0xa003,0x9c00-0x9c07,0x9800-0x9803,0x9400-0x9407
mem 0xe9001000-0xe90010ff irq 9 at device 10.0 on pci0
Jun 24 18:22:19 kernel: ata2: channel #0 on atapci1
Jun 24 18:22:19 kernel: ata3: channel #1 on atapci1
Jun 24 18:22:19 kernel: pci0: <display, VGA> at device 11.0 (no driver
attached)
Jun 24 18:22:19 kernel: ahc0: <Adaptec 2940 Ultra SCSI adapter> port
0xa800-0xa8ff mem 0xe9000000-0xe9000fff irq 10 at device 12.0 on pci0
Jun 24 18:22:19 kernel: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253
SCBs
Jun 24 18:22:19 kernel: rl0: <RealTek 8139 10/100BaseTX> port
0xac00-0xacff mem 0xe9002000-0xe90020ff irq 11 at device 13.0 on pci0
Jun 24 18:22:19 kernel: miibus0: <MII bus> on rl0
Jun 24 18:22:19 kernel: rlphy0: <RealTek internal media interface> on
miibus0
Jun 24 18:22:19 kernel: rlphy0: 10baseT, 10baseT-FDX, 100baseTX,
100baseTX-FDX, auto
Jun 24 18:22:19 kernel: rl0: Ethernet address: 00:40:f4:28:9d:20
Jun 24 18:22:19 kernel: sio0: <16550A-compatible COM port> port
0x3f8-0x3ff irq 4 flags 0x10 on acpi0
Jun 24 18:22:19 kernel: sio0: type 16550A
Jun 24 18:22:19 kernel: sio1: <16550A-compatible COM port> port
0x2f8-0x2ff irq 3 on acpi0
Jun 24 18:22:19 kernel: sio1: type 16550A
Jun 24 18:22:19 kernel: ppc0: <ECP parallel printer port> port
0x778-0x77b,0x378-0x37b irq 7 drq 3 on acpi0
Jun 24 18:22:19 kernel: ppc0: SMC-like chipset ((ECP/EPP/PS2/NIBBLE) in
COMPATIBLE mode
Jun 24 18:22:19 kernel: ppc0: FIFO with 16/16/16 bytes threshold
Jun 24 18:22:19 kernel: ppbus0: <Parallel port bus> on ppc0
Jun 24 18:22:19 kernel: plip0: <PLIP network interface> on ppbus0
Jun 24 18:22:19 kernel: lpt0: <Printer> on ppbus0
Jun 24 18:22:19 kernel: lpt0: Interrupt-driven port
Jun 24 18:22:19 kernel: ppi0: <Parallel I/O> on ppbus0
Jun 24 18:22:19 kernel: atkbdc0: <Keyboard controller (i8042)> port
0x64,0x60 irq 1 on acpi0
Jun 24 18:22:19 kernel: atkbd0: <AT Keyboard> irq 1 on atkbdc0
Jun 24 18:22:19 kernel: kbd0 at atkbd0
Jun 24 18:22:19 kernel: psm0: <PS/2 Mouse> irq 12 on atkbdc0
Jun 24 18:22:19 kernel: psm0: model IntelliMouse Explorer, device ID 4
Jun 24 18:22:19 kernel: orm0: <ISA Option ROMs> at iomem
0xcd000-0xcd7ff,0xc0000-0xc7fff on isa0
Jun 24 18:22:19 kernel: pmtimer0 on isa0
Jun 24 18:22:19 kernel: fdc0: cannot allocate I/O port (6 ports)
Jun 24 18:22:19 kernel: sc0: <System console> at flags 0x100 on isa0
Jun 24 18:22:19 kernel: sc0: VGA <16 virtual consoles, flags=0x300>
Jun 24 18:22:19 kernel: vga0: <Generic ISA VGA> at port 0x3c0-0x3df
iomem 0xa0000-0xbffff on isa0
Jun 24 18:22:19 kernel: Timecounter "TSC" frequency 467729279 Hz quality 800
Jun 24 18:22:19 kernel: Timecounters tick every 10.000 msec
Jun 24 18:22:19 kernel: ad0: 8207MB <ST38641A/3.29> [16676/16/63] at
ata0-master UDMA33
Jun 24 18:22:19 kernel: ad4: 57241MB <ST360021A/3.05> [116301/16/63] at
ata2-master UDMA100
Jun 24 18:22:19 kernel: ad6: 76319MB <ST380021A/3.19> [155061/16/63] at
ata3-master UDMA100
Jun 24 18:22:19 kernel: ad7: 152627MB <WDC WD1600JB-00DUA3/75.13B75>
[310101/16/63] at ata3-slave UDMA100
Jun 24 18:22:19 kernel: Waiting 15 seconds for SCSI devices to settle
Jun 24 18:22:19 kernel: sa0 at ahc0 bus 0 target 4 lun 0
Jun 24 18:22:19 kernel: sa0: <HP HP35470A 1009> Removable Sequential
Access SCSI-2 device
Jun 24 18:22:19 kernel: sa0: 5.000MB/s transfers (5.000MHz, offset 8)
Jun 24 18:22:19 kernel: sa1 at ahc0 bus 0 target 6 lun 0
Jun 24 18:22:19 kernel: sa1: <SUN DLT4000 CC2E> Removable Sequential
Access SCSI-2 device
Jun 24 18:22:19 kernel: sa1: 10.000MB/s transfers (10.000MHz, offset 15)
Jun 24 18:22:19 kernel: cd0 at ahc0 bus 0 target 5 lun 0
Jun 24 18:22:19 kernel: cd0: <TEAC CD-ROM CD-532S 1.0A> Removable CD-ROM
SCSI-2 device
Jun 24 18:22:19 kernel: cd0: 20.000MB/s transfers (20.000MHz, offset 15)
Jun 24 18:22:19 kernel: cd0: Attempt to query device size failed: NOT
READY, Medium not present
Jun 24 18:22:19 kernel: Mounting root from ufs:/dev/ad0s1a


Cheers,
/Johny
-- 
Johny Mattsson - Making IT work  ,-.   ,-.   ,-.  When all else fails,
http://www.earthmagic.org     _.'  `-'   `-'   Murphy's Law still works!



More information about the freebsd-stable mailing list