Intel ICH5 UDMA100 controller TIMEOUT - READ_DMA
Jeremy Chadwick
koitsu at FreeBSD.org
Fri May 11 18:11:07 UTC 2007
On Fri, May 11, 2007 at 07:20:06AM -1000, Richard Puga wrote:
> I am working with a new IBM XSeries 226 server.
>
> It worked fine with the original 80 gig drives.
>
> Upon replacing them with 2 new Hitachi 500 gig drives I get DMA timeouts
> at random times while using the on board Intel SATA controller.
>
> I put a Promise SATA controller in the machine and everything works
> great.
There's no mention of what FreeBSD version and kernel build date
you're using. uname -a would be very useful here.
> kernel: ad3: TIMEOUT - READ_DMA retrying (1 retry left) LBA=0
> kernel: ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=324524575
> kernel: ad2: TIMEOUT - READ_DMA retrying (1 retry left) LBA=3780487
> kernel: ad2: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=2651511
> and so on....
The interesting part is that the LBAs are all over the place; they're
not sequential, which (in my opinion) means the drive itself is fine.
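To make the "all over the place" point concrete, here is a minimal Python sketch that pulls the device, command, and LBA out of timeout lines like the ones quoted above and reports the LBA range per device. The regular expression and the helper names (parse_timeouts, lba_span) are my own for illustration; they assume the message layout shown in this thread and are not part of any FreeBSD tool.

```python
import re

# Sample lines copied from the report quoted above.
LOG = """\
kernel: ad3: TIMEOUT - READ_DMA retrying (1 retry left) LBA=0
kernel: ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=324524575
kernel: ad2: TIMEOUT - READ_DMA retrying (1 retry left) LBA=3780487
kernel: ad2: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=2651511
"""

# Assumed message layout: "<dev>: TIMEOUT - <command> retrying (...) LBA=<n>"
TIMEOUT_RE = re.compile(r"(ad\d+): TIMEOUT - (\w+) retrying .*LBA=(\d+)")


def parse_timeouts(text):
    """Return a list of (device, command, lba) tuples from timeout lines."""
    return [(m.group(1), m.group(2), int(m.group(3)))
            for m in TIMEOUT_RE.finditer(text)]


def lba_span(events):
    """Map each device to the (lowest, highest) LBA seen in its timeouts."""
    spans = {}
    for dev, _cmd, lba in events:
        lo, hi = spans.get(dev, (lba, lba))
        spans[dev] = (min(lo, lba), max(hi, lba))
    return spans


if __name__ == "__main__":
    events = parse_timeouts(LOG)
    for dev, (lo, hi) in sorted(lba_span(events).items()):
        print(f"{dev}: LBA range {lo}..{hi}")
```

A wide range with both reads and writes timing out is what I would expect from a controller or driver problem; a bad spot on the platter would more likely show the same LBAs, or neighbouring ones, over and over.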
> atapci1: <Intel ICH5 UDMA100 controller> port
> 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x14a0-0x14af at device 31.1 on pci0
>
> ad4: 476940MB <Hitachi HDT725050VLA360 V56OA73A> at ata2-master SATA150
> ad6: 476940MB <Hitachi HDT725050VLA360 V56OA73A> at ata3-master SATA150
Some clarification:
These drives are not attached to atapci1. They're attached to a
different PCI device. The UDMA100 controller is the ATA/IDE one (read:
old PATA), not the SATA one. What you should be pointing to is something
that looks like this:
atapci0: <Intel ICH5 SATA150 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f irq 18 at device 31.2 on pci0
(The above example is from a machine we have sitting around doing
heavy I/O work due to MySQL. We have no disk problems there.)
Now...
I have seen similar behaviour to what you've described on an Intel-based
SATA controller (ICH6) with a Western Digital drive that I have
personally used and determined to be reliable on Windows and verified as
such with WD's testing software under DOS too. I've only seen this
happen *once* on the system. That system:
FreeBSD eos.sc1.parodius.com 6.2-STABLE FreeBSD 6.2-STABLE #0: Thu Mar 8 10:41:09 PST 2007 root at eos.sc1.parodius.com:/usr/obj/usr/src/sys/EOS i386
atapci0 at pci0:31:2: class=0x010180 card=0x628015d9 chip=0x26528086 rev=0x03 hdr=0x00
vendor = 'Intel Corporation'
device = '82801FR/FRW ICH6R/ICH6RW SATA Controller'
class = mass storage
subclass = ATA
Master: ad0 <WDC WD2500KS-00MJB0/02.01C03> Serial ATA II
Slave: no device present
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
g_vfs_done():ad0s1d[WRITE(offset=16821780480, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=16826417152, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=813531136, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=817922048, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=870563840, length=16384)]error = 5
And SMART (smartctl) shows absolutely no signs of any problems with the
drive (the Temperature_Celsius "In_the_past" failure is how the drive came
from the factory -- I think Western Digital was doing some testing, who
knows.)
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   214   214   021    Pre-fail  Always       -       4283
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4145
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
190 Temperature_Celsius     0x0022   063   042   045    Old_age   Always   In_the_past 37
194 Temperature_Celsius     0x0022   113   092   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%      3925         -
# 2  Extended offline    Completed without error       00%      3921         -
# 3  Short offline       Completed without error       00%      3920         -
# 4  Short offline       Completed without error       00%      3080         -
# 5  Short offline       Completed without error       00%      3039         -
# 6  Short offline       Completed without error       00%      2898         -
# 7  Short offline       Completed without error       00%      2613         -
# 8  Short offline       Completed without error       00%        43         -
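For anyone wondering why attribute 190 reads "In_the_past" while everything else is clean, here is a rough Python sketch of how I understand the WHEN_FAILED column to be derived from VALUE, WORST, and THRESH. The rule coded below (threshold 0 never fails; VALUE at or below THRESH is failing now; WORST at or below THRESH failed in the past) is my reading of smartctl's behaviour and should be treated as an assumption. The function names are made up for this example.

```python
def attribute_state(value, worst, thresh):
    """Classify one SMART attribute from its normalized values.

    Assumed rule, not taken from smartctl's source:
      - a threshold of 0 means the attribute can never fail
      - value <= thresh means it is failing now
      - worst <= thresh means it failed at some point in the past
    """
    if thresh == 0:
        return "-"
    if value <= thresh:
        return "FAILING_NOW"
    if worst <= thresh:
        return "In_the_past"
    return "-"


def parse_row(line):
    """Split one attribute row of the table above into the fields we need."""
    f = line.split()
    return {
        "id": int(f[0]),
        "name": f[1],
        "value": int(f[3]),
        "worst": int(f[4]),
        "thresh": int(f[5]),
        "when_failed": f[8],
    }


# Three rows copied from the smartctl output above.
ROWS = """\
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
190 Temperature_Celsius 0x0022 063 042 045 Old_age Always In_the_past 37
194 Temperature_Celsius 0x0022 113 092 000 Old_age Always - 37
"""

if __name__ == "__main__":
    for line in ROWS.splitlines():
        row = parse_row(line)
        state = attribute_state(row["value"], row["worst"], row["thresh"])
        print(row["id"], row["name"], "computed:", state,
              "reported:", row["when_failed"])
```

For attribute 190 the current VALUE (063) is above THRESH (045), but WORST (042) is below it, which under this rule gives "In_the_past": the drive got hot once, and is not hot now.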
Finally, one can see for RELENG_6 that there are still ongoing changes.
There were some recent ones regarding DMA, but I believe they were for
ATAPI devices and not ATA (disk) devices.
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/
Note my system kernel is from March 8th. Since then, there have been a
lot of changes regarding DMA, including some "oops I broke this" fixes
which may explain what I am seeing, and maybe what you are too. Though
those are in regard to 64-bit DMA, and I believe most of my systems (and
yours?) are using 48-bit DMA.
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/ata-dma.c
Soren might know what's going on here though...
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |