Intel ICH5 UDMA100 controller TIMEOUT - READ_DMA

Fri May 11 18:11:07 UTC 2007

On Fri, May 11, 2007 at 07:20:06AM -1000, Richard Puga wrote:
> I am working with a new IBM XSeries 226 server.
> 
> It worked fine with the original 80 gig drives.
> 
> Upon replacing them with 2 new Hitachi 500 gig drives I get DMA timeouts
> at random times while using the on board Intel SATA controller.
> 
> I put a Promise SATA controller in the machine and everything works
> great.

There's no mention of what FreeBSD version and kernel build date
you're using.  uname -a would be very useful here.

>  kernel: ad3: TIMEOUT - READ_DMA retrying (1 retry left) LBA=0
>  kernel: ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=324524575
>  kernel: ad2: TIMEOUT - READ_DMA retrying (1 retry left) LBA=3780487
>  kernel: ad2: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=2651511
> and so on....

The interesting part is that the LBAs are all over the place; they're
not sequential, which (in my opinion) means the drive itself is fine.
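
To make the "all over the place" observation concrete, here is a minimal
Python sketch that pulls the device, command and LBA out of the quoted
timeout lines and prints the distance between consecutive failures on one
disk.  The function names (parse_timeouts, lba_gaps) are made up for
illustration, and the regular expression only assumes the line format
shown in the quoted log.

```python
import re

# Log lines copied from the quoted report (the wrapped line is rejoined).
LOG = """\
kernel: ad3: TIMEOUT - READ_DMA retrying (1 retry left) LBA=0
kernel: ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=324524575
kernel: ad2: TIMEOUT - READ_DMA retrying (1 retry left) LBA=3780487
kernel: ad2: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=2651511
"""

# Matches e.g. "ad2: TIMEOUT - READ_DMA retrying (1 retry left) LBA=3780487"
TIMEOUT_RE = re.compile(r"(ad\d+): TIMEOUT - (\w+) retrying .* LBA=(\d+)")


def parse_timeouts(text):
    """Return (device, command, lba) tuples for every timeout line in text."""
    return [(m.group(1), m.group(2), int(m.group(3)))
            for m in TIMEOUT_RE.finditer(text)]


def lba_gaps(events, device):
    """Distance in sectors between consecutive timeouts on one device."""
    lbas = [lba for dev, _, lba in events if dev == device]
    return [abs(b - a) for a, b in zip(lbas, lbas[1:])]


events = parse_timeouts(LOG)
for dev, cmd, lba in events:
    print(dev, cmd, lba)
print("ad2 gaps (sectors):", lba_gaps(events, "ad2"))
```

Three samples on ad2 prove nothing by themselves, but gaps of over a
million and hundreds of millions of sectors don't look like one localized
bad spot on the platter.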

> atapci1: <Intel ICH5 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x14a0-0x14af at device 31.1 on pci0
> 
> ad4: 476940MB <Hitachi HDT725050VLA360 V56OA73A> at ata2-master SATA150
> ad6: 476940MB <Hitachi HDT725050VLA360 V56OA73A> at ata3-master SATA150

Some clarification:

These drives are not attached to atapci1; they're attached to a
different PCI device.  UDMA100 is the ATA/IDE port (read: old PATA), not
a SATA port.  What you should be pointing to is something that looks
like this:

atapci0: <Intel ICH5 SATA150 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f irq 18 at device 31.2 on pci0

(The above example is from a machine we have sitting around doing
heavy I/O work due to MySQL.  We have no disk problems there.)

Now...

I have seen similar behaviour to what you've described on an Intel-based
SATA controller (ICH6) with a Western Digital drive that I have
personally used and determined to be reliable on Windows and verified as
such with WD's testing software under DOS too.  I've only seen this
happen *once* on the system.  That system:

FreeBSD eos.sc1.parodius.com 6.2-STABLE FreeBSD 6.2-STABLE #0: Thu Mar 8 10:41:09 PST 2007 root at eos.sc1.parodius.com:/usr/obj/usr/src/sys/EOS  i386

atapci0 at pci0:31:2:      class=0x010180 card=0x628015d9 chip=0x26528086 rev=0x03 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82801FR/FRW ICH6R/ICH6RW SATA Controller'
    class      = mass storage
    subclass   = ATA

Master:  ad0 <WDC WD2500KS-00MJB0/02.01C03> Serial ATA II
Slave:       no device present

ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
g_vfs_done():ad0s1d[WRITE(offset=16821780480, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=16826417152, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=813531136, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=817922048, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=870563840, length=16384)]error = 5

And SMART (smartctl) shows absolutely no signs of any problems with the
drive (the Temperature_Celsius "In_the_past" flag is how the drive came
from the factory -- I think Western Digital was doing some testing, who
knows.)

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   214   214   021    Pre-fail  Always       -       4283
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4145
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
190 Temperature_Celsius     0x0022   063   042   045    Old_age   Always   In_the_past 37
194 Temperature_Celsius     0x0022   113   092   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%      3925         -
# 2  Extended offline    Completed without error       00%      3921         -
# 3  Short offline       Completed without error       00%      3920         -
# 4  Short offline       Completed without error       00%      3080         -
# 5  Short offline       Completed without error       00%      3039         -
# 6  Short offline       Completed without error       00%      2898         -
# 7  Short offline       Completed without error       00%      2613         -
# 8  Short offline       Completed without error       00%        43         -
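
For anyone who wants to check an attribute table like the one above
mechanically, here is a rough Python sketch.  It assumes the ten-column
layout printed above, and it assumes the usual reading of the columns: an
attribute is failing now when VALUE <= THRESH, and failed in the past when
WORST <= THRESH (with a THRESH of 000 meaning "never fails").  The helper
names (parse_attributes, classify) are invented for this example.

```python
# Attribute rows copied from the smartctl output above.
SMART = """\
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   214   214   021    Pre-fail  Always       -       4283
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4145
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
190 Temperature_Celsius     0x0022   063   042   045    Old_age   Always   In_the_past 37
194 Temperature_Celsius     0x0022   113   092   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
"""


def parse_attributes(text):
    """Turn each ten-column attribute row into a dict; skip anything else."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        if len(f) < 10 or not f[0].isdigit():
            continue
        rows.append({"id": int(f[0]), "name": f[1], "value": int(f[3]),
                     "worst": int(f[4]), "thresh": int(f[5]),
                     "when_failed": f[8], "raw": f[9]})
    return rows


def classify(row):
    """Assumed rule: compare normalized VALUE and WORST against THRESH."""
    if row["thresh"] > 0 and row["value"] <= row["thresh"]:
        return "FAILING_NOW"
    if row["thresh"] > 0 and row["worst"] <= row["thresh"]:
        return "In_the_past"
    return "ok"


rows = parse_attributes(SMART)
flagged = [(r["id"], r["name"], classify(r)) for r in rows if classify(r) != "ok"]
print(flagged)
```

On this table the only row that gets flagged is attribute 190, whose
WORST of 042 sits below its THRESH of 045; the reallocation, pending
sector and CRC counters all have a raw value of 0.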

Finally, one can see for RELENG_6 that there are still ongoing changes.
There were some recent ones regarding DMA, but I believe they were for
ATAPI devices and not ATA (disk) devices.

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/

Note my system kernel is from March 8th.  Since then, there have been a
lot of changes regarding DMA, including some "oops I broke this" fixes
which may explain what I am seeing, and maybe what you are too.  Though
this is in regard to 64-bit DMA, and I believe most of my systems (and
yours?) are using 48-bit DMA.

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/ata-dma.c
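
A side note on terminology, offered as my understanding rather than
anything stated in this thread: the "48" in a command name such as
WRITE_DMA48 refers to 48-bit LBA sector addressing, which is a separate
matter from the width of the DMA addresses that the 64-bit changes deal
with.  A minimal Python sketch, assuming 512-byte sectors and a helper
name (needs_lba48) invented for illustration, shows which of the quoted
LBAs fall beyond what a 28-bit command can address:

```python
MAX_28BIT_LBA = 2**28 - 1   # highest sector number a 28-bit LBA can hold
SECTOR_BYTES = 512          # assumption: traditional 512-byte sectors


def needs_lba48(lba, sector_count=1):
    """True if the last sector of the request lies beyond the 28-bit range."""
    return lba + sector_count - 1 > MAX_28BIT_LBA


# LBAs taken from the quoted TIMEOUT lines.
quoted_lbas = [0, 324524575, 3780487, 2651511]
for lba in quoted_lbas:
    print(lba, "needs 48-bit LBA" if needs_lba48(lba) else "fits in 28-bit LBA")

# Capacity reachable with 28-bit LBA, in GiB.
limit_gib = (MAX_28BIT_LBA + 1) * SECTOR_BYTES / 2**30
print("28-bit LBA limit:", limit_gib, "GiB")
```

Only LBA 324524575 lies past the 28-bit limit, which is consistent with
it being the one request in the quoted log that was issued as WRITE_DMA48.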

Soren might know what's going on here though...

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |