Stress testing and TIMEOUT - WRITE_DMA

Anthony Chavez acc at anthonychavez.org
Tue Sep 13 17:34:03 PDT 2005


On Mon, 12 Sep 2005 08:19:18 +0200 martin hudec <corwin at aeternal.net> wrote:

> On Sun, Sep 11, 2005 at 10:33:47PM +0200 or thereabouts, Daniel Gerzo wrote:
>> On Fri, 26 Aug 2005 03:21:35 -0600 Anthony Chavez <acc at anthonychavez.org> 
>> wrote:
>> > Sep  6 11:35:27 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348191
>> > ...
>> > Sep  6 18:59:09 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348383
>> > Sep  6 19:04:58 mybox kernel: ad0: TIMEOUT - READ_DMA retrying (2 retries left) LBA=61749183
>> 
>> > The READ_DMA timeouts are happening very infrequently, but it's worth
>> > mentioning that I'm seeing them now in addition.
>> 
>> > This is quite disturbing, particularly when the machine in question is
>> > *in*production.*
>> 
>> I think you should back up your data quickly. The last time I saw
>> this kind of message, my disk died 3 days after the messages started
>> showing up in my log files. I wasn't able to write any data to the
>> disk (the system just suddenly panicked when I tried to mount it rw,
>> but I was able to mount it ro and copy most of the data). Note that I
>> wasn't able to copy about 10GB out of 30GB. So don't ignore them, and
>> good luck.
>
>   Hmmm, before trashing that disk, you could surely consider running
>   smartmontools to see what they have to say about health condition of
>   your disk :).. go for sysutils/smartmontools.

Okay, I've actually got 3 identical drives (SAMSUNG SP0802N) in 3
identical systems, running identical hardware using Intel ICH4
controllers.

Only one of these machines managed to spit 81 errors at me over a period
of about 6.5 hours (so far).  This particular machine produced the
warnings approximately 8 days after installing FreeBSD.
Ironically, another one of these machines only produced 1 warning after
nearly 21 days and then another solitary warning 14 days after that
(which occurred as I was drafting this response).
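For anyone who wants to compare counts like these across machines, here
is a minimal sketch of how the warnings could be tallied from a log.  It
assumes the message format matches the lines quoted earlier in this
thread; the names TIMEOUT_RE and tally_timeouts are made up for
illustration, and the sample lines are the ones quoted above.

```python
import re
from collections import Counter

# Assumed format, taken from the warnings quoted in this thread, e.g.:
# Sep  6 11:35:27 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348191
TIMEOUT_RE = re.compile(
    r"kernel: (?P<dev>\w+): TIMEOUT - (?P<op>\w+) retrying "
    r"\((?P<left>\d+) retr(?:y|ies) left\) LBA=(?P<lba>\d+)"
)


def tally_timeouts(lines):
    """Count TIMEOUT warnings per (device, operation) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("op"))] += 1
    return counts


# Sample input: the three warnings quoted earlier in this thread.
sample = [
    "Sep  6 11:35:27 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348191",
    "Sep  6 18:59:09 mybox kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=8348383",
    "Sep  6 19:04:58 mybox kernel: ad0: TIMEOUT - READ_DMA retrying (2 retries left) LBA=61749183",
]

for (dev, op), count in sorted(tally_timeouts(sample).items()):
    print(dev, op, count)
```

To run it against a real log, pass an open file object instead of the
sample list; which file holds the kernel messages depends on the local
syslog configuration.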

smartctl reports that each of these drives passes the "SMART
overall-health self-assessment test" but goes on to report exactly 6
"SET MAX ADDRESS [OBS-6]" errors occurring for each drive within 1 hour
of uptime.  I do not think that any of these errors occurred at the same
time the DMA warnings did.

>   After that can one make assumptions whether it is faulty hardware or
>   ata patches :).

Well, the drives are pretty much brand new.  I think that it's safe to
assume that the health of these drives is not a concern, and smartctl
seems to confirm this.

On Mon, 12 Sep 2005 15:53:27 +0200 MaXX <bs139412 at skynet.be> wrote:

> On Fri, 26 Aug 2005 03:21:35 -0600 Anthony Chavez <acc at anthonychavez.org> 
> wrote:
>> My question is simply this: is the fact that I received 4 TIMEOUT
>> warnings in the space of roughly 2 weeks significant cause for concern?
> Hi,
> You may have a look at this pr :85603  (FS corruption and 'uncorrectable' DMA 
> errors on ATA disks after unclean shutdown) and see if that applies for you.

Thanks.  My hardware doesn't match, but I'll keep it in mind.

> Are you running a kernel built around mid June this year?

The machine that gave me 81 warnings after applying ata-mk3n:

FreeBSD 5.4-RELEASE-p6 #0: Sun Sep 11 21:57:16 MDT 2005     root at mybox1:/usr/obj/usr/src/sys/MYBOX1

The machine that's been in commission the longest:

FreeBSD 5.4-RELEASE #0: Sun Sep 11 21:46:18 MDT 2005     root at mybox2:/usr/obj/usr/src/sys/MYBOX2

New kid on the block:

FreeBSD 5.4-RELEASE-p6 #0: Sun Sep 11 21:58:08 MDT 2005     root at mybox3:/usr/obj/usr/src/sys/MYBOX3

FWIW, although they have different names, the kernel configs are exactly
the same.

> Did your machine panic before the DMA problems appeared (I think a power 
> failure can do the trick too)?

No panic.  However, I recall reading that these warnings are a good
indication that a panic may be imminent, hence my call for help.

> In our case this problem was fixed by newfs, even though smartctl 
> (sysutils/smartmontools) did report errors at the drive level. After
> newfs'ing the disk, no more messages (but they are still in the drive's log). 

That seems very strange, particularly since I newfs'ed the disks when
installing FreeBSD.

Furthermore, this solution is not sufficient.  The machines that are
giving me this error are in crucial locations and I need to know what
causes these errors and if a fix is available or if I really should
worry about a few popping up now and then.

-- 
Anthony Chavez                                 http://anthonychavez.org/
mailto:acc at anthonychavez.org         jabber:acc at jabber.anthonychavez.org


More information about the freebsd-stable mailing list