Stress testing and TIMEOUT - WRITE_DMA

MaXX bs139412 at skynet.be
Mon Sep 12 06:52:53 PDT 2005


On Fri, 26 Aug 2005 03:21:35 -0600 Anthony Chavez <acc at anthonychavez.org> 
wrote:
> My question is simply this: is the fact that I received 4 TIMEOUT
> warnings in the space of roughly 2 weeks significant cause for concern?
Hi,
You may want to have a look at PR 85603 (FS corruption and 'uncorrectable' DMA 
errors on ATA disks after unclean shutdown) and see if it applies to you.

Are you running a kernel built around mid June this year?
Did your machine panic before the DMA problems appeared (I think a power 
failure can do the trick too)?

Several of us Usenet users were experiencing this kind of problem 
(news://comp.unix.bsd.freebsd.misc, the thread was named "Disaster Recovery?" 
and started 30 Aug 05). If you have the same problem as us, the fix is easy:
- backup your data with tar (will take a while due to timeouts)
- fdisk + newfs 
- restore your backup
- cvsup + upgrade your kernel
and that's all... I was surprised to see my PostgreSQL database come back 
online without a single error message; Pg really hates it when the FS is 
inconsistent...

In our case this problem was fixed by newfs, even though smartctl 
(sysutils/smartmontools) did report errors at the drive level. After newfs'ing 
the disk there were no more messages (but the errors are still in the drive's 
log). 
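
To check whether the messages have really stopped, one could count them in
the system log before and after the newfs. A minimal sketch in Python, not
something I actually ran: the search string is taken from the subject of this
thread, and the log path in the comment is only an assumption about where
your syslog puts kernel messages.

```python
def count_matching_lines(lines, needle="TIMEOUT - WRITE_DMA"):
    """Count the lines that contain the given substring."""
    return sum(1 for line in lines if needle in line)


def count_in_file(path, needle="TIMEOUT - WRITE_DMA"):
    """Count matching lines in a log file, tolerating odd bytes."""
    # The path is supplied by the caller, e.g. /var/log/messages
    # (an assumption, adjust it to your syslog configuration).
    with open(path, "r", errors="replace") as handle:
        return count_matching_lines(handle, needle)
```

Calling count_in_file() on the same log at two different times gives a rough
idea of how often the warning shows up.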

Hope this is relevant to your problem...
--
MaXX

I tested my drive as follows:
On comp.unix.bsd.freebsd.misc MaXX wrote:
> I will stress test the drive to see if it still reliable for some purpose.
I've finished some tests on the drive:

1. filled the drive with huge files (11, 25, 30 and 10 GB), 3 simultaneous
writes => no DMA_READ or DMA_WRITE errors; fsck OK

2. copied /usr/ports 18 times, with some distfiles and work folders (2
simultaneous copies, 9 times; about 4 596 000 files) => no DMA_READ or
DMA_WRITE errors; fsck NOT OK: a bunch of errors which seem to be only at
the file system level.

3. md5 sum of 4 596 000 files before corrective fsck: no errors, burning hot
drive

4. clean reboot + fsck: ok; fsck skipped checks.

5. compare md5 before and after reboot: OK, no missing files/folders, newsum
== oldsum.
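
For anyone who wants to repeat steps 3 and 5, here is a minimal sketch of the
before/after checksum comparison in Python. It is not the script used for the
tests above; the function names are made up for illustration, and it assumes
the whole manifest fits in memory.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file's path (relative to root) to its MD5 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # skip symlinks and special files
            manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest


def compare_manifests(old, new):
    """Return (missing, added, changed) as sorted lists of relative paths."""
    missing = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return missing, added, changed
```

Build one manifest before the reboot, save it somewhere off the tested disk
(for example with the json module), build a second one afterwards and pass
both to compare_manifests(). Three empty lists correspond to the "newsum ==
oldsum" result above.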

I then tried to reproduce the initial problem, but there was no way to do
it... I killed init and pulled the plug while writing or reading. No way to
get those DMA_* errors back (Note: the kernel was not the same as the one
that failed)...

I give up...

Conclusion: the disk is reliable enough to go back to work with a good
backup policy (maybe in a vinum mirror to be sure). The problem seems to be
tied to the kernel the machine had been running since mid June 05.
 



More information about the freebsd-stable mailing list