Stress testing and TIMEOUT - WRITE_DMA
bs139412 at skynet.be
Mon Sep 12 06:52:53 PDT 2005
On Fri, 26 Aug 2005 03:21:35 -0600 Anthony Chavez <acc at anthonychavez.org>
> My question is simply this: is the fact that I received 4 TIMEOUT
> warnings in the space of roughly 2 weeks significant cause for concern?
You may want to look at PR 85603 (FS corruption and 'uncorrectable' DMA
errors on ATA disks after unclean shutdown) and see if it applies to you.
Are you running a kernel built around mid June this year?
Did your machine panic before the DMA problems appeared (I think a power
failure can do the trick too)?
Several of us Usenet users were experiencing this kind of problem
(the news://comp.unix.bsd.freebsd.misc thread was named "Disaster Recovery?" and
started 30 Aug 05). If you have the same problem as us, the fix is easy:
- back up your data with tar (this will take a while due to the timeouts)
- fdisk + newfs
- restore your backup
- cvsup + upgrade your kernel
and that's all... And I was surprised to see my PostgreSQL database come
back online without a single error message. Pg really hates it when the FS is
In our case this problem was fixed by newfs, even though smartctl
(sysutils/smartmontools) did report errors at the drive level. After newfs'ing
the disk there were no more messages (but the old ones are still in the drive's log).
Hope this is relevant to your problem...
I tested my drive as follows:
On comp.unix.bsd.freebsd.misc MaXX wrote:
> I will stress test the drive to see if it still reliable for some purpose.
I've finished some tests on the drive:
1. filled the drive with huge files (11, 25, 30, 10 GB), 3 simultaneous writes =>
no DMA_READ or DMA_WRITE errors; fsck OK
2. copied /usr/ports 18 times, with some distfiles and work folders (2
simultaneous copies, 9 times; about 4 596 000 files) => no DMA_READ or
DMA_WRITE errors; fsck NOT OK: a bunch of errors which seem to be only at
the file system level.
3. md5 sum of 4 596 000 files before corrective fsck: no errors, burning hot
4. clean reboot + fsck: OK; fsck skipped checks.
5. compare md5 before and after reboot: OK, no missing files/folders, newsum
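
For anyone wanting to repeat the checksum comparison in steps 3 and 5, here is a minimal sketch in Python. It is not the script used for the tests above; the function names (md5_of_file, build_manifest, compare_manifests) are made up for illustration. It assumes the test tree fits the usual case of regular files and skips symlinks.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map every regular file under root (as a relative path) to its MD5."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest


def compare_manifests(before, after):
    """Return (missing, added, changed) as sorted lists of relative paths."""
    missing = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    changed = sorted(
        path for path in before.keys() & after.keys()
        if before[path] != after[path]
    )
    return missing, added, changed
```

Build one manifest before the reboot, store it somewhere off the tested disk (json.dump works for a plain dict), build a second one afterwards and pass both to compare_manifests. Three empty lists mean no missing, extra or altered files.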
I then tried to reproduce the initial problem, but there was no way to do it... I killed
init and pulled the plug while writing or reading. No way to get those DMA_*
errors back (note: the kernel was not the same as the one that failed)...
I give up...
Conclusion: the disk is reliable enough to go back to work with a good
backup policy (maybe in a vinum mirror to be sure). The problem seems to be
tied to the kernel the machine had been running since mid June 05.
More information about the freebsd-stable