"ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
joe at skyrush.com
Sat Jan 26 10:32:06 PST 2008
I performed a ZFS scrub, which finished yesterday, and no new
/var/log/messages errors were reported during that time. However, the scrub
found something interesting:
crater# zpool status -v
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
scrub: scrub completed with 1 errors on Fri Jan 25 12:52:32 2008
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     2
          ad0s1d    ONLINE       1     3     2
errors: Permanent errors have been detected in the following files:
Note that I have not touched this file since copying it to this drive.
So, it seems one file failed a checksum check during the scrub. As expected,
I now get errors when trying to read this file, probably ZFS reporting the
condition. When I logged in tonight, I got two more disk messages in
/var/log/messages about WRITE_DMA48 TIMEOUT/FAILURE. That might be a
coincidence (they appeared just as I was typing my password).
Also, smartctl still shows PASSED; however, this is interesting:
195 Hardware_ECC_Recovered 0x001a 061 046 000 Old_age Always
The number is much *smaller* now! It was "6" a few minutes before this...
wrap around? Hmm, I'm really not sure, at this point, what is going on.
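
As an aside, here is a minimal sketch of how the attribute row above splits
into columns, assuming the usual smartctl attribute layout (ID, name, flag,
normalized value, worst, threshold, type, updated, then WHEN_FAILED and
RAW_VALUE). The row as pasted stops after "Always", so the raw value is
treated as absent. The names SmartAttr and parse_attr_row are made up for
illustration and are not part of smartmontools.

```python
from typing import NamedTuple, Optional


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    flag: int
    value: int          # normalized value, not the raw counter
    worst: int          # lowest normalized value recorded so far
    thresh: int         # failure threshold for the normalized value
    attr_type: str
    updated: str
    raw: Optional[str]  # None when the row was pasted without RAW_VALUE


def parse_attr_row(row: str) -> SmartAttr:
    """Split one whitespace-separated SMART attribute row into fields."""
    parts = row.split()
    if len(parts) < 8:
        raise ValueError("not a SMART attribute row: %r" % row)
    # Assumed layout: parts[8] is WHEN_FAILED, parts[9:] is RAW_VALUE.
    raw = " ".join(parts[9:]) if len(parts) > 9 else None
    return SmartAttr(
        attr_id=int(parts[0]),
        name=parts[1],
        flag=int(parts[2], 16),
        value=int(parts[3]),
        worst=int(parts[4]),
        thresh=int(parts[5]),
        attr_type=parts[6],
        updated=parts[7],
        raw=raw,
    )


row = "195 Hardware_ECC_Recovered 0x001a 061 046 000 Old_age Always"
attr = parse_attr_row(row)
print(attr.name, attr.value, attr.worst, attr.thresh, attr.raw)
```

Read this way, 061, 046 and 000 are the normalized value, worst value and
threshold. The "6" mentioned above would be the raw value, which sits in the
column missing from the pasted row.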
So I have started a "SeaTools" (disk scanner from Seagate) "long test" of the
drive. The short test passed already. The results should be interesting. If
it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS
bugs that just happen to look like drive problems. I already did a long read
of the disk contents under Linux and got no messages about anything wrong.
If I can turn on any debugging info to help determine if this is
software-related, let me know the magic keywords to use. :)