"ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1

Sat Jan 26 10:32:06 PST 2008

I performed a ZFS scrub, which finished yesterday, and no new
/var/log/messages errors were reported during that time.  However, the scrub
found something interesting:

crater# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 1 errors on Fri Jan 25 12:52:32 2008
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     2
          ad0s1d    ONLINE       1     3     2

errors: Permanent errors have been detected in the following files:

/home/joe/music/jukebox/christmas/Esquivel/Merry_XMas_from_the_SpaceAge_Bachelor_Pad/07-Snowfall.mp3

Note that I have not touched this file since copying it to this drive.

So, it seems one file failed a checksum check during the scrub.  As expected,
I now get errors when trying to read this file, presumably ZFS reporting the
corruption.  When I logged in tonight, two more WRITE_DMA48 TIMEOUT/FAILURE
disk messages appeared in /var/log/messages just as I was typing my password,
which might be a coincidence.

Also, smartctl still shows PASSED; however, this is interesting:

195 Hardware_ECC_Recovered  0x001a   061   046   000    Old_age   Always       -       9070

The number is much *smaller* now!  It was "6" a few minutes before this...
wrap around?  Hmm, I'm really not sure, at this point, what is going on.
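
To watch this attribute over time rather than eyeballing it, something like
the following might help.  It is only a minimal sketch in Python, assuming the
usual ten-column attribute row printed by "smartctl -A" (ID, name, flag, value,
worst, thresh, type, updated, when_failed, raw).  The function names are
made up for illustration and are not part of smartmontools.  Raw values are
vendor-specific, so the sketch keeps the raw field as a string and only
compares the normalized value against the threshold.

```python
def parse_smart_attribute(line):
    """Parse one attribute row of `smartctl -A` style output into a dict.

    Assumes the ten-column layout:
    ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    # Split into at most 10 fields so a raw value containing spaces stays whole.
    fields = line.split(None, 9)
    if len(fields) < 10 or not fields[0].isdigit():
        raise ValueError("not a SMART attribute row: %r" % line)
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),    # normalized value, e.g. "061" -> 61
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": fields[9].strip(),   # vendor-specific, kept as text
    }


def is_failing(attr):
    """True if the normalized value has dropped to or below a nonzero threshold."""
    return attr["thresh"] > 0 and attr["value"] <= attr["thresh"]


if __name__ == "__main__":
    row = "195 Hardware_ECC_Recovered 0x001a 061 046 000 Old_age Always - 9070"
    attr = parse_smart_attribute(row)
    print(attr["name"], attr["value"], attr["worst"], attr["thresh"], attr["raw"])
    print("failing:", is_failing(attr))
```

Logging the parsed value, worst, and raw fields with a timestamp every few
minutes would show whether the raw number really wraps around or just
fluctuates.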

So I have started a "SeaTools" (disk scanner from Seagate) "long test" of the
drive.  The short test passed already.  The results should be interesting.  If
it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS
bugs that just happen to look like drive problems.  I already did a long read
of the disk contents under Linux and got no messages about anything wrong.

If I can turn on any debugging info to help determine if this is
software-related, let me know the magic keywords to use.  :)

							-Joe