Apparently spurious ZFS CRC errors (was Re: ZFS data error without reasons)

Fri Mar 20 04:01:20 PDT 2009

On Mon, 16 Mar 2009, kevin wrote:

> My laptop is T61. RAM is also tested by memtest86+ and return no error.

Same here. Memtest fine.

> "zfs send tank/usr/home/kevin@2009-03-15-16:51:21|zfs receive backup/kevin"
> hangs system and i have to power off the machine. when the system up, i find
> file error in snapshot tank/usr/home/kevin@2009-03-15-16:51:21. when i destroy
> tank/usr/home/kevin@2009-03-15-16:51:21, then reboot system, i find more
> errors.

I've moved a box that has been running FreeBSD 7 with a 
7x1TB drive RAIDZ2 array to 8-CURRENT.
   I've created the same RAIDZ2 with 8-CURRENT and am restoring data from 
tape to the new array (I wanted to rejig the zfs setup). All will appear 
well for a while i.e. no CRC errors, can scrub and rescrub the data whilst 
the data is restoring without problem. I restored the entire 3.5TB from 
tape without error. All data still scrubs fine. Then suddenly I get CRC 
errors on every disk. Repeated scrubs show up different amounts of errors.
   I just couldn't stop them. So I've started again, this time checking 
everything and moving drives onto different controllers to isolate 
problems. I have a Gigabyte GA-P35-DS4 MB which has 8xSATA; 6xICH9R & 
2xJMB363. It also has an Sil3132 in there which in previous incarnations 
had the odd drive on it. There's been mention of Sil problems & even 
though the ICH9, JMB363 and Sil3132 had been perfect with 7, I moved 
drives off it:

1. Rebuilt kernel and world from last night; Thu Mar 19 18:27:18 GMT 2009.
2. 6x1TB drives on ICH9R
3. 2x500GB on JMB363, striped into 1TB
4. / is ufs on USB key
5. created RAIDZ2 again
6. recreated zfs filesystems
7. started restore from tape.

Same again. I can restore data and perform a scrub after each tape (LTO2 
~200GB each) is restored. No errors. Get up to ~350GB, still no errors. 
Then the last scrub I've done throws up:

-----
   pool: pool
  state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
         attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
         using 'zpool clear' or replace the device with 'zpool replace'.
    see: http://www.sun.com/msg/ZFS-8000-9P
  scrub: scrub completed after 0h51m with 0 errors on Fri Mar 20 10:57:18 2009
config:

         NAME             STATE     READ WRITE CKSUM
         pool             ONLINE       0     0     0
           raidz2         ONLINE       0     0    23
             stripe/str0  ONLINE       0     0   489  12.3M repaired
             ad14         ONLINE       0     0   786  19.7M repaired
             ad16         ONLINE       0     0   804  20.1M repaired
             ad18         ONLINE       0     0   754  18.8M repaired
             ad20         ONLINE       0     0   771  19.3M repaired
             ad22         ONLINE       0     0   808  20.2M repaired
             ad24         ONLINE       0     0   848  21.2M repaired

errors: No known data errors
-----
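
For anyone wanting to compare the counts between one scrub and the next, here is a minimal sketch in Python that pulls the per-device CKSUM numbers out of output like the above. The function name parse_cksum and the embedded sample are illustrative only and not part of any ZFS tool, and it assumes the counters are printed as plain integers, as they are here.

```python
import re

# Matches config rows such as "ad14  ONLINE  0  0  786  19.7M repaired".
# Assumption: the READ/WRITE/CKSUM counters are plain integers.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)


def parse_cksum(status_text):
    """Return {name: cksum_count} for every row of the config table."""
    counts = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            counts[m.group(1)] = int(m.group(5))
    return counts


# A few rows copied from the status output above.
SAMPLE = """\
  state: ONLINE
         NAME             STATE     READ WRITE CKSUM
         pool             ONLINE       0     0     0
           raidz2         ONLINE       0     0    23
             stripe/str0  ONLINE       0     0   489  12.3M repaired
             ad14         ONLINE       0     0   786  19.7M repaired
"""

if __name__ == "__main__":
    for name, cksum in parse_cksum(SAMPLE).items():
        print(name, cksum)
```

Saving the dictionary after each scrub and diffing it against the previous one would show whether the errors grow evenly across all devices or favour one controller.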

So it happens on both controllers, on plain drives and the stripe. There 
just seems to be no way to get rid of these errors once they appear. As I said, 
last time I got the whole 3.5TB restored without error, was using it for a 
few days without error, constantly scrubbing to check reliability, then 
once the errors appear there's no way to remove them.
   As this same hardware worked well with 7 for a long time, and can work 
perfectly with 8 for several days until the errors strike, this seems like 
some curious 8 problem?
   Any help would be appreciated. I'll be happy to provide any further info 
to help debug this. I didn't want to unnecessarily make this any longer 
than it already is.
   Cheers.

-- 
Mark Powell - UNIX System Administrator - The University of Salford
Information & Learning Services, Clifford Whitworth Building,
Salford University, Manchester, M5 4WT, UK.
Tel: +44 161 295 6843  Fax: +44 161 295 5888  www.pgp.com for PGP key