Apparently spurious ZFS CRC errors (was Re: ZFS data error without
reasons)
Mark Powell
M.S.Powell at salford.ac.uk
Fri Mar 20 04:01:20 PDT 2009
On Mon, 16 Mar 2009, kevin wrote:
> My laptop is a T61. RAM was also tested with memtest86+ and returned no errors.
Same here. Memtest fine.
> "zfs send tank/usr/home/kevin@2009-03-15-16:51:21 | zfs receive backup/kevin"
> hangs the system and I have to power off the machine. When the system is up, I
> find file errors in snapshot tank/usr/home/kevin@2009-03-15-16:51:21. When I
> destroy tank/usr/home/kevin@2009-03-15-16:51:21, then reboot the system, I find
> more errors.
I've moved a box that has been running FreeBSD 7, with a 7x1TB drive RAIDZ2
array, to 8-CURRENT.
I've created the same RAIDZ2 with 8-CURRENT and am restoring data from
tape to the new array (I wanted to rejig the zfs setup). All will appear
well for a while i.e. no CRC errors, can scrub and rescrub the data whilst
the data is restoring without problem. I restored the entire 3.5TB from
tape without error. All data still scrubs fine. Then suddenly I get CRC
errors on every disk. Repeated scrubs show up different amounts of errors.
I just couldn't stop them. So I've started again, this time checking
everything and moving drives onto different controllers to isolate
problems. I have a Gigabyte GA-P35-DS4 MB which has 8xSATA; 6xICH9R &
2xJMB363. It also has an Sil3132 in there which in previous incarnations
had the odd drive on it. There's been mention of Sil problems & even
though the ICH9, JMB363 and Sil3132 had been perfect with 7, I moved
drives off it:
1. Rebuilt kernel and world from last night; Thu Mar 19 18:27:18 GMT 2009.
2. 6x1TB drives on ICH9R
3. 2x500GB on JMB363, striped into 1TB
4. / is ufs on USB key
5. Created RAIDZ2 again
6. Recreated zfs filesystems
7. Started restore from tape.
Same again. I can restore data and perform a scrub after each tape (LTO2
~200GB each) is restored. No errors. Get up to ~350GB, still no errors.
Then the last scrub I've done throws up:
-----
  pool: pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h51m with 0 errors on Fri Mar 20 10:57:18 2009
config:

        NAME             STATE     READ WRITE CKSUM
        pool             ONLINE       0     0     0
          raidz2         ONLINE       0     0    23
            stripe/str0  ONLINE       0     0   489  12.3M repaired
            ad14         ONLINE       0     0   786  19.7M repaired
            ad16         ONLINE       0     0   804  20.1M repaired
            ad18         ONLINE       0     0   754  18.8M repaired
            ad20         ONLINE       0     0   771  19.3M repaired
            ad22         ONLINE       0     0   808  20.2M repaired
            ad24         ONLINE       0     0   848  21.2M repaired

errors: No known data errors
-----
So it happens on both controllers, on plain drives and the stripe. There
just seems no way to get rid of these errors once they appear. As I said,
last time I got the whole 3.5TB restored without error, was using it for a
few days without error, constantly scrubbing to check reliability, then
once the errors appear there's no way to remove them.
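
For anyone wanting to compare the per-device numbers in the scrub report, here is a minimal Python sketch that tabulates the CKSUM column and the repaired amount for each leaf device. It is not part of any ZFS tooling: the regular expression, the function names parse_rows and leaf_devices, and the reading of the "M" suffix as MiB are all assumptions made for illustration. The rows are copied from the report above.

```python
import re

# Config rows copied from the scrub report in this message.
STATUS_ROWS = """\
pool ONLINE 0 0 0
raidz2 ONLINE 0 0 23
stripe/str0 ONLINE 0 0 489 12.3M repaired
ad14 ONLINE 0 0 786 19.7M repaired
ad16 ONLINE 0 0 804 20.1M repaired
ad18 ONLINE 0 0 754 18.8M repaired
ad20 ONLINE 0 0 771 19.3M repaired
ad22 ONLINE 0 0 808 20.2M repaired
ad24 ONLINE 0 0 848 21.2M repaired
"""

# Hypothetical pattern for one config row: name, state, three counters,
# and an optional "<amount><unit> repaired" note.
ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<state>[A-Z]+)\s+(?P<read>\d+)\s+(?P<write>\d+)"
    r"\s+(?P<cksum>\d+)(?:\s+(?P<repaired>[\d.]+)(?P<unit>[KMG])\s+repaired)?\s*$"
)

# Assumption: the suffixes are binary multiples (K = KiB, M = MiB, G = GiB).
UNIT = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_rows(text):
    """Return {device: (cksum_errors, repaired_bytes or None)}."""
    out = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # skips the header row and any blank lines
        repaired = None
        if m["repaired"]:
            repaired = float(m["repaired"]) * UNIT[m["unit"]]
        out[m["name"]] = (int(m["cksum"]), repaired)
    return out


def leaf_devices(rows):
    """Keep only rows that report a repaired amount (the leaf devices here)."""
    return {name: v for name, v in rows.items() if v[1] is not None}


if __name__ == "__main__":
    leaves = leaf_devices(parse_rows(STATUS_ROWS))
    for name, (cksum, repaired) in leaves.items():
        kib_per_error = repaired / cksum / 1024
        print(f"{name:12s} cksum={cksum:4d} KiB/error={kib_per_error:.1f}")
```

Under those assumptions, every device, on either controller and including the stripe, comes out at between 25 and 26 KiB repaired per counted checksum error.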
As this same hardware worked well with 7 for a long time, and can work
perfectly with 8 for several days until the errors strike, this seems like
some curious 8 problem?
Any help would be appreciated. I'll be happy to provide any further info
to help debug this. I didn't want to unnecessarily make this any longer
than it already is.
Cheers.
--
Mark Powell - UNIX System Administrator - The University of Salford
Information & Learning Services, Clifford Whitworth Building,
Salford University, Manchester, M5 4WT, UK.
Tel: +44 161 295 6843 Fax: +44 161 295 5888 www.pgp.com for PGP key