Constant minor ZFS corruption

Chris Forgeron cforgeron at acsi.ca
Tue Mar 8 22:40:03 UTC 2011


Don't rule out disk corruption here.

I've run close to 50 Seagate 1.5TB drives under ZFS now, and I've had ZFS reject 3 of them without ever being able to find a problem with those drives using any other diagnostic program (SeaTools, full disk checks, etc.), including some of our data recovery software that gives very detailed information about errors.

It was always those three drives that acted up, and when I pulled them and replaced them with fresh ones, my checksum errors stopped.

This was under 9-CURRENT, probably the Dec 12-12 build that I experimented with for a while. 

It shouldn't matter which NFS client you use: if you're seeing ZFS checksum errors in zpool status, they won't be caused by whatever program is doing the writing.

Have you made sure it's not always the same drives showing the checksum errors? It may take a few days to know for sure...
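
For example, something like this (pool name taken from the zpool status output quoted below) makes it easy to see whether the CKSUM counts keep landing on the same devices:

# log the per-device checksum counters once an hour
while true; do
    date
    zpool status dread | grep -E 'CKSUM|gpt/bay'
    sleep 3600
done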

...oh, and don't forget about the fun you can have with SAS expanders. I assume you're using one?

I've got 2 LSI2008-based controllers in my 9-CURRENT machine without any fuss. That's running a 24-disk mirror right now.

-----Original Message-----
From: owner-freebsd-fs at freebsd.org [mailto:owner-freebsd-fs at freebsd.org] On Behalf Of Stephen McKay
Sent: Tuesday, March 08, 2011 10:25 AM
To: freebsd-fs at freebsd.org
Cc: Stephen McKay
Subject: Constant minor ZFS corruption

Hi!

At work I've built a few ZFS-based FreeBSD boxes, culminating in a
decent-sized rack-mount server with 12 2TB disks.  Unfortunately,
I can't make this server stable.

Over the last week or so I've repeated a cycle of:

 1) copy 1TB of data from an NFS mount into a ZFS filesystem
 2) scrub

So far, every one of these cycles has exhibited checksum errors in one or both stages.
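
In shell terms each cycle is roughly the following; the NFS mount point and destination dataset names here are just placeholders:

# step 1: copy about 1TB from the NFS mount into a ZFS filesystem
cp -Rp /mnt/nfsdata /dread/copytest
# step 2: scrub, then look for new CKSUM counts
zpool scrub dread
zpool status -v dread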

Using smartmontools, I know that none of the disks have reported any errors.  No errors are reported by the disk drivers either (ahci and mps).  No ECC (MCA) errors are reported.  The problem occurs with 8.2.0 and with 9-current (note: I kept the zfsv15 pool).  I've swapped the memory for a different brand and I'm now running it at low speed (800MHz).
I disabled hyperthreading and all the other funky CPU-related things I could find in the BIOS.  I've tried both the normal and the "new experimental"
NFS client (mount -t newnfs ...).  Nothing so far has had any effect.
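
For reference, the smartmontools checks mentioned above are of this sort (device names are just examples; the ahci disks appear as ada devices, the mps ones as da devices):

# SMART health, error logs and reallocation/pending counters for one onboard disk
smartctl -a /dev/ada2 | egrep -i 'health|error|realloc|pending'
# and for one disk behind the mps controller
smartctl -a /dev/da0 | egrep -i 'health|error|realloc|pending'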

At all times I can build "world" with no errors, even if I put in stupidly high parallel "-j" values and cause severe swapping.
I tried both with the source on ufs and with it on zfs.  No problems.
So the hardware seems generally stable.
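
That test is essentially the following; the -j value is arbitrary, just high enough to oversubscribe the 4-core CPU and force swapping:

# parallel buildworld as a general stress test
cd /usr/src
make -j 16 buildworld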

I wrote a program to generate streams of pseudorandom trash (using
srandom() and random()).  I wrote a TB of this to the ZFS pool and read it back.  No problems.  I even wrote a few hundred GB to two files in parallel.  Again, no problems.  So ZFS itself seems generally sound.
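
One way to do that read-back check is to regenerate the same seeded stream and compare it with what comes off the pool.  A sketch, using a hypothetical generator called trashgen that takes a seed and a byte count (the real program isn't shown here):

# write 1TB of seeded pseudorandom data, then regenerate the stream and compare
./trashgen 42 1000000000000 > /dread/test/junk
./trashgen 42 1000000000000 | cmp - /dread/test/junk && echo "read-back matches"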

However, copying 1TB of data from NFS to ZFS always corrupts just a few blocks, as reported by ZFS during the copy, or in the subsequent scrub.
These corrupted blocks may be on any disk or disks, and are not limited to just one controller or a subset of disks or to one vdev.  ZFS has always successfully reconstructed the data, but I'm hoping to use that redundancy to guard against failing disks, not against whatever gremlin is scrambling my data on the way in.

The hardware is:

Asus P7F-E (includes 6 3Gb/s SATA ports)
PIKE2008 (8 port SAS card based on LSI2008 chip, supports 6Gb/s)
Xeon X3440 (2.53GHz 4core with hyperthreading)
Chenbro CSPC-41416AB rackmount case
2x 2GB 1333MHz ECC DDR3 RAM (Corsair) (currently using 1x 2GB Kingston ECC RAM)
2x Seagate ST3500418AS 500GB normal disks, for OS booting
12x Seagate ST2000DL003 2TB "green" disks (yes, with 4kB sectors)
	(4 disks on the onboard Intel SATA controller using ahci driver,
	 8 disks on the PIKE using the mps driver)

What experiments do you think I should try?

I note that during the large copies from NFS to ZFS, the "inactive"
page list takes all the spare memory, starving the ARC, which drops to its minimum size.  During make world and my junk creation tests the ARC remained full size.  Could there be a bug in the ARC shrinking code?
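
Those numbers can be watched during a copy with sysctl, e.g.:

# current ARC size versus its configured minimum
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_min
# size of the inactive page queue (in pages)
sysctl vm.stats.vm.v_inactive_count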

I also note that -current spits out:
	kernel: log_sysevent: type 19 is not implemented
instead of what 8.2.0 produces:
	root: ZFS: checksum mismatch, zpool=dread path=/dev/gpt/bay14 offset=766747611136 size=4096

I have added some code to cddl/compat/opensolaris/kern/opensolaris_sysevent.c
to print NVLIST elements (type 19) and hope to see the results at the end of the next run.

BTW, does /etc/devd.conf need tweaking now?  If ZFSv28 produces error messages in a different format, they may not be logged.  Indeed, I have added a printf in log_sysevent() because I can't (yet) make devd do what I want.
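
For reference, the 8.2.0 message above is produced by a devd rule of roughly this shape; if the v28 events carry a different type string, a corresponding rule would be needed (the match values below are from memory and may not be exact):

notify 10 {
	match "system"	"ZFS";
	match "type"	"ereport.fs.zfs.checksum";
	action "logger 'ZFS: checksum mismatch, zpool=$pool path=$vdev_path offset=$zio_offset size=$zio_size'";
};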

Also, -current produces many scary lock order reversals.  Are we still ignoring these?

Here's the pool layout:

# zpool status
  pool: dread
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scan: scrub in progress since Tue Mar  8 15:14:51 2011
    5.66T scanned out of 8.49T at 402M/s, 2h2m to go
    92K repaired, 66.71% done
config:

        NAME           STATE     READ WRITE CKSUM
        dread          ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            gpt/bay3   ONLINE       0     0     0
            gpt/bay4   ONLINE       0     0     6  (repairing)
            gpt/bay5   ONLINE       0     0     0
            gpt/bay6   ONLINE       0     0     0
            gpt/bay7   ONLINE       0     0     0
            gpt/bay8   ONLINE       0     0     0
          raidz2-1     ONLINE       0     0     0
            gpt/bay9   ONLINE       0     0     1  (repairing)
            gpt/bay10  ONLINE       0     0     6  (repairing)
            gpt/bay11  ONLINE       0     0     2  (repairing)
            gpt/bay12  ONLINE       0     0     0
            gpt/bay13  ONLINE       0     0     8  (repairing)
            gpt/bay14  ONLINE       0     0     0

errors: No known data errors

Bay3 through 6 are on the onboard controller.  Bay7 through 14 are on the PIKE card.

Each disk is partitioned alike:

# gpart show ada2   
=>        34  3907029101  ada2  GPT  (1.8T)
          34          94        - free -  (47K)
         128         128     1  freebsd-boot  (64K)
         256  3906994176     2  freebsd-zfs  (1.8T)
  3906994432       34703        - free -  (17M)
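
That layout corresponds to roughly these commands per disk (the GPT label differs per bay, and writing the bootcode is omitted):

gpart create -s gpt ada2
gpart add -b 128 -s 128 -t freebsd-boot ada2
gpart add -s 3906994176 -t freebsd-zfs -l bay3 ada2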

I used the well-known tricks to fool ZFS into using ashift=12, so that writes are aligned for these 4kB-sector drives that lie about their sector size.
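
For reference, the usual form of that trick is the gnop one; ashift is fixed per vdev when the vdev is created, so one 4kB-sectored gnop provider per vdev is enough.  Roughly:

gnop create -S 4096 /dev/gpt/bay3 /dev/gpt/bay9
zpool create dread \
    raidz2 gpt/bay3.nop gpt/bay4 gpt/bay5 gpt/bay6 gpt/bay7 gpt/bay8 \
    raidz2 gpt/bay9.nop gpt/bay10 gpt/bay11 gpt/bay12 gpt/bay13 gpt/bay14
zpool export dread
gnop destroy /dev/gpt/bay3.nop /dev/gpt/bay9.nop
zpool import dread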

The next run will take NFS out of the equation (substituting SSH as a transport).  Any ideas on what I could try after that?
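
For what it's worth, the SSH transport will be something like tar over ssh; the host and paths below are placeholders:

ssh fileserver 'tar -cf - -C /export/data .' | tar -xpf - -C /dread/copytest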

Stephen McKay.

PS Anybody got a mirror of http://www.sun.com/msg/ZFS-8000-9P and similar pages?  Oracle has hidden them all, so it's a bit silly to refer to them in our ZFS implementation.