ZFS: 'checksum mismatch' all over the place

Kenneth Vestergaard Schmidt kvs at pil.dk
Sat Aug 18 03:27:09 PDT 2007


Hello.

We've just put a 12x750 GB raidz2 pool into use, but we're seeing
constant 'checksum mismatch' errors. The drives are brand new.
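
For reference, the pool was created along these lines (reconstructed
from memory, so take the exact invocation as approximate):

        zpool create pil raidz2 da0 da1 da2 da3 da4 da5 \
            da6 da7 da8 da9 da10 da11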

'zpool status' currently lists the following:

        NAME        STATE     READ WRITE CKSUM
        pil         ONLINE       0     0 189.9
          raidz2    ONLINE       0     0 189.9
            da0     ONLINE       0     0 2.99K
            da1     ONLINE       0     0   606
            da2     ONLINE       0     0    75
            da3     ONLINE       0     0 1.94K
            da4     ONLINE       0     0   786
            da5     ONLINE       0     0    88
            da6     ONLINE       0     0    79
            da7     ONLINE       0     0    99
            da8     ONLINE       0     0   533
            da9     ONLINE       0     0 1.38K
            da10    ONLINE       0     0    15
            da11    ONLINE       0     0   628

da0-da11 are really logical drives on an EonStor SCSI drive cage. The
physical disks are SATA, but since the EonStor can't run in JBOD mode,
I've had to create one logical drive per physical drive and map each
onto a separate SCSI LUN.
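
All twelve LUNs show up on the host as separate da(4) devices; a quick
sanity check (just the command, I won't paste the full output) is:

        camcontrol devlist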

The drive cage was previously used to expose a RAID-5 array composed of
the same 12 disks. That worked just fine, connected to the same machine
and controller (i386 IBM xSeries X335, mpt(4) controller).

The EonStor can report SMART statistics for each SATA drive, and
everything looks peachy there.

What puzzles me is that the drives don't seem to be failing - they just
accumulate checksum errors. If they had hard failures, ZFS should have
marked them as faulted. The errors are also spread across all 12 disks,
and I have a hard time believing we got 12 bad drives, none of which
register as bad to the EonStor.

Has anybody seen something like this? Any pointers on how to debug it?
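
In case it helps, here's roughly what I was planning to try next (a
sketch; I'm not sure smartctl can see through the EonStor's SCSI
front-end, hence the raw dd reads as a fallback):

        # clear the error counters, then force a full pass over the pool
        zpool clear pil
        zpool scrub pil
        zpool status -v pil    # -v lists files with unrecoverable errors

        # read each LUN end-to-end to see if raw reads also misbehave
        dd if=/dev/da0 of=/dev/null bs=1m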

-- 
Kenneth Schmidt
pil.dk


