Checksum errors across ZFS array

Dr Joe Karthauser joe at tao.org.uk
Thu Jul 19 17:12:54 UTC 2012


Hi James,

It's almost certainly a memory problem. I'd replace the RAM ASAP if I were you.

I lost about 70 MB from my ZFS pool for this very reason just a few weeks ago. Luckily I had enough snapshots from before the rot set in to recover most of what I lost.

Joe

-- 
Dr Joe Karthauser

On 19 Jul 2012, at 16:29, James Snow <snow at teardrop.org> wrote:

> I have a ZFS server on which I've seen periodic checksum errors on
> almost every drive. While scrubbing the pool last night, it began to
> report unrecoverable data errors on a single file.
> 
> I compared an md5 of the supposedly corrupted file to an md5 of the
> original copy, stored on different media. They were the same, suggesting
> no corruption.
> 
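
The comparison described above can be sketched in Python roughly as follows. This is only an illustration: the function names and any paths passed in are placeholders, not taken from the original post.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hash to the same MD5 digest."""
    return md5_of(path_a) == md5_of(path_b)
```

A matching digest only says that the bytes returned by that particular read agreed with the reference copy. It does not rule out a later or intermittent fault.
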
> A large file was being written to the pool while the scrub was in
> progress, and the entire array became unresponsive. The OS was still up,
> but 'zpool status' showed the scrub progress stuck at the same spot,
> with the throughput rate falling. 'shutdown -r now' stalled. Eventually
> I hard power cycled the system.
> 
> Now, attempting to read the file that ZFS reports errors on yields
> "Input/output error." The scrub completed, with the following result:
> 
>        NAME         STATE     READ WRITE CKSUM
>        tank         ONLINE       0     0     7
>          mirror-0   ONLINE       0     0     0
>            aacd0p1  ONLINE       0     0     0
>            aacd4p1  ONLINE       0     0     1
>          mirror-1   ONLINE       0     0     0
>            aacd1p1  ONLINE       0     0     0
>            aacd5p1  ONLINE       0     0     0
>          mirror-2   ONLINE       0     0    14
>            aacd2p1  ONLINE       0     0    14
>            aacd6p1  ONLINE       0     0    14
>          mirror-3   ONLINE       0     0     0
>            aacd3p1  ONLINE       0     0     0
>            aacd7p1  ONLINE       0     0     0
> 
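
To make the pattern in the table above easier to see, here is a small Python sketch that parses the pasted output and lists the drives with non-zero CKSUM counts. It assumes the five-column layout and plain integer counts exactly as quoted; it is not a general 'zpool status' parser.

```python
STATUS = """\
       NAME         STATE     READ WRITE CKSUM
       tank         ONLINE       0     0     7
         mirror-0   ONLINE       0     0     0
           aacd0p1  ONLINE       0     0     0
           aacd4p1  ONLINE       0     0     1
         mirror-1   ONLINE       0     0     0
           aacd1p1  ONLINE       0     0     0
           aacd5p1  ONLINE       0     0     0
         mirror-2   ONLINE       0     0    14
           aacd2p1  ONLINE       0     0    14
           aacd6p1  ONLINE       0     0    14
         mirror-3   ONLINE       0     0     0
           aacd3p1  ONLINE       0     0     0
           aacd7p1  ONLINE       0     0     0
"""


def cksum_errors(status_text):
    """Map each device name to its CKSUM count, skipping the header row."""
    counts = {}
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, _state, _read, _write, cksum = fields
        counts[name] = int(cksum)
    return counts


errors = cksum_errors(STATUS)
# Keep only the leaf devices (the aacd* partitions) that reported errors.
affected = sorted(n for n, c in errors.items() if c and n.startswith("aacd"))
print(affected)
```
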
> The system configuration is as follows:
> 
> Controller:  Adaptec 2805 
> Motherboard: Supermicro X8STE
> Drive Cage:  2x Supermicro CSE-M35T-1
> Memory:      2x Kingston 12GB ECC (KVR1066D3E7SK3/12G)
> PSU:         Nexus RX-7000
> OS:          9.0-RELEASE-p3
> ZFS:         ZFS filesystem version 5, ZFS storage pool version 28
> 
> 
> The Adaptec card has 2 ports, each of which uses a 4-port fan-out cable.
> The cables are routed as shown:
> 
>      /--- aacd0 (ST1000DM003-9YN1 CC4D)
>     / /-- aacd1 (ST1000DM003-9YN1 CC4D)
> p1-----
>     \ \-- aacd2 (WDC WD1001FALS-0 05.0)
>      \--- aacd3 (WDC WD1001FALS-0 05.0)
> 
>      /--- aacd4 (ST1000DM003-9YN1 CC4D)
>     / /-- aacd5 (ST1000DM003-9YN1 CC4D)
> p2-----
>     \ \-- aacd6 (WDC WD1002FAEX-0 05.0)
>      \--- aacd7 (WDC WD1002FAEX-0 05.0)
> 
> You can see that each ZFS mirror device is composed of one drive from
> each drive carrier, on separate ports, on separate cables.
> 
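
The layout claim above can be checked mechanically. The Python sketch below encodes the cable diagram and the mirror membership from the scrub output, then confirms that every mirror spans both controller ports and shows which ports carried drives with checksum errors. The dictionary and function names are made up for illustration.

```python
# Port assignments read off the cable diagram in the post.
PORT_OF = {f"aacd{i}": "p1" for i in range(4)}
PORT_OF.update({f"aacd{i}": "p2" for i in range(4, 8)})

# Mirror membership as shown in the scrub output.
MIRRORS = {
    "mirror-0": ("aacd0", "aacd4"),
    "mirror-1": ("aacd1", "aacd5"),
    "mirror-2": ("aacd2", "aacd6"),
    "mirror-3": ("aacd3", "aacd7"),
}

# Drives with non-zero CKSUM counts in the scrub output.
CKSUM = {"aacd4": 1, "aacd2": 14, "aacd6": 14}


def ports_used(drives):
    """Return the set of controller ports that the given drives hang off."""
    return {PORT_OF[d] for d in drives}


# Every mirror should have one member on each port.
split_ok = all(len(ports_used(m)) == 2 for m in MIRRORS.values())
# Ports that carried at least one drive reporting checksum errors.
ports_with_errors = ports_used(CKSUM)
print(split_ok, sorted(ports_with_errors))
```

With the data from the post, errors show up behind both ports, so no single fan-out cable is common to all of them.
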
> Since I have seen periodic checksum errors on almost every drive but the
> only common components are the Adaptec controller and the motherboard, I
> suspect the controller. (Or the motherboard, but I'm starting with the
> controller since it's much simpler to swap out.)
> 
> Could it be something else? What else should I be looking at? Any input
> would be greatly appreciated.
> 
> 
> -Snow