Re: Sudden zpool checksums errors

From: Dave Cottlehuber <dch_at_skunkwerks.at>
Date: Fri, 04 Apr 2025 18:59:35 UTC
On Fri, 4 Apr 2025, at 15:42, Andrea Venturoli wrote:
> Hello.
> I'm finding it hard to believe that 7 disks out of 12 are failing or 
> just happened to misbehave all on the same day.
> BTW, SMART says they are OK.

Not saying it's not ZFS, but it's probably not ZFS... fingers crossed!

> I'm reluctant to blame RAM (since it's ECC) and power supply (as it's 
> redundant 2x800W).

If it's memory, and your mainboard supports it, you'll see failures in dmesg
as MCA (Machine Check Architecture) errors. Some good examples:

https://lists.freebsd.org/pipermail/freebsd-hackers/2015-January/046878.html
https://forums.freebsd.org/threads/mca-errors.88909/
https://forums.freebsd.org/threads/solved-weird-mca-errors.94800/
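If you have a saved dmesg or /var/log/messages to go through, something like
this rough sketch would pull out the machine-check records. It assumes the
kernel prefixes those lines with "MCA:", as in the threads linked above; the
function name and the sample text are made up for illustration.

```python
def find_mca_lines(dmesg_text):
    """Return the lines of a dmesg dump that look like machine-check records."""
    # Assumption: machine-check records start with the prefix "MCA:".
    return [
        line
        for line in dmesg_text.splitlines()
        if line.lstrip().startswith("MCA:")
    ]


# Placeholder sample, NOT real output from any machine.
sample_dmesg = """\
CPU: placeholder line
MCA: placeholder record one
da0: placeholder disk line
MCA: placeholder record two
"""

for line in find_mca_lines(sample_dmesg):
    print(line)
```

An empty result only means nothing was logged; a board that does not report
machine checks will stay silent even with bad memory.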

> Disks are 16TB TOSHIBA MG09ACA1 connected to a MegaRAID SAS-3 3108 (of 
> course not operating as RAID and with mrsas driver).

Look for SCSI or CAM errors in your logs too, especially disconnects.
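A quick way to see whether those errors cluster on particular devices is to
tally them. This is only a sketch: it assumes CAM error lines have the shape
"(da3:mrsas0:0:3:0): CAM status: <text>", which may not match every driver or
release, and the sample log below is invented.

```python
import re
from collections import Counter

# Assumed line shape: "(<device>:<path...>): CAM status: <text>"
CAM_STATUS = re.compile(r"\((\w+):[^)]*\): CAM status: (.+)")


def count_cam_errors(log_text):
    """Count (device, status text) pairs among CAM status lines in a log dump."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CAM_STATUS.search(line)
        if match:
            counts[(match.group(1), match.group(2).strip())] += 1
    return counts


# Invented sample, NOT real output from any machine.
sample_log = """\
(da3:mrsas0:0:3:0): CAM status: Command timeout
(da3:mrsas0:0:3:0): Retrying command
(da3:mrsas0:0:3:0): CAM status: Command timeout
(da7:mrsas0:0:7:0): CAM status: SCSI Status Error
"""

for (device, status), n in sorted(count_cam_errors(sample_log).items()):
    print(device, status, n)
```

If nearly every device behind the same controller shows up, that points at the
controller, cabling or power rather than at the individual disks.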
 
I have seen storms of checksum errors in at least these situations:

- faulty or failing storage / scsi controller
- insufficient power (or failing power supplies) under load
- overclocking
- overheating on mainboard, or controller, or drives
- actually really bad ECC memory
- drive cables that have worked loose over time
- over 50 disks failing within 2 days in a 200+ disk array
- all disks failing within 20 days of deployment in 24 disk chassis

Sometimes, vendors produce batches of Bad Disks - firmware bugs, physical
defects, unexpected dust inside the sealed drive enclosure. Failures are far more
correlated than you'd want to believe. External vibrations can cause
problems.
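
To judge how correlated the failures are, it helps to list which devices
actually report checksum errors. A minimal sketch, assuming you paste in the
text of `zpool status` and that its config table has the usual
NAME STATE READ WRITE CKSUM columns; the function name and the sample pool
below are hypothetical.

```python
def devices_with_cksum_errors(status_text):
    """Return {name: cksum_field} for config rows whose CKSUM column is not "0"."""
    result = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True  # header row: device rows follow
            continue
        if in_config:
            if len(fields) < 5:
                in_config = False  # blank line or shorter row ends the table
                continue
            name, cksum = fields[0], fields[4]
            # Kept as a string on purpose: large counts may be abbreviated.
            if cksum != "0":
                result[name] = cksum
    return result


# Hypothetical sample, NOT output from the pool discussed in this thread.
sample_status = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da0     ONLINE       0     0    12
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     7

errors: No known data errors
"""

print(devices_with_cksum_errors(sample_status))
```

Comparing that list against which disks share a controller port, backplane or
power rail is usually more telling than the raw counts.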

A slow process of upgrading firmware, checking each component, and reseating
all cables is the best way to deal with this.

A+
Dave