Re: Sudden zpool checksums errors
- Reply: mike tancsa : "Re: Sudden zpool checksums errors"
- Reply: Andrea Venturoli : "Re: Sudden zpool checksums errors"
- In reply to: Andrea Venturoli : "Sudden zpool checksums errors"
Date: Fri, 04 Apr 2025 18:59:35 UTC
On Fri, 4 Apr 2025, at 15:42, Andrea Venturoli wrote:
> Hello.
> I'm finding it hard to believe that 7 disks out of 12 are failing or
> just happened to misbehave all on the same day.
> BTW, SMART says they are OK.

I'm not saying it's not ZFS, but it's probably not ZFS... fingers crossed!

> I'm reluctant to blame RAM (since it's ECC) and power supply (as it's
> redundant 2x800W).

If it's memory, and your mainboard supports it, you'll see failures in
dmesg, reported as MCA (Machine Check Architecture) errors. Some good
examples:

https://lists.freebsd.org/pipermail/freebsd-hackers/2015-January/046878.html
https://forums.freebsd.org/threads/mca-errors.88909/
https://forums.freebsd.org/threads/solved-weird-mca-errors.94800/

> Disks are 16TB TOSHIBA MG09ACA1 connected to a MegaRAID SAS-3 3108 (of
> course not operating as RAID and with mrsas driver).

Look for SCSI or CAM errors in your logs too, as well as disconnects.

I have seen storms of checksum errors in at least these situations:

- faulty or failing storage / SCSI controller
- insufficient power (or failing power supplies) under load
- overclocking
- overheating on mainboard, or controller, or drives
- actually really bad ECC memory
- drive cables that have worked loose over time
- over 50 disks failing within 2 days in a 200+ disk array
- all disks failing within 20 days of deployment in a 24-disk chassis

Sometimes vendors produce batches of bad disks: firmware bugs, physical
defects, unexpected dust inside the sealed platters. Failures are far
more correlated than you'd want to believe. External vibrations can
cause problems.

A slow process of upgrading firmware and checking each component,
reseating all cables, is the best way to deal with this.

A+
Dave
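
As a rough illustration of the log check suggested above, here is a minimal Python sketch that counts log lines that look like MCA reports, CAM errors, or device disconnects. The match strings ("MCA:", "CAM status:", "lost device", "detached") are assumptions about what the kernel prints on a typical FreeBSD system, and the function name is made up for this example; adjust both to whatever your own dmesg or /var/log/messages actually shows.

```python
import re
import sys
from collections import Counter

# Assumed substrings, not an exhaustive or authoritative list.
# Check a few real lines from your own logs and tune these.
PATTERNS = {
    "mca": re.compile(r"\bMCA:"),
    "cam": re.compile(r"CAM status:"),
    "detach": re.compile(r"lost device|detached", re.IGNORECASE),
}


def count_suspect_lines(lines):
    """Count how many lines match each assumed pattern (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python scan_log.py /var/log/messages
    with open(sys.argv[1], errors="replace") as fh:
        for name, n in sorted(count_suspect_lines(fh).items()):
            print(f"{name}: {n}")
```

A non-zero count only tells you where to start reading. The surrounding lines (which device, which bus, what time) are what help narrow things down to a controller, a cable, or memory.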