Re: Sudden zpool checksums errors
- In reply to: Andrea Venturoli : "Sudden zpool checksums errors"
Date: Sat, 05 Apr 2025 10:40:16 UTC
On 04/04/2025 16:42, Andrea Venturoli wrote:
> Hello.
>
> I've got a box with two zpools:
> _ 1 mirror on 2 SSDs;
> _ 1 raidz1 on 12 HDDs.
>
> Suddenly one daily run showed the following:
>>   pool: backup
>>  state: ONLINE
>> status: One or more devices has experienced an unrecoverable error.  An
>>         attempt was made to correct the error.  Applications are unaffected.
>> action: Determine if the device needs to be replaced, and clear the errors
>>         using 'zpool clear' or replace the device with 'zpool replace'.
>>    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
>>   scan: scrub repaired 3.18M in 16:53:16 with 0 errors on Tue Apr  1 20:16:55 2025
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         backup      ONLINE       0     0     0
>>           raidz1-0  ONLINE       0     0     0
>>             da4     ONLINE       0     0     0
>>             da10    ONLINE       0     0     0
>>             da5     ONLINE       0     0    57
>>             da2     ONLINE       0     0     0
>>             da8     ONLINE       0     0    25
>>             da0     ONLINE       0     0     0
>>             da1     ONLINE       0     0    49
>>             da12    ONLINE       0     0     8
>>             da6     ONLINE       0     0     6
>>             da11    ONLINE       0     0     0
>>             da9     ONLINE       0     0    56
>>             da13    ONLINE       0     0    73
>>
>> errors: No known data errors
>
>

Assuming you've checked the logs etc. as you say, I'd be suspicious of
the HBA and cabling, and presumably a SAS expander.

But IME it's well worth testing the drives. Just dd them to /dev/null
and see if anything squawks. There's nothing stopping you doing this on
a live ZFS pool, although maybe do them one at a time if the array is
busy :-)

Given the nature of SCSI, you may find the only indication that a drive
isn't 100% is an unusually slow read rate.

I agree it would be a coincidence if 50% of the drives were flaky, but
it does happen, or it might be that they're on one flaky HBA connecting
half of them. I can't help being drawn to the fact it's exactly half
that are throwing errors.

Anyway, checking the drives out by reading is minimal effort before
diving into more esoteric reasons. ZFS isn't as good as people think
about detecting failing drives until they're actually on fire (see my
posts passim on this matter).

Regards, Frank.
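
For anyone who would rather have the read rate reported per drive than watch dd by eye, the read test described above can be sketched roughly as below. This is only an illustration, not a tested tool: the function name read_throughput is made up for this example, the device paths are whatever you pass on the command line (for instance /dev/da5 and /dev/da13 from the listing above), and what counts as "unusually slow" is left to a comparison against the other drives of the same model. It only ever reads, and reading raw devices normally needs root.

```python
import sys
import time


def read_throughput(path, block_size=1024 * 1024, max_bytes=None):
    """Read `path` sequentially, discard the data, and time it.

    Returns (bytes_read, seconds, MiB_per_second). Reading stops at end of
    device/file, or once at least `max_bytes` have been read if that is set.
    """
    total = 0
    start = time.monotonic()
    # buffering=0 gives an unbuffered raw file object, so each read() maps
    # to one read request of up to block_size bytes.
    with open(path, "rb", buffering=0) as dev:
        while max_bytes is None or total < max_bytes:
            chunk = dev.read(block_size)
            if not chunk:  # end of device or file
                break
            total += len(chunk)
    elapsed = time.monotonic() - start
    rate = (total / (1024 * 1024)) / elapsed if elapsed > 0 else 0.0
    return total, elapsed, rate


if __name__ == "__main__":
    # Drives are read one after another, never in parallel, which keeps the
    # extra load on a busy pool down.
    for device in sys.argv[1:]:
        try:
            nbytes, secs, mib_s = read_throughput(device)
        except OSError as exc:
            # An I/O error here is the drive "squawking".
            print(f"{device}: read failed: {exc}")
            continue
        print(f"{device}: {nbytes} bytes in {secs:.1f} s ({mib_s:.1f} MiB/s)")
```

Saved under a name of your choosing and run as root with the device nodes as arguments, it prints one line per drive. A drive that errors out, or that reads much more slowly than its neighbours, is the one to look at first. Passing max_bytes limits the test to the start of each disk if a full pass over every drive takes too long.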