Re: Sudden zpool checksums errors

From: Frank Leonhardt <freebsd-doc_at_fjl.co.uk>
Date: Sat, 05 Apr 2025 10:40:16 UTC
On 04/04/2025 16:42, Andrea Venturoli wrote:
> Hello.
>
> I've got a box with two zpools:
> _ 1 mirror on 2 SSDs;
> _ 1 raidz1 on 12 HDDs.
>
> Suddenly one daily run showed the following:
>>  pool: backup
>>  state: ONLINE
>> status: One or more devices has experienced an unrecoverable error.  An
>>     attempt was made to correct the error.  Applications are unaffected.
>> action: Determine if the device needs to be replaced, and clear the 
>> errors
>>     using 'zpool clear' or replace the device with 'zpool replace'.
>>    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
>>   scan: scrub repaired 3.18M in 16:53:16 with 0 errors on Tue Apr  1 
>> 20:16:55 2025
>> config:
>>
>>     NAME        STATE     READ WRITE CKSUM
>>     backup      ONLINE       0     0     0
>>       raidz1-0  ONLINE       0     0     0
>>         da4     ONLINE       0     0     0
>>         da10    ONLINE       0     0     0
>>         da5     ONLINE       0     0    57
>>         da2     ONLINE       0     0     0
>>         da8     ONLINE       0     0    25
>>         da0     ONLINE       0     0     0
>>         da1     ONLINE       0     0    49
>>         da12    ONLINE       0     0     8
>>         da6     ONLINE       0     0     6
>>         da11    ONLINE       0     0     0
>>         da9     ONLINE       0     0    56
>>         da13    ONLINE       0     0    73
>>
>> errors: No known data errors
>
>
Assuming you've checked the logs etc. as you say, I'd be suspicious of the 
HBA and cabling, and presumably a SAS expander. But IME it's well worth 
testing the drives first. Just dd them to /dev/null and see if anything 
squawks. There's nothing stopping you doing this on a live ZFS pool, 
although maybe do them one at a time if the array is busy :-)
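
Something like this should do it (using da5 as an example; substitute 
each device in turn):

  dd if=/dev/da5 of=/dev/null bs=1m

A big block size keeps the read sequential. dd stops on a read error by 
default, and any media problem should also make the kernel complain in 
dmesg.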

Given the way SCSI drives handle error recovery internally, you may find 
the only indication that a drive isn't 100% is an unusually slow read rate.
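
On FreeBSD you can hit Ctrl+T while dd is running (it sends SIGINFO) to 
see the current transfer rate, or add status=progress:

  dd if=/dev/da5 of=/dev/null bs=1m status=progress

Compare the figures across the drives; one that's markedly slower than 
its siblings is probably doing a lot of internal retrying.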

I agree it would be a coincidence if half the drives were flaky, but it 
does happen; or it might be they're on one flaky HBA connecting half of 
them. I can't help being drawn to the fact that it's roughly half of them 
throwing errors.
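
It only takes a minute to map the devices to their controllers, e.g.:

  camcontrol devlist -v

which (if memory serves) lists each daN alongside its scbusN and the 
adapter behind it, so you can see whether the drives throwing checksum 
errors all hang off the same HBA or expander leg.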

Anyway, checking the drives out with a simple read is minimal effort 
before diving into more esoteric explanations.
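
If you have smartmontools installed (sysutils/smartmontools in ports), 
pulling the counters is quick too:

  smartctl -a /dev/da5

On SAS drives the grown defect list and the read/write error counter 
logs often creep up well before the drive reports anything to the host.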

ZFS isn't as good as people think about detecting failing drives until 
they're actually on fire (see my posts passim on this matter).

Regards, Frank.