Re: Sudden zpool checksums errors

From: Andrea Venturoli <ml_at_netfence.it>
Date: Mon, 07 Apr 2025 15:15:02 UTC
On 4/7/25 15:07, mike tancsa wrote:

> What does the smartctl -a /dev/da# show for the temperatures of 
> the hard drives ? 

Temperatures vary between drives (probably due to their slot position in 
the chassis): over the last month, the coldest one averaged 30C with a 
max of 35C; the hottest averaged 39C, with a peak of 48C.
There does not seem to be a correlation between temperatures and errors 
(some drives that gave errors are colder than others that didn't).



> Does smartctl -x show any interesting log entries for 
> the drives that threw errors vs the ones that did not ?

All "non-error" drives report:
SCT Error Recovery Control:
            Read: Disabled
           Write: Disabled
All "error" drives report:
SCT Error Recovery Control:
            Read:    655 (65.5 seconds)
           Write:    670 (67.0 seconds)

I wonder if this could be the culprit...
I guess I should either enable or disable it consistently on all drives; 
however, I've been reading mixed opinions on whether enabling it is good 
or bad for ZFS.

Any suggestions?
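
For comparing the SCT ERC setting across many drives at once, something 
like the following might help. It is only a minimal sketch: it assumes the 
"SCT Error Recovery Control" section looks like the excerpts above, and the 
helper names (parse_scterc, group_by_erc, read_scterc) as well as the device 
names are made up for illustration.

```python
import re
import subprocess


def parse_scterc(text):
    """Extract the SCT ERC timers from smartctl output.

    Returns a dict such as {'read': 65.5, 'write': 67.0}, with values in
    seconds; None means the timer is reported as Disabled.
    Assumes the layout shown in the excerpts above.
    """
    result = {}
    in_section = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("SCT Error Recovery Control"):
            in_section = True
            continue
        if not in_section:
            continue
        m = re.match(r"(Read|Write):\s+(Disabled|\d+)", stripped)
        if m:
            value = m.group(2)
            # smartctl prints the timer in tenths of a second (655 -> 65.5 s)
            result[m.group(1).lower()] = (
                None if value == "Disabled" else int(value) / 10.0
            )
            if len(result) == 2:
                break
    return result


def group_by_erc(outputs):
    """Group device names by their (read, write) ERC setting.

    'outputs' maps a device name to the smartctl text collected for it.
    """
    groups = {}
    for dev, text in outputs.items():
        erc = parse_scterc(text)
        groups.setdefault((erc.get("read"), erc.get("write")), []).append(dev)
    return groups


def read_scterc(dev):
    """Run 'smartctl -l scterc' on one device (hypothetical path, needs root)."""
    proc = subprocess.run(
        ["smartctl", "-l", "scterc", dev],
        capture_output=True, text=True, check=False,
    )
    return proc.stdout
```

Feeding it the output of every drive, e.g. 
group_by_erc({d: read_scterc(d) for d in ("/dev/da0", "/dev/da1")}), would 
give one list of drives per distinct setting, which can then be compared 
against the list of drives that logged checksum errors.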



"Errored" drives show a few "Resets Between Cmd Acceptance and 
Completion", "Number of Hardware Resets", "Number of ASR Events", 
"Transition from drive PhyRdy to drive PhyNRdy" and "Device-to-host 
register FISes sent due to a COMRESET".

Due to my ignorance I cannot tell what might be the cause and what the 
effect :(
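
One way to narrow down cause and effect might be to snapshot the link 
counters now and compare them again after the next batch of checksum 
errors, to see whether the resets grow at the same time. The sketch below 
is an assumption-laden example: it assumes the "SATA Phy Event Counters" 
rows in smartctl -x output have the form "ID Size Value Description" with a 
four-digit hex ID, and the function names are hypothetical.

```python
import re

# One row of the SATA Phy Event Counters table, assumed to look like:
#   0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
# A trailing '+' after the value (counter overflow marker) is tolerated.
PHY_ROW = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\+?\s+(.+)$")


def parse_phy_counters(text):
    """Return {description: value} for every Phy event counter row found."""
    counters = {}
    for line in text.splitlines():
        m = PHY_ROW.match(line.strip())
        if m:
            counters[m.group(4).strip()] = int(m.group(3))
    return counters


def counter_deltas(before, after):
    """Return only the counters whose value changed between two snapshots."""
    deltas = {}
    for name, value in after.items():
        change = value - before.get(name, 0)
        if change != 0:
            deltas[name] = change
    return deltas
```

Saving the smartctl -x output of each drive to a file on a schedule and 
running counter_deltas() on two consecutive snapshots would show which 
counters moved in the interval where ZFS reported errors.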




  bye & Thanks
	av.