Re: Sudden zpool checksums errors

From: David Christensen <dpchrist_at_holgerdanske.com>
Date: Mon, 07 Apr 2025 19:32:39 UTC

On 4/7/25 08:15, Andrea Venturoli wrote:
> On 4/7/25 15:07, mike tancsa wrote:
> 
>> What does the smartctl -a /dev/da# show for the temperatures of the 
>> hard drives ? 
> 
> Temperatures vary between drives (probably due to their slot position in 
> the chassis): over the last month, the coldest one averaged 30C with a 
> max of 35C; the hottest averaged 39C, with a peak of 48C.
> There does not seem to be a correlation between temperatures and errors 
> (some drives that gave errors are colder than others that didn't).


I have been running Seagate 3 TB Barracuda and Constellation drives for 
several years.  When the drive temperatures get above approximately 40 
C, `zpool status` and/or `smartctl -x` start showing internal drive 
errors.  Fixes include adding fans, increasing fan RPM, and removing or 
rearranging drives to improve airflow.  There is no substitute for good 
cooling.
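
A minimal sketch of how that temperature check could be scripted, assuming 
the drives report ATA attribute 194 (Temperature_Celsius) or 190 
(Airflow_Temperature_Cel) in the `smartctl -A` attribute table.  The 
function names, the sample text, and the 40 C cutoff are illustrative 
assumptions, not measurements from any real drive:

```python
import subprocess

# Rough cutoff from the observation above; an assumption, tune per drive model.
WARN_CELSIUS = 40

# Made-up sample of a `smartctl -A` attribute table, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   055   055   000    Old_age   Always       -       39874
190 Airflow_Temperature_Cel 0x0022   059   045   045    Old_age   Always       -       41 (Min/Max 30/48)
194 Temperature_Celsius     0x0022   041   055   000    Old_age   Always       -       41 (0 18 0 0 0)
"""


def drive_temperature(smartctl_output):
    """Return the temperature in Celsius from `smartctl -A` text, or None."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            found.setdefault(fields[1], int(fields[9]))
    # Prefer attribute 194, fall back to 190 if that is all the drive reports.
    for name in ("Temperature_Celsius", "Airflow_Temperature_Cel"):
        if name in found:
            return found[name]
    return None


def too_hot(celsius):
    """True when a reading exists and exceeds the warning cutoff."""
    return celsius is not None and celsius > WARN_CELSIUS


def read_temperature(device):
    """Run smartctl on a device such as /dev/da0 (usually needs root)."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return drive_temperature(result.stdout)
```

Calling `too_hot(read_temperature("/dev/da0"))` from cron for each drive 
would flag a hot drive before `zpool status` starts showing errors.  SAS 
drives print temperature in a different format, so this parser would need 
adjusting for them.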


>> Does smartctl -x show any interesting log entries for the drives that 
>> threw errors vs the ones that did not ?
> 
> All "non-error" drives report:
> SCT Error Recovery Control:
>             Read: Disabled
>            Write: Disabled
> 
> All "error" drives report:
> SCT Error Recovery Control:
>             Read:    655 (65.5 seconds)
>            Write:    670 (67.0 seconds)
> 
> I wonder if this could be the culprit...
> I guess I should enable or disable it on all drives; however I've been 
> reading mixed opinions on whether this is good or bad for ZFS.
> 
> Any suggestion?
> 
> 
> 
> "Errored" drives show a few "Resets Between Cmd Acceptance and 
> Completion", "Number of Hardware Resets", "Number of ASR Events", 
> "Transition from drive PhyRdy to drive PhyNRdy" and "Device-to-host 
> register FISes sent due to a COMRESET".
> 
> Due to my ignorance I cannot tell what might be the cause and what the 
> effect :(


I have played with SCT settings in the past, but the smartctl statistics 
you cite make me think there are external connection problems.  I think 
your best bet at this point is to re-seat all the disk drive related 
expansion cards, power cables, data cables, backplanes, drives, etc. 
Even gold plated connections degrade over the years.  Vacuum everything, 
especially heat sinks.  While vacuuming, do not allow fans to spin -- 
they can become generators.  If you are using non-locking, 1.5 Gbps, 
and/or 3 Gbps SATA cables, buy and install good quality locking 6 Gbps 
SATA cables.  Beware that some red insulation dye can corrode the copper 
conductors inside (I cut open a bad red SATA cable and found corroded 
powder instead of copper metal).  I now buy black SATA cables.  Bundle 
and dress all power cables to facilitate air flow.  Bundle and dress 
signal cables separately.
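
To check whether re-seating or replacing cables helped, the link-level 
counters can be compared before and after.  Below is a minimal sketch, 
assuming the table layout printed by `smartctl -l sataphy` (ID, Size, 
Value, Description); the function name and the sample values are made up 
for illustration:

```python
import re

# One counter row: hex ID, size in bytes, value, free-text description.
_ROW = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(.+)$")

# Made-up sample of `smartctl -l sataphy` output, for illustration only.
SAMPLE = """\
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0009  2            4  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
"""


def nonzero_phy_counters(text):
    """Return {description: value} for every counter with a nonzero value."""
    counters = {}
    for line in text.splitlines():
        match = _ROW.match(line.strip())
        if match is None:
            continue  # title, column header, or blank line
        value = int(match.group(3))
        if value > 0:
            counters[match.group(4).strip()] = value
    return counters
```

Saving the result for each drive before the hardware work and again after 
a scrub shows whether the counters are still climbing.  A drive whose 
counters keep rising after the cables and backplane have been re-seated 
points at the drive, port, or power rather than the connection.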


David