Re: Sudden zpool checksums errors

From: mike tancsa <mike_at_sentex.net>
Date: Mon, 07 Apr 2025 13:07:11 UTC

On 4/5/2025 5:01 AM, Andrea Venturoli wrote:
> On 4/4/25 20:59, Dave Cottlehuber wrote:
>
>
> Thanks to all.
> I'll answer here collectively.
>
>> I have had marginal power supplies, backplane issues, or breakout 
>> cables from the controller manifest errors like that.  I would check 
>> the power supply first, the backplane next, and the controller third.
>
> How would I go about this? How do I check these components?
> Does IPMI provide something useful?
>
Try ipmitool sensor for the current readings; ipmitool sel list will 
show the errors actually logged.  What does smartctl -a /dev/da# show 
for the temperatures of the hard drives?  Does smartctl -x show any 
interesting log entries for the drives that threw errors versus the ones 
that did not?
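
To compare the drives side by side, something like the following might help. It is only a rough sketch: temp_lines and report_temps are names I made up, and the device paths in the usage line are placeholders for whatever da# devices are in your pool. It assumes smartctl (from smartmontools) is installed.

```shell
#!/bin/sh
# Print the temperature-related lines from `smartctl -a` output read on stdin.
# SAS drives usually report "Current Drive Temperature", SATA drives usually
# report attribute 194 Temperature_Celsius; a case-insensitive match on
# "temperature" catches both.
temp_lines() {
    grep -i 'temperature'
}

# Hypothetical helper: print a header and the temperature lines for each
# device given as an argument.
report_temps() {
    for dev in "$@"; do
        echo "== $dev"
        smartctl -a "$dev" | temp_lines
    done
}

# Usage (placeholder device names, run as root):
#   report_temps /dev/da0 /dev/da1
```

Running it against both the drives that logged checksum errors and the ones that did not makes a temperature difference easy to spot.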

>> - actually really bad ECC memory
>
> Any way to test?
>
memtest will help a bit.  But if it is ECC memory, errors typically do 
get logged by the BMC, and ipmitool sel list will typically show them.
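
If the SEL is long, a filter along these lines can pull out just the memory entries. This is a sketch under assumptions: sel_memory_events is a made-up name, and the exact wording of memory events varies by BMC vendor, so the pattern (anything mentioning "memory" or "ecc") may need adjusting for your board.

```shell
#!/bin/sh
# Keep only SEL lines that mention memory or ECC, case-insensitively.
# Reads `ipmitool sel list` output on stdin.
# Assumption: the BMC labels these events with "Memory" and/or "ECC".
sel_memory_events() {
    grep -iE 'memory|ecc'
}

# Usage (as root, with the ipmi kernel module loaded):
#   ipmitool sel list | sel_memory_events
```

No output means nothing matched the pattern, which is not quite the same as proof that the memory is fine.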

>
>
>> Does ipmitool sel list show anything, by the way? (kldload ipmi and 
>> pkg install ipmitool if you don't have it already.)
>
>> # ipmitool sel list
>>    1 | 05/06/24 | 18:16:23 CEST | Temperature #0xcc | Upper Non-critical going high | Asserted
>>    2 | 05/06/24 | 21:25:42 CEST | Temperature #0xcc | Upper Critical going high | Asserted
>>    3 | 05/07/24 | 15:49:00 CEST | Temperature #0xcc | Upper Critical going high | Deasserted
>>    4 | 05/07/24 | 16:00:43 CEST | Temperature #0xcc | Upper Non-critical going high | Deasserted
>>    5 | 06/13/24 | 11:54:52 CEST | Drive Slot / Bay #0x77 | Drive Present | Asserted
>>    6 | 06/13/24 | 11:55:24 CEST | Drive Slot / Bay #0x73 | Drive Present | Asserted
>>    7 | 06/13/24 | 14:21:04 CEST | Drive Slot / Bay #0x73 | Drive Present | Deasserted
>>    8 | 06/13/24 | 14:21:04 CEST | Drive Slot / Bay #0x77 | Drive Present | Deasserted
>
>
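
For a quick overview of a listing like the one quoted above, the events can be counted per sensor. A minimal sketch, assuming the usual pipe-separated sel list layout with the sensor in the fourth column; sel_count_by_sensor is a made-up name.

```shell
#!/bin/sh
# Count SEL entries per sensor. Reads `ipmitool sel list` output on stdin
# and assumes the pipe-separated layout: id | date | time | sensor | event | state.
sel_count_by_sensor() {
    awk -F'|' 'NF >= 4 {
        s = $4
        gsub(/^[ \t]+|[ \t]+$/, "", s)   # trim padding around the sensor name
        count[s]++
    }
    END { for (s in count) printf "%d %s\n", count[s], s }' | sort -rn
}

# Usage:
#   ipmitool sel list | sel_count_by_sensor
```

A sensor that shows up far more often than the rest is a reasonable place to start looking.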