Re: Sudden zpool checksums errors

From: Andrea Venturoli <ml_at_netfence.it>
Date: Sat, 05 Apr 2025 09:01:15 UTC
On 4/4/25 20:59, Dave Cottlehuber wrote:


Thanks to all.
I'll answer here collectively.





> I have had marginal power supplies, backplane issues or break out cables from the controller manifest errors like that.  I would check the power supply first, backplane next, controller 3rd.

How would I go about this? How do I check these components?
Does IPMI provide something useful?





> If its memory, and your mainboard supports it, you'll see failures in dmesg,
> MCA ... some good examples:

No such thing.
Either the MB does not support it (is it possible? likely?) or it's not RAM.

> Look for SCSI or CAM errors in your logs too, disconnects.

No such thing either.

> - overclocking

No overclocking.

> - overheating on mainboard, or controller, or drives

I monitor temperatures with Nagios and have received no alarms.

> - actually really bad ECC memory

Any way to test?

> - drive cables that have worked loose over time

The server is quite new (not even a year old), but I can check if needed.

> External vibrations can cause problems.

This is possible, since the building is being expanded and construction 
of a new block is underway.
However, there are four servers which still have hard disks and only 
this one showed the problem.

> A slow process of upgrading firmware

I checked the Toshiba website and found no download; if needed, I'll 
check with the supplier.

Is there a way I can check the controller firmware version via software?
I mean in FreeBSD, without rebooting?
dmesg.boot doesn't say.





> does ipmitool sel list show anything btw ? (kldload ipmi and pkg install ipmitools if you dont have it already) 

> # ipmitool sel list
>    1 | 05/06/24 | 18:16:23 CEST | Temperature #0xcc | Upper Non-critical going high | Asserted
>    2 | 05/06/24 | 21:25:42 CEST | Temperature #0xcc | Upper Critical going high | Asserted
>    3 | 05/07/24 | 15:49:00 CEST | Temperature #0xcc | Upper Critical going high | Deasserted
>    4 | 05/07/24 | 16:00:43 CEST | Temperature #0xcc | Upper Non-critical going high | Deasserted
>    5 | 06/13/24 | 11:54:52 CEST | Drive Slot / Bay #0x77 | Drive Present | Asserted
>    6 | 06/13/24 | 11:55:24 CEST | Drive Slot / Bay #0x73 | Drive Present | Asserted
>    7 | 06/13/24 | 14:21:04 CEST | Drive Slot / Bay #0x73 | Drive Present | Deasserted
>    8 | 06/13/24 | 14:21:04 CEST | Drive Slot / Bay #0x77 | Drive Present | Deasserted

These log entries are from May/June, but the problem I'm talking about 
appeared a few days ago, so they are not related.
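
To rule out older entries like these quickly, the SEL output can be filtered by date. This is only a minimal sketch: `sel_since` is a hypothetical helper name, and it assumes the "N | MM/DD/YY | HH:MM:SS TZ | ..." layout shown above (a four-digit year is also accepted). Other ipmitool versions may print dates differently.

```shell
# Hypothetical helper: print SEL entries dated on or after a cutoff (YYYYMMDD).
# Assumes the date is the second '|'-separated field, formatted MM/DD/YY
# or MM/DD/YYYY; lines without a parsable date are silently dropped.
sel_since() {
    awk -F'|' -v cutoff="$1" '
        {
            date = $2                      # work on a copy so $0 stays intact
            gsub(/ /, "", date)            # strip the padding around the date
            split(date, d, "/")            # d[1]=MM, d[2]=DD, d[3]=YY or YYYY
            year = (length(d[3]) == 2) ? "20" d[3] : d[3]
            if ((year d[1] d[2]) + 0 >= cutoff + 0) print
        }'
}

# Usage on a live system (needs the ipmi kernel module and ipmitool):
#   ipmitool sel list | sel_since 20250301
```

With the listing above and a cutoff of 20250301, nothing would be printed, which matches the conclusion that the SEL holds no recent events.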



  bye & Thanks
	av.