Re: Sudden zpool checksums errors
Date: Sat, 05 Apr 2025 09:01:15 UTC
On 4/4/25 20:59, Dave Cottlehuber wrote:

Thanks to all. I'll answer here collectively.

> I have had marginal power supplies, backplane issues or break out cables from the controller manifest errors like that. I would check the power supply first, backplane next, controller 3rd.

How would I go about this? How do I check these components? Does IPMI provide something useful?

> If its memory, and your mainboard supports it, you'll see failures in dmesg,
> MCA ... some good examples:

No such things. Either the MB does not support it (is it possible? likely?) or it's not RAM.

> Look for SCSI or CAM errors in your logs too, disconnects.

No such thing either.

> - overclocking

No overclocking.

> - overheating on mainboard, or controller, or drives

I monitor temperature with Nagios and received no alarm.

> - actually really bad ECC memory

Any way to test?

> - drive cables that have worked loose over time

Server is quite new (not even a year), but I can eventually check.

> External vibrations can cause problems.

This is possible, since the building is being expanded and construction of a new block is underway. However, there are four servers which still have hard disks and only this one showed the problem.

> A slow process of upgrading firmware

I checked on Toshiba website and found no download; I'll eventually check with the supplier.
Is there a way I can check the controller firmware version via software? I mean in FreeBSD, without rebooting? dmesg.boot doesn't say.

> does ipmitool sel list show anything btw ?
> (kldload ipmi and pkg install ipmitools if you dont have it already)

> # ipmitool sel list
>  1 | 05/06/24 | 18:16:23 CEST | Temperature #0xcc | Upper Non-critical going high | Asserted
>  2 | 05/06/24 | 21:25:42 CEST | Temperature #0xcc | Upper Critical going high | Asserted
>  3 | 05/07/24 | 15:49:00 CEST | Temperature #0xcc | Upper Critical going high | Deasserted
>  4 | 05/07/24 | 16:00:43 CEST | Temperature #0xcc | Upper Non-critical going high | Deasserted
>  5 | 06/13/24 | 11:54:52 CEST | Drive Slot / Bay #0x77 | Drive Present | Asserted
>  6 | 06/13/24 | 11:55:24 CEST | Drive Slot / Bay #0x73 | Drive Present | Asserted
>  7 | 06/13/24 | 14:21:04 CEST | Drive Slot / Bay #0x73 | Drive Present | Deasserted
>  8 | 06/13/24 | 14:21:04 CEST | Drive Slot / Bay #0x77 | Drive Present | Deasserted

Logs are from May/June, but the problem I'm talking about appeared some days ago, so it's not related.

bye & Thanks

av.
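P.S. To rule out the SEL entries by date rather than by eye, something like the following could be used. It is only a rough sketch: the function name sel_since is made up for illustration, and it assumes the date column is always MM/DD/YY (as in the output above) and that every year is 20xx.

```shell
# sel_since CUTOFF
# Reads "ipmitool sel list" style lines on stdin and prints only the
# entries dated on or after CUTOFF (given as YYYYMMDD).
# Assumption: field 2 holds the date as MM/DD/YY and the century is 20xx.
sel_since() {
    awk -F'|' -v cutoff="$1" '
        {
            date = $2                  # work on a copy so the line prints unchanged
            gsub(/ /, "", date)        # strip the padding around the date
            split(date, d, "/")        # d[1]=MM, d[2]=DD, d[3]=YY
            stamp = "20" d[3] d[1] d[2]
            if ((stamp + 0) >= (cutoff + 0)) print
        }'
}

# Usage (hypothetical cutoff date):
#   ipmitool sel list | sel_since 20250401
```

With a cutoff a few days before the checksum errors showed up, none of the eight entries above would be printed, which matches the conclusion that they are unrelated.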