Memory error logged in /var/log/messages

Mon Nov 19 13:38:26 UTC 2018

19.11.2018 20:10, Patrick M. Hausen wrote:

> Hi all,
> 
> one of our production servers, running 11.2p3, is logging this every couple of minutes:
> 
> Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
> Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0
> Nov 19 11:48:06 ph002 kernel: MCA: Misc 0x90010000040188c
> Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
> Nov 19 11:48:06 ph002 kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
> Nov 19 11:48:06 ph002 kernel: MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0
> 
> Address and core vary, but it is always bank 12.
> 
> Applications seem to be unaffected; we use ECC memory, of course.
> 
> Is the OS able to work around these errors and merely notifying us, or is
> in-memory data already getting corrupted?
> 
> I’m at a bit of a loss identifying which DIMM might be the cause so I contacted Supermicro
> support. They answered:
> 
>> We can't really answer this, we do not know how various OS's map the memory slots.
>> Our advice is always to look at IPMI, but if that doesn't log any issues then we're not sure you're looking at a hardware issue.
>>
>> But assuming the OS looks at the ranks of a module as a bank and you use dual rank memory then it should logically point at DIMMC2.
> 
> They are right about the IPMI (I told them when opening the case) - there’s nothing at all
> in the event log.
> 
> Can they be correct that it might not even be a hardware issue?

Use the sysutils/mcelog port (or package) to decode such MCA logs
with the "mcelog --no-dmi --ascii" command. For your logs, it reports:

> Hardware event. This is not a software error.
> CPU 0 BANK 12
> MISC 0 ADDR 0
> MCG status:
> MemCtrl: Corrected patrol scrub error
> STATUS cc00010c000800c3 MCGSTATUS 0
> MCGCAP 7000c16 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 79
> (Fields were incomplete)
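
As a cross-check, the "Family 6 Model 79" line can be reproduced by hand from the ID 0x406f1 in the kernel message. The following is only a minimal sketch, assuming the ID is the raw CPUID signature (EAX of CPUID leaf 1) and using the usual Intel rules for combining the base and extended fields; the function name is made up for illustration.

```python
def decode_cpuid_signature(sig):
    """Split a CPUID leaf-1 EAX signature into (family, model, stepping)."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


if __name__ == "__main__":
    family, model, stepping = decode_cpuid_signature(0x406F1)
    print(f"Family {family} Model {model} Stepping {stepping}")
```

For 0x406f1 this gives family 6 and model 79, which agrees with the mcelog output above.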

This looks like a hardware memory error that was corrected by ECC, so there is no data corruption (yet).
You had better replace the module in BANK 12 of CPU 0.
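
If you want to see why the status word reads as a corrected error, the flag bits can be checked by hand. Below is a rough sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (flags in bits 63..57, MCA error code in bits 15:0, model-specific code in bits 31:16) and the memory-controller error code format 0000 0000 1MMM CCCC. The function and table names are made up for illustration, and the model-specific code is only extracted, not interpreted.

```python
# Bit positions of the architectural flags in IA32_MCi_STATUS.
FLAGS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
         "MISCV": 59, "ADDRV": 58, "PCC": 57}

# MMM field of a memory-controller error code (0000 0000 1MMM CCCC).
MEM_OPS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mci_status(status):
    """Return the flag bits and error-code fields of an MCi_STATUS value."""
    out = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    code = status & 0xFFFF
    out["mca_code"] = code
    out["model_specific_code"] = (status >> 16) & 0xFFFF
    # Memory-controller errors have bits 15:8 clear and bit 7 set.
    if (code & 0xFF80) == 0x0080:
        out["mem_op"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        # 0xF means "channel not specified".
        out["channel"] = None if channel == 0xF else channel
    return out


if __name__ == "__main__":
    for key, value in decode_mci_status(0xCC00010C000800C3).items():
        print(f"{key}: {value}")
```

For 0xcc00010c000800c3, UC and PCC are clear (the error was corrected and processor context is intact), OVER is set (an earlier error in this bank was overwritten before being read), and the error code decodes to a memory scrubbing error on channel 3, matching the "COR ... OVER MS channel 3 memory error" line from the kernel.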