ECC support

Bob Bishop rb at gid.co.uk
Thu Oct 22 18:57:41 UTC 2015


Hi,

> On 22 Oct 2015, at 19:09, John Baldwin <jhb at freebsd.org> wrote:
> 
> On Wednesday, September 16, 2015 10:56:52 AM Dieter BSD wrote:
>> Chris:
>>> MCA: Bank 1, Status 0x9400000000000151
>>> MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
>>> MCA: Vendor "AuthenticAMD", ID 0x100f52, APIC ID 2
>>> 
>>> MCA: Address 0x81cc0e9f0
>>> 
>>> Kind of freaky. I've never had this error on this board before.
>>> On others tho.
>>> 
>>> Try a search for MCA instead.
>> 
>> Is there a decoder ring for those messages?  I don't recall seeing
>> messages like that, although I wasn't looking for them, and they
>> don't leap out at you screaming ERROR! ERROR!  Digital Unix had its
>> problems, but at least the error messages were fairly clear.
>> Something like "single bit memory error at address 0x12345..."
>> A simple edit to sys/x86/x86/mca.c
>>   s/printf("UNCOR ");/printf("Uncorrectable ");/
>>   s/printf("COR ");/printf("Correctable ");/
>> would make the messages at least slightly more meaningful to a viewer
>> who isn't intimently(sp) familiar with the mca.  Which most people aren't.
> 
> The problem is that there are other fields to decode and you can only fit so
> much in one line.  Also, there is not a CPU-independent way to know the
> address of an ECC error. [etc]

On server-class hardware, the platform management controller (BMC or whatever) is probably already decoding these errors for its event log, and that log can be queried via IPMI (or whatever the platform offers).
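
For boxes without a BMC, a rough decoder ring for the "Status" word in the quoted report might look like the sketch below. It assumes only the flag layout that the Intel and AMD MCi_STATUS registers share (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to 57, and the MCA error code in bits 15-0). The macro names and mca_summary() are invented for illustration and are not the identifiers used in sys/x86/x86/mca.c, and nothing here attempts the model-specific fields John mentions.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Flag bits common to Intel and AMD MCi_STATUS (names are illustrative). */
#define MCS_VAL   (1ULL << 63)  /* this record holds a valid error */
#define MCS_OVER  (1ULL << 62)  /* an earlier error was overwritten */
#define MCS_UC    (1ULL << 61)  /* error was not corrected by hardware */
#define MCS_EN    (1ULL << 60)  /* reporting was enabled for this error */
#define MCS_MISCV (1ULL << 59)  /* MCi_MISC holds extra information */
#define MCS_ADDRV (1ULL << 58)  /* MCi_ADDR holds a valid address */
#define MCS_PCC   (1ULL << 57)  /* processor context may be corrupt */

/*
 * Hypothetical helper: write a one-line, human-readable summary of an
 * MCi_STATUS value into buf.  Returns what snprintf() returns.
 */
int
mca_summary(uint64_t status, char *buf, size_t len)
{
    if ((status & MCS_VAL) == 0)
        return (snprintf(buf, len, "no valid error recorded"));

    return (snprintf(buf, len, "%s error%s%s%s, error code 0x%04x",
        (status & MCS_UC) ? "Uncorrectable" : "Correctable",
        (status & MCS_OVER) ? ", earlier error overwritten" : "",
        (status & MCS_PCC) ? ", processor context corrupt" : "",
        (status & MCS_ADDRV) ? ", address valid" : "",
        (unsigned int)(status & 0xffff)));
}
```

Fed the status from Chris's report, mca_summary(0x9400000000000151ULL, buf, sizeof(buf)) leaves "Correctable error, address valid, error code 0x0151" in buf. Turning the 0x0151 error code and the address into something like "single bit memory error at ..." is the CPU-dependent part and is left out.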

--
Bob Bishop
rb at gid.co.uk





