ECC support

John Baldwin jhb at freebsd.org
Thu Oct 22 18:14:00 UTC 2015


On Wednesday, September 16, 2015 10:56:52 AM Dieter BSD wrote:
> Chris:
> > MCA: Bank 1, Status 0x9400000000000151
> > MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
> > MCA: Vendor "AuthenticAMD", ID 0x100f52, APIC ID 2
> >
> > MCA: Address 0x81cc0e9f0
> >
> > Kind of freaky. I've never had this error on this board before.
> > On others tho.
> >
> > Try a search for MCA instead.
> 
> Is there a decoder ring for those messages?  I don't recall seeing
> messages like that, although I wasn't looking for them, and they
> don't leap out at you screaming ERROR! ERROR!  Digital Unix had its
> problems, but at least the error messages were fairly clear.
> Something like "single bit memory error at address 0x12345..."
> A simple edit to sys/x86/x86/mca.c
>    s/printf("UNCOR ");/printf("Uncorrectable ");/
>    s/printf("COR ");/printf("Correctable ");/
> would make the messages at least slightly more meaningful to a viewer
> who isn't intimately familiar with the MCA.  Which most people aren't.

The problem is that there are other fields to decode, and you can only fit so
much on one line.  Also, there is no CPU-independent way to determine the
address of an ECC error.  On Intel Core i3/5/7 (anything with QPI) you can
at least identify the individual DIMM, but the label the motherboard
manufacturer uses for it varies by manufacturer.  (You can maybe scrape that
text from the SMBIOS tables, but only if they aren't wrong, which they
sometimes are, and good luck knowing whether they are right or wrong.)
Digital UNIX had the luxury of running on hardware built by the same company,
not on a random assortment of boards built by various vendors.  FreeBSD does
not.
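
To give a feel for the fields involved, here is a minimal Python sketch that
pulls the architectural flag bits out of a bank status value such as the one
quoted above.  It is an illustration only, not the code in
sys/x86/x86/mca.c: the function names are made up for this example, and it
leaves the error code raw because interpreting it further depends on the CPU.

```python
# Architectural flag bits of an MCi_STATUS register value.
# Everything below bit 57 is treated as opaque here, apart from the two
# 16-bit error-code fields, which are returned undecoded.
MC_STATUS_FLAGS = {
    "VAL": 63,    # the bank holds a valid error record
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # the MISC register holds extra information
    "ADDRV": 58,  # the ADDR register holds an address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mc_status(status):
    """Split a 64-bit status value into flags and raw error codes."""
    flags = {name: bool((status >> bit) & 1)
             for name, bit in MC_STATUS_FLAGS.items()}
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,            # bits 15:0
        "model_error_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def summarize(bank, status, address=None):
    """Build a one-line, human-readable summary (hypothetical format)."""
    rec = decode_mc_status(status)
    flags = rec["flags"]
    if not flags["VAL"]:
        return f"bank {bank}: no valid error logged"
    severity = "uncorrected" if flags["UC"] else "corrected"
    line = f"bank {bank}: {severity} error, MCA code 0x{rec['mca_error_code']:04x}"
    if flags["ADDRV"] and address is not None:
        line += f", address 0x{address:x}"
    return line


if __name__ == "__main__":
    # Values taken from the MCA lines quoted earlier in this message.
    print(summarize(1, 0x9400000000000151, 0x81cc0e9f0))
```

Even this much shows the squeeze: seven flags plus two code fields already
crowd a single line before any CPU-specific decoding of the error code or
address is attempted.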

sysutils/mcelog does some more verbose decoding of MCA records, but I find
its output to be just as much gibberish for anyone not intimately familiar
with a specific CPU.

I wrote a tool for a previous employer that was able to do some simple parsing
of MCA errors for Supermicro X7-X10 boards (Intel CPUs) and give a short
summary that was used in a nagios check.  However, it only handles a narrow
set of systems.

https://github.com/freebsd/freebsd/compare/master...bsdjhb:ecc
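
For readers who have not written a nagios check, the general shape of such a
summary might look like the Python sketch below.  This is not the tool in the
branch above; the function name, the message text, and the choice to map
corrected errors to WARNING and uncorrected ones to CRITICAL are assumptions
made for illustration.  The only conventions relied on are the usual nagios
plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL) and the VAL and UC bits of
the bank status value.

```python
import sys

# Conventional nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

VAL_BIT = 1 << 63  # status holds a valid error record
UC_BIT = 1 << 61   # error was not corrected by hardware


def check_mca(statuses):
    """Return (exit_code, message) for a list of bank status values."""
    valid = [s for s in statuses if s & VAL_BIT]
    uncorrected = sum(1 for s in valid if s & UC_BIT)
    corrected = len(valid) - uncorrected
    if uncorrected:
        return CRITICAL, (f"CRITICAL: {uncorrected} uncorrected, "
                          f"{corrected} corrected MCA error(s)")
    if corrected:
        return WARNING, f"WARNING: {corrected} corrected MCA error(s)"
    return OK, "OK: no MCA errors logged"


if __name__ == "__main__":
    # Status values are passed as hex strings on the command line here;
    # how a real check would collect them is outside this sketch.
    code, message = check_mca([int(arg, 16) for arg in sys.argv[1:]])
    print(message)
    sys.exit(code)
```

A real check still needs the board-specific part, namely turning the record
into a DIMM label that matches what is printed on the motherboard.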

-- 
John Baldwin


More information about the freebsd-hackers mailing list