Interpreting MCA error output

Sun Oct 2 14:00:06 UTC 2011

On Sun, Oct 02, 2011 at 09:37:43AM +0200, Thomas Zander wrote:
> On Sat, Oct 1, 2011 at 12:23, Jeremy Chadwick <freebsd at jdc.parodius.com> wrote:
> 
> > So what should you do?  Replace the RAM.  Which DIMM?  Sadly I don't
> > know how to determine that.  Some system BIOSes (particularly on AMD
> > systems I've used) let you do memory tests (similar to memtest86) within
> > the BIOS which can then tell you which DIMM slot experienced a problem.
> > If yours doesn't have that, I would have to say purchase all new RAM
> > (yes, all of it) and test the individual DIMMs later so you can
> > determine which is bad.
> 
> Well, I wasn't too surprised by the panic. I have read somewhere that
> in these situations the kernel might simply panic since the system
> might be in a compromised state. So far so ... well ... acceptable.

IMO, you absolutely want the system to panic when an MCE arrives that
the kernel does not know how to handle gracefully.

There are some MCEs which can be treated as "informational".  A common
example is an MCE that indicates the CPU itself (not system RAM!)
experienced a single-bit error in its L1/L2/L3 cache which ECC was able
to correct.  Solaris 10 reports this as informational.  The kernel in
this case also keeps count of how many
times it encounters this MCE for the particular CPU (either core or
physical CPU, depending on if L1/L2/L3 is shared across cores or
dedicated), and if a threshold is reached, it actually takes the CPU
offline.
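
To make the counting idea concrete, here's a minimal sketch in Python.
It is not how Solaris implements it; the class name, the method names,
and the threshold of 3 are all made up for illustration.

```python
from collections import Counter


class CorrectableErrorTracker:
    """Count correctable cache errors per CPU and retire CPUs over a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical value, not a Solaris default
        self.counts = Counter()     # cpu id -> correctable errors seen so far
        self.offline = set()        # cpu ids we have decided to take offline

    def record(self, cpu_id):
        """Record one correctable error; return True if this retires the CPU."""
        if cpu_id in self.offline:
            return False
        self.counts[cpu_id] += 1
        if self.counts[cpu_id] >= self.threshold:
            self.offline.add(cpu_id)
            return True
        return False


tracker = CorrectableErrorTracker(threshold=3)
# Errors arrive on CPU 0, 1, 0, 0, 1; only CPU 0 reaches the threshold.
results = [tracker.record(cpu) for cpu in (0, 1, 0, 0, 1)]
print(results, sorted(tracker.offline))
```

A real framework would also age old counts out over time rather than
keep them forever; that part is left out here.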

In the case of FreeBSD (which I do not think has this type of
framework), the administrator has to keep an eye on this type of MCE
over time.  Correctable L1/L2/L3 errors are actually normal (think about
how often these caches get used!), but an excessive number of them in a
short period of time means it's time to replace the CPU.
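
If you wanted to automate the "keep an eye on it" part, something like
the following sliding-window check would do.  This is only a sketch: the
function name, the timestamps (seconds), and the limits are assumptions
of mine, and how you collect the timestamps from your logs is up to you.

```python
from collections import deque


def excessive_bursts(timestamps, max_events, window_seconds):
    """Return the timestamps at which more than max_events correctable
    errors had occurred within the preceding window_seconds."""
    recent = deque()
    alerts = []
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop events that have fallen out of the window ending at ts.
        while ts - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) > max_events:
            alerts.append(ts)
    return alerts


# One error a day is background noise; five in under a minute is not.
spread_out = [0, 86400, 172800, 259200]
burst = [0, 10, 20, 30, 40]
print(excessive_bursts(spread_out, max_events=3, window_seconds=3600))
print(excessive_bursts(burst, max_events=3, window_seconds=3600))
```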

Of course, this means that for certain MCEs which are "informational"
(i.e. recoverable), the kernel might panic until code in the kernel
gets written to handle said MCE gracefully.  This applies to all OSes,
naturally, and gets into a cat-and-mouse game when CPU manufacturers
release a new CPU on the market.

Again, the above example is not your situation, but I wanted to provide
an example of something that can be auto-corrected (meaning harmless)
but requires the SA to keep an eye on the system.  Solaris is quite nice
in this regard; fmd (Fault Manager Daemon) and its related framework are
really great for this stuff (look up fmd, fmadm, or fmdump online).
More on a "confusing" MCE momentarily (with an example on Solaris 10).

> My question here is how can I be certain right now if any of the DIMMs
> has gone bad.

You can be absolutely 100% certain that the error happened.  The MCE is
not a "guess" at what's going on -- the hardware literally reported the
situation to the system (either via NMI or SMI (probably the latter)).
The MCE really did happen; it's not fake.
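
What the hardware reports ends up in a per-bank status register
(IA32_MCi_STATUS on x86).  Below is a rough Python sketch of pulling
apart the architectural bits as laid out in Intel's SDM.  The function
names are mine and the two example values are invented for
illustration; they are not from your machine.

```python
# Architectural flag bits of IA32_MCi_STATUS (per the Intel SDM).
FLAG_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting for this error was enabled
    "MISCV": 59,  # MCi_MISC holds extra information
    "ADDRV": 58,  # MCi_ADDR holds the failing address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status):
    """Split a 64-bit status value into flags and error codes."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def summarize(status):
    """One-line summary: corrected vs. uncorrected, plus the MCA error code."""
    decoded = decode_mci_status(status)
    flags = decoded["flags"]
    if not flags["VAL"]:
        return "no valid error logged"
    kind = "uncorrected" if flags["UC"] else "corrected"
    note = " (processor context corrupt)" if flags["PCC"] else ""
    return f"{kind} error{note}, MCA code {decoded['mca_error_code']:#06x}"


# Invented example values, for illustration only.
print(summarize(0x9400000000000151))
print(summarize(0xBE0000000001009F))
```

Anything beyond those bits (what a given MCA error code means, what is
in the model-specific field) depends on the CPU, which is exactly why a
decoder like mcelog is useful.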

What you *can't* be certain of is that memtest86 or memtest86+, run for
an hour or two, would show you any errors.

So what I'm trying to say is: you definitely have a DIMM that is either
downright bad, or at bare minimum, flaky to the point where it's
suffering from uncorrectable multi-bit errors.  When you will see that
happen is unknown to me, but it's more likely you'll see the situation
happen if you let memtest86/memtest86+ run for a while.

Be aware that in memtest86 (not sure about memtest86+ but probably the
same) you may have to adjust the "Error Report Mode" to show you things
like ECC corrections when they happen.  I *think* they're disabled by
default, but I'm not sure.  Search for ECC here:

http://www.memtest86.com/tech.html

> You mentioned problems you have all the time with DIMMs due to bad
> cooling in data centers. My machine in question is not located in a
> data center, that was my home server that tends to have very little
> load. But being located in my apartment, there are lots of _potential_
> problems, including stability of power. In fact this was the first MCA
> event with these DIMMs ever, in more than a year.

Understood.  Let me try to explain what I was getting at:

In actual production datacenters at my workplace we see MCEs which are
indicative of thermal problems with our DCs, and I'd say ~90% of the
time engineers decode these MCEs incorrectly (meaning their reaction is
incorrect for the situation).  Here's an example of one (again, taken
from Solaris 10, with some data XXX'd out given its sensitive nature):

# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 21 21:02:33 e1975284-e77c-6c00-d1be-a2e640b12f4a  INTEL-8001-3S  Major

Host        : XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Platform    : S5000PSL  Chassis_id  : XXXXXXXX
Product_sn  :

Fault class : fault.memory.intel.fbd.otf
Problem in  : "MB"
(hc://:product-id=S5000PSL:server-id=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:chassis-id=XXXXXXXX/motherboard=0/memory-controller=0/dram-channel=0)
                  faulted but still in service
FRU         : "MB"
(hc://:product-id=S5000PSL:server-id=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:chassis-id=XXXXXXXX/motherboard=0)
                  faulty

Description : Intelligent throttling is disabled in the memory controller and
              the thermal sensor detected over temperature  Refer to
              http://sun.com/msg/INTEL-8001-3S for more information.

Response    : System panic or reset by BIOS

Impact      : System may be unexpectedly reset

Action      : Enable intelligent throttling in BIOS or supply more cooling

The CPUs in these systems are Intel Xeon L5420s, whose memory controller
lives in the chipset's MCH rather than on the CPU die, so it's the MCH
itself that is complaining.  The Description means that the memory
controller's internal thermistor or DTS exceeded its threshold (no idea
what that value is; would need to review Intel's documentation),
which means almost certainly there are issues with rack, datacenter, or
chassis cooling.  Our system BIOSes have MCH throttling disabled
intentionally so we can detect these situations, else there would be a
pretty severe hit to memory I/O performance, and given
what we do that would have serious repercussions (I'm not exaggerating
either).

My point: most of our engineers misdiagnose this MCE and immediately
think "bad RAM", "bad motherboard", or "bad CPU" and tell our datacenter
guys to replace the system but keep the disks.  The system is fine; it's
the environment/cooling that's a problem.  I can't really talk about the
rest of the ordeal (I'm already on the fence about the above) -- I just
wanted to provide an example of an MCE which, once decoded, can still be
a little tricky to actually understand and react to.
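
One way to avoid that misdiagnosis is to read the "Fault class",
"Description", and "Action" fields before reaching for replacement
hardware.  Here's a rough Python sketch that pulls those fields out of
text shaped like the fmadm output above.  The parser and the keyword
list are my own assumptions, not anything fmd provides, and the sample
is a trimmed copy of the report shown earlier.

```python
import re

# A field starts at column 0 as "Name : value"; indented lines continue it.
FIELD = re.compile(r"^([A-Z][A-Za-z_ ]*?)\s*:\s+(.*)$")

THERMAL_HINTS = ("thermal", "temperature", "cooling")  # hypothetical keyword list


def parse_fault_report(text):
    """Collect 'Name : value' fields, joining indented continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line ends the current field
            continue
        match = FIELD.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current is not None:
            fields[current] += " " + " ".join(line.split())
    return fields


def looks_environmental(fields):
    """Heuristic: do Description/Action point at cooling, not a bad part?"""
    text = " ".join(fields.get(k, "") for k in ("Description", "Action")).lower()
    return any(hint in text for hint in THERMAL_HINTS)


SAMPLE = """\
Fault class : fault.memory.intel.fbd.otf
Description : Intelligent throttling is disabled in the memory controller and
              the thermal sensor detected over temperature  Refer to
              http://sun.com/msg/INTEL-8001-3S for more information.

Response    : System panic or reset by BIOS

Impact      : System may be unexpectedly reset

Action      : Enable intelligent throttling in BIOS or supply more cooling
"""

report = parse_fault_report(SAMPLE)
print(report["Fault class"], "->", report["Action"])
print("environmental?", looks_environmental(report))
```

A keyword match is obviously crude; the point is just that the answer
("supply more cooling") is already in the report if you read it.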

> But of course you could be right. A DIMM could be rotten. Absolutely.
> Regarding your suggestion to do memory tests: My BIOS does not support
> testing, so I booted up memtest86+ after reading your e-mail and let
> it run for almost a whole day now. It did not encounter a single
> problem.

Okay, so the error may have been a DIMM soft error that affected
multiple bits (RAM has two kinds of errors, soft and hard; rather not
get into a discussion about that though).  See my above paragraph about
poking at ECC with memtest86 and memtest86+ -- and also be aware the two
programs are very similar but actually, internally, have some very key
differences.  It's usually best to run both of them.  Generally speaking
you don't need to run them for an entire day though (usually 1-2 hours
will iterate over all DRAM on all DIMMs a few times, depending on how
much RAM there is, and errors usually pop up very quickly).

So at this point you can choose to do nothing or you can choose to
replace the DIMMs in advance.  If you choose to do nothing, that's
totally cool -- if you want to wait for it to happen again, that's
absolutely reasonable.  It's entirely up to you.  You could choose to
put up with the MCEs for many years to come too, that's also a
completely valid option!  (I'm not being sarcastic either; for example,
see the section in the FreeBSD Handbook about Backups.  One of the
choices is "Do nothing" -- really!)

> So, even if I bought new DIMMs at once, it might take weeks to figure
> out which DIMM is rotten, if at all. Assuming that MCA events stay
> this infrequent, that is.
> Of course I'll observe the machine closely, but if the rate stays at
> one MCA event per year, it'll take some time to figure out the broken
> DIMM :-)

Excellent.  Like I said, nothing wrong with this decision, and you made
it based on your own conclusions.  This is exactly how to handle this
sort of situation/MCE.  :-)

> > I should really work with John to make mcelog a FreeBSD port and just
> > regularly update it with patches, etc. to work on FreeBSD.  DMI support
> > and so on I don't think can be added (at least not by me), but simple
> > ASCII decoding?  Very possible.
> 
> That would be absolutely helpful! After all, FreeBSD is primarily a
> server OS, and where would one have ECC if not on servers. Being able
> to determine what's wrong with memory would be certainly very valuable
> for many admins.

Shortly after my Email, I put together a port for it (sysutils/mcelog).
But before I send-pr it to have it added, I wanted to clear it with
John Baldwin first, since the port involves his patches.  Those would
live in the port's files/ directory rather than be fetched from his web
page, because he could change the patch at any time (and has in the
past), which would suddenly break the port, and I'd rather that not
happen spuriously.  I'll work something out though, don't worry,
as it's a utility we should definitely have in ports.  It's also very
bare-bones (only dependency is gmake, and that wouldn't be required if
the mcelog authors would write decent Makefiles (theirs are awful)).

Pshew, long Email.  :-)  Bed time for me!

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |