MCA messages after upgrade to 8.2-BETA1

Tue Dec 28 16:44:23 UTC 2010

On Friday, December 24, 2010 3:47:16 am Matthew D. Fuller wrote:
> On Wed, Dec 22, 2010 at 09:57:26AM -0500 I heard the voice of
> John Baldwin, and lo! it spake thus:
> > 
> > You are getting corrected ECC errors in your RAM.
> 
> Actually, don't
> 
> > CPU 0 0 data cache 
> > ADDR 236493c0 
> >   Data cache ECC error (syndrome 1c)
> 
> > CPU 0 1 instruction cache 
> > ADDR 2a1c9440 
> >   Instruction cache ECC error
> 
> > CPU 0 2 bus unit 
> >   L2 cache ECC error
> 
> > CPU 1 0 data cache 
> > ADDR 23649640 
> >   Data cache ECC error (syndrome 1c)
> 
> > CPU 1 1 instruction cache 
> > ADDR 2a1c9440 
> >   Instruction cache ECC error
> 
> > CPU 1 2 bus unit 
> >   L2 cache ECC error
> 
> suggest CPU cache, not RAM?
> 
> (that's actually a question; I don't know, but that's what a naive
> reading suggests...)

Hmm, I don't know for certain.  My interpretation is that the CPU cache errors
were just secondary effects of a memory error like the one below, which
appeared in the middle of the reported errors.  It was reported only on CPU 0,
not on CPU 1:

STATUS d000400000000863 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0 
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is NOT a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge 
MISC e00d0fff00000000 ADDR 2cac9678 
  Northbridge RAM ECC error
  ECC syndrome = 1c
       bit33 = err cpu1
       bit46 = corrected ecc error
       bit59 = misc error valid
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
             generic read mem transaction
             memory access, level generic'
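
To make the "only on CPU 0" point concrete, here is a minimal sketch that
groups the record headers quoted in this thread by CPU and bank.  The regex,
the function names, and the assumed "CPU <cpu> <bank> <unit>" layout are my
own illustration, not part of mcelog or the kernel:

```python
import re
from collections import defaultdict

# Record header lines copied from the decoded output quoted in this thread.
LINES = """\
CPU 0 0 data cache
CPU 0 1 instruction cache
CPU 0 2 bus unit
CPU 1 0 data cache
CPU 1 1 instruction cache
CPU 1 2 bus unit
CPU 0 4 northbridge
"""

# Assumed layout (hypothetical parser): "CPU <cpu> <bank> <unit name>".
HEADER = re.compile(r"^CPU (\d+) (\d+) (.+?)\s*$")


def banks_by_cpu(text):
    """Map each CPU number to the set of (bank, unit) pairs reported for it."""
    seen = defaultdict(set)
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            seen[int(m.group(1))].add((int(m.group(2)), m.group(3)))
    return dict(seen)


def only_on(seen, cpu):
    """Return the (bank, unit) pairs reported on `cpu` and on no other CPU."""
    others = set().union(*(v for k, v in seen.items() if k != cpu))
    return seen.get(cpu, set()) - others


if __name__ == "__main__":
    seen = banks_by_cpu(LINES)
    print(sorted(only_on(seen, 0)))
```

Run against the headers above, the only bank unique to CPU 0 is bank 4
(northbridge); banks 0 through 2 show up on both CPUs.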

On Intel systems (which I am much more familiar with as far as machine checks 
go), corrected ECC errors did not result in additional events in the CPU 
caches themselves, but I don't know if AMD is different in this regard.  It 
could be that both CPUs and a DIMM are failing, but replacing a DIMM is 
cheaper and simpler and you can always replace the CPUs later if CPU errors 
continue.  Of course, I can't tell you which DIMM to replace from these 
messages, but since the errors are so easily reproducible in this case, you 
could probably swap the DIMMs out one at a time to test.

-- 
John Baldwin