MCA: CPU 0 UNCOR PCC DTLB L1 error

Mon May 16 16:51:27 UTC 2011

On Mon, May 16, 2011 at 06:23:19PM +0200, John Hay wrote:
> On Wed, May 11, 2011 at 05:26:50PM -0500, Alan Cox wrote:
> > On Tue, May 10, 2011 at 7:52 AM, John Hay <jhay at meraka.org.za> wrote:
> > 
> > > Hi,
> > >
> > > I have seen this panic a few times on a Gigabyte E350N-USB3 running
> > > 8-STABLE.
> > > I have only seen it while in X, but then the machine is always in X. At
> > > first,
> > > I just got these hangs, so bought a PCI-express RS232 card and could see
> > > these
> > > at last. For some reason it does not go past this, so I have not been able
> > > to
> > > get a dump yet.
> > >
> > > Have anybody an idea of why this is or how to debug it further? I searched
> > > the archives and found something similar about a year ago, but it looks
> > > like it was solved with a fix that got committed.
> > >
> > > http://www.freebsd.org/cgi/query-pr.cgi?pr=140338
> > >
> > > I have now disabled mca in loader.conf with 'hw.mca.enabled="0"' and I have
> > > not seen that panic again. I do occasionally see a panic in devfs_open(),
> > > but I guess that should be handled in another thread.
> > >
> > > The kernel is basically a GENERIC kernel with puc uncommented and the
> > > following in loader.conf
> > >
> > > vm.kmem_size="12G"
> > > hw.mca.enabled="0"
> > > zfs_load="YES"
> > > ahci_load="YES"
> > > xhci_load="YES"
> > > amdtemp_load="YES"
> > > ng_ubt_load="YES"
> > > uplcom_load="YES"
> > >
> > > Here is the panic message and after that dmesg.
> > >
> > > John
> > > --
> > > John Hay -- jhay at meraka.csir.co.za / jhay at FreeBSD.org
> > >
> > > ####################################################
> > > MCA: Bank 0, Status 0xb600000000010015
> > > MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
> > > MCA: Vendor "AuthenticAMD", ID 0x500f10, APIC ID 0
> > > MCA: CPU 0 UNCOR PCC DTLB L1 error
> > > MCA: Address 0x8016c4000
> > >
> > >
> > > Fatal trap 28: machine check trap while in user mode
> > > cpuid = 0; apic id = 00
> > > instruction pointer     = 0x43:0x80156af85
> > > stack pointer           = 0x3b:0x7fffffffcb18
> > > frame pointer           = 0x3b:0x80fe87800
> > > code segment            = base 0x0, limit 0xfffff, type 0x1b
> > >                        = DPL 3, pres 1, long 1, def32 0, gran 1
> > > processor eflags        = interrupt enabled, IOPL = 0
> > > current process         = 2484 (initial thread)
> > > trap number             = 28
> > > panic: machine check trap
> > > cpuid = 0
> > > KDB: stack backtrace:
> > > #0 0xffffffff80608d5e at kdb_backtrace+0x5e
> > > #1 0xffffffff805d6707 at panic+0x187
> > > #2 0xffffffff808bf4c0 at trap_fatal+0x290
> > > #3 0xffffffff808bfaa9 at trap+0x109
> > > #4 0xffffffff808a7d94 at calltrap+0x8
> > > ####################################################
> > >
> > >
> > Please try the following patch:
> > 
> > Index: x86/x86/mca.c
> > ===================================================================
> > --- x86/x86/mca.c       (revision 219060)
> > +++ x86/x86/mca.c       (working copy)
> > @@ -665,7 +665,8 @@ mca_setup(uint64_t mcg_cap)
> >          * for Erratum 383.
> >          */
> >         if (cpu_vendor_id == CPU_VENDOR_AMD &&
> > -           CPUID_TO_FAMILY(cpu_id) == 0x10 && amd10h_L1TP)
> > +           (CPUID_TO_FAMILY(cpu_id) == 0x10 ||
> > +           CPUID_TO_FAMILY(cpu_id) == 0x14) && amd10h_L1TP)
> >                 workaround_erratum383 = 1;
> > 
> >         mtx_init(&mca_lock, "mca", NULL, MTX_SPIN);
> > Index: i386/i386/pmap.c
> > ===================================================================
> > --- i386/i386/pmap.c    (revision 219060)
> > +++ i386/i386/pmap.c    (working copy)
> > @@ -758,7 +758,8 @@ pmap_init(void)
> >          * machine monitor.
> >          */
> >         if (vm_guest == VM_GUEST_VM && cpu_vendor_id == CPU_VENDOR_AMD &&
> > -           CPUID_TO_FAMILY(cpu_id) == 0x10)
> > +           (CPUID_TO_FAMILY(cpu_id) == 0x10 ||
> > +           CPUID_TO_FAMILY(cpu_id) == 0x14))
> >                 workaround_erratum383 = 1;
> > 
> >         /*
> > Index: amd64/amd64/pmap.c
> > ===================================================================
> > --- amd64/amd64/pmap.c  (revision 219060)
> > +++ amd64/amd64/pmap.c  (working copy)
> > @@ -727,7 +727,8 @@ pmap_init(void)
> >          * machine monitor.
> >          */
> >         if (vm_guest == VM_GUEST_VM && cpu_vendor_id == CPU_VENDOR_AMD &&
> > -           CPUID_TO_FAMILY(cpu_id) == 0x10)
> > +           (CPUID_TO_FAMILY(cpu_id) == 0x10 ||
> > +           CPUID_TO_FAMILY(cpu_id) == 0x14))
> >                 workaround_erratum383 = 1;
> > 
> >         /*
> 
> I have applied the patch, but got another one today. I still do not get
> a prompt or dump. :-( It just get stuck right after #4. If there is anything
> more that I can try, just ask.
> 
> #####################################################################
> MCA: Bank 0, Status 0xb600000000010015
> MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
> MCA: Vendor "AuthenticAMD", ID 0x500f10, APIC ID 0
> MCA: CPU 0 UNCOR PCC DTLB L1 error
> MCA: Address 0x808ace000
> 
> 
> Fatal trap 28: machine check trap while in user mode
> cpuid = 1; apic id = 01
> instruction pointer	= 0x43:0x80af206d5
> stack pointer	        = 0x3b:0x7fffffffb8e8
> frame pointer	        = 0x3b:0x809b92450
> code segment		= base 0x0, limit 0xfffff, type 0x1b
> 			= DPL 3, pres 1, long 1, def32 0, gran 1
> processor eflags	= interrupt enabled, IOPL = 0
> current process		= 22228 (initial thread)
> trap number		= 28
> panic: machine check trap
> cpuid = 1
> KDB: stack backtrace:
> #0 0xffffffff80608f6e at kdb_backtrace+0x5e
> #1 0xffffffff805d6917 at panic+0x187
> #2 0xffffffff808bf7c0 at trap_fatal+0x290
> #3 0xffffffff808bfda9 at trap+0x109
> #4 0xffffffff808a8084 at calltrap+0x8
> #####################################################################

The backtrace doesn't help in this situation.  I'm not sure anyone has
taken the time to explain to you what's going on here exactly.  I don't
know if you're like me, but when a machine panics I generally like to
know what's going on.  :-)

Your CPU's MCA (see Wikipedia for Machine Check Architecture) is raising
an MCE (see Wikipedia for Machine Check Exception).  MCEs are generated
by hardware when it detects an internal error -- they usually indicate a
failure (bad RAM, CPU cache failing, etc.).

Certain MCEs are considered "normal"; for example, L2 cache (on-die in
the CPU) being auto-corrected by ECC (that's ECC on-die, not ECC RAM
like system RAM; this feature is only available on certain classes of
CPUs) may be normal if seen, say, once every few months.  A large number
of them, however, is not normal.

MCE handling is done in the kernel.  Certain MCEs have to be ignored,
and therefore there are handlers for those in the kernel.

MCEs vary greatly from one CPU model to the next (not class, but model).
For example, Intel's documentation on their MCEs is immense and very
complex given all the different CPU models and series.

Any MCE without a handler will cause a kernel panic like the one you see
above.  This is normal on FreeBSD, as well as Solaris and many other
OSes; it's basically mandatory.  The reason is that if the condition
isn't known to be something that can be ignored, the hardware may be in
an inconsistent state and cannot be trusted.  Hence, panic.  The
backtrace will therefore always be very short and indicate an
intentional panic.

The MCE messages shown in FreeBSD are not very user-friendly, meaning
you can't take what you see and go "omg!!! L1 cache failure!!" because
that's not necessarily what that message means.  MCA is complex, and
again, like I said, varies per model of CPU.
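
That said, the raw status word can at least be split into its
architectural fields.  Here is a minimal Python sketch, assuming the
standard x86 MCi_STATUS layout (flag bits 57-63, the MCA error code in
the low 16 bits).  decode_status and the dictionary keys are names made
up for illustration, and bits 16-31 are model-specific, so they are
reported raw:

```python
# Architectural flag bits in the upper part of an MCi_STATUS register.
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context corrupt
}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into its architectural fields."""
    flags = [name
             for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF                   # MCA error code
    model_specific = (status >> 16) & 0xFFFF  # meaning varies per CPU model
    result = {"flags": flags, "error_code": code,
              "model_specific": model_specific}
    # TLB errors are encoded as 0000 0000 0001 TTLL.
    if (code & 0xFFF0) == 0x0010:
        tt = ("instruction", "data", "generic", "reserved")[(code >> 2) & 3]
        ll = ("L0", "L1", "L2", "generic")[code & 3]
        result["kind"] = f"{tt} TLB, level {ll}"
    return result


if __name__ == "__main__":
    print(decode_status(0xB600000000010015))
```

For the status in this thread it reports VAL, UC, EN, ADDRV and PCC set
and a data TLB error at level L1, which is all the kernel's one-line
"UNCOR PCC DTLB L1" summary is saying.  It does not tell you why the CPU
raised it.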

There is a utility on Linux called mcelog that can decode the messages
to some degree.  John Baldwin ported this to FreeBSD (it's not in ports)
and I've been occasionally downloading it and ensuring the patches work
correctly + utility compiles and works (I have patches for patches,
basically; no I haven't put them up anywhere).  "mcelog --ascii" will
read data from stdin, specifically the messages you see from the kernel,
and it outputs something a little more friendly.
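
To give an idea of the input that mode works with, here is a rough
Python sketch that pulls the fields out of the kernel's "MCA:" lines.
This is not mcelog's actual parser: parse_mca_lines and the key names
are hypothetical, and the regular expressions are based only on the
messages quoted in this thread:

```python
import re

SAMPLE = """\
> MCA: Bank 0, Status 0xb600000000010015
> MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
> MCA: Vendor "AuthenticAMD", ID 0x500f10, APIC ID 0
> MCA: CPU 0 UNCOR PCC DTLB L1 error
> MCA: Address 0x808ace000
"""

HEX = r"(0x[0-9a-fA-F]+)"


def parse_mca_lines(text):
    """Collect the fields of one machine check record from console text."""
    record = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()   # tolerate mail quoting
        if not line.startswith("MCA:"):
            continue
        body = line[len("MCA:"):].strip()
        m = re.match(r"Bank (\d+), Status " + HEX, body)
        if m:
            record["bank"] = int(m.group(1))
            record["status"] = int(m.group(2), 16)
            continue
        m = re.match(r"Global Cap " + HEX + r", Status " + HEX, body)
        if m:
            record["mcg_cap"] = int(m.group(1), 16)
            record["mcg_status"] = int(m.group(2), 16)
            continue
        m = re.match(r'Vendor "([^"]+)", ID ' + HEX + r", APIC ID (\d+)", body)
        if m:
            record["vendor"] = m.group(1)
            record["cpu_id"] = int(m.group(2), 16)
            record["apic_id"] = int(m.group(3))
            continue
        m = re.match(r"Address " + HEX, body)
        if m:
            record["address"] = int(m.group(1), 16)
            continue
        record["summary"] = body           # e.g. the one-line description
    return record


if __name__ == "__main__":
    for key, value in parse_mca_lines(SAMPLE).items():
        print(key, value)
```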

In your case, however, mcelog does not have support for your specific
model of CPU.  Possibly too new?  Here's the output that is returned:

$ ./mcelog --no-dmi --ascii
MCA: Bank 0, Status 0xb600000000010015
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
MCA: Vendor "AuthenticAMD", ID 0x500f10, APIC ID 0
MCA: CPU 0 UNCOR PCC DTLB L1 error
MCA: Address 0x808ace000

mcelog: Unknown CPU type vendor 2 family 14 model 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 0
ADDR 808ace000
STATUS b600000000010015 MCGSTATUS 4
MCGCAP 106 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 20 Model 1

I'm not familiar with AMD CPUs so I can't really look up what's going on
here or what the MCE indicates, but this information may help others on
this list.
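
As an aside, the "family 14" and "Family 20" in that output are
presumably the same number printed once in hex and once in decimal.  A
small Python sketch, assuming AMD's CPUID convention where the extended
family is added to the base family when the base family is 0xF
(cpuid_family_model is a made-up helper name):

```python
def cpuid_family_model(cpu_id):
    """Return (family, model) from a CPUID signature, AMD convention."""
    base_family = (cpu_id >> 8) & 0xF
    base_model = (cpu_id >> 4) & 0xF
    ext_family = (cpu_id >> 20) & 0xFF
    ext_model = (cpu_id >> 16) & 0xF
    family, model = base_family, base_model
    if base_family == 0xF:
        # Extended fields only apply when the base family is saturated.
        family += ext_family
        model |= ext_model << 4
    return family, model


if __name__ == "__main__":
    family, model = cpuid_family_model(0x500F10)
    print(f"family {family} (0x{family:x}), model {model}")
```

For the ID 0x500f10 this gives family 20 (0x14), model 1, which matches
the 0x14 that the patch earlier in this thread tests for.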

A workaround -- though risky -- may be to disable MCA entirely by
setting hw.mca.enabled="0" in /boot/loader.conf and rebooting.  This
will ensure your system won't panic when *any* MCE is seen.  Older
FreeBSD defaulted to MCA being off.  However, since I don't know what
the MCE indicates, the underlying error could be serious (i.e.
panicking might be the better choice).  Hard to say at this point.

Hope this helps educate in one way or another.  :-)

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |