MCA: CPU 0 UNCOR PCC DTLB L1 error

Mon May 16 17:18:39 UTC 2011

On Mon, May 16, 2011 at 06:23:19PM +0200, John Hay wrote:
> On Wed, May 11, 2011 at 05:26:50PM -0500, Alan Cox wrote:
> > On Tue, May 10, 2011 at 7:52 AM, John Hay <jhay at meraka.org.za> wrote:
> > 
> > > Hi,
> > >
> > > I have seen this panic a few times on a Gigabyte E350N-USB3 running
> > > 8-STABLE. I have only seen it while in X, but then the machine is always
> > > in X. At first, I just got these hangs, so I bought a PCI-express RS232
> > > card and could see these at last. For some reason it does not go past
> > > this, so I have not been able to get a dump yet.
> > >
> > > Does anybody have an idea of why this is or how to debug it further? I
> > > searched the archives and found something similar from about a year ago,
> > > but it looks like it was solved with a fix that got committed.
> > >
> > > http://www.freebsd.org/cgi/query-pr.cgi?pr=140338
> > >
> > > I have now disabled mca in loader.conf with 'hw.mca.enabled="0"' and I have
> > > not seen that panic again. I do occasionally see a panic in devfs_open(),
> > > but I guess that should be handled in another thread.
> > >
> > > The kernel is basically a GENERIC kernel with puc uncommented and the
> > > following in loader.conf
> > >
> > > vm.kmem_size="12G"
> > > hw.mca.enabled="0"
> > > zfs_load="YES"
> > > ahci_load="YES"
> > > xhci_load="YES"
> > > amdtemp_load="YES"
> > > ng_ubt_load="YES"
> > > uplcom_load="YES"
> > >
> > > Here is the panic message and after that dmesg.
> > >
> > > John
> > > --
> > > John Hay -- jhay at meraka.csir.co.za / jhay at FreeBSD.org
> > >
> > > ####################################################
> > > MCA: Bank 0, Status 0xb600000000010015
> > > MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
> > > MCA: Vendor "AuthenticAMD", ID 0x500f10, APIC ID 0
> > > MCA: CPU 0 UNCOR PCC DTLB L1 error
> > > MCA: Address 0x8016c4000
> > >
> > >
> > > Fatal trap 28: machine check trap while in user mode
> > > cpuid = 0; apic id = 00
> > > instruction pointer     = 0x43:0x80156af85
> > > stack pointer           = 0x3b:0x7fffffffcb18
> > > frame pointer           = 0x3b:0x80fe87800
> > > code segment            = base 0x0, limit 0xfffff, type 0x1b
> > >                        = DPL 3, pres 1, long 1, def32 0, gran 1
> > > processor eflags        = interrupt enabled, IOPL = 0
> > > current process         = 2484 (initial thread)
> > > trap number             = 28
> > > panic: machine check trap
> > > cpuid = 0
> > > KDB: stack backtrace:
> > > #0 0xffffffff80608d5e at kdb_backtrace+0x5e
> > > #1 0xffffffff805d6707 at panic+0x187
> > > #2 0xffffffff808bf4c0 at trap_fatal+0x290
> > > #3 0xffffffff808bfaa9 at trap+0x109
> > > #4 0xffffffff808a7d94 at calltrap+0x8
> > > ####################################################
> > >
> > >
> > Please try the following patch:
> > 
> > Index: x86/x86/mca.c
> > ===================================================================
> > --- x86/x86/mca.c       (revision 219060)
> > +++ x86/x86/mca.c       (working copy)
> > @@ -665,7 +665,8 @@ mca_setup(uint64_t mcg_cap)
> >          * for Erratum 383.
> >          */
> >         if (cpu_vendor_id == CPU_VENDOR_AMD &&
> > -           CPUID_TO_FAMILY(cpu_id) == 0x10 && amd10h_L1TP)
> > +           (CPUID_TO_FAMILY(cpu_id) == 0x10 ||
> > +           CPUID_TO_FAMILY(cpu_id) == 0x14) && amd10h_L1TP)
> >                 workaround_erratum383 = 1;
> > 
> >         mtx_init(&mca_lock, "mca", NULL, MTX_SPIN);
> > Index: i386/i386/pmap.c
> > ===================================================================
> > --- i386/i386/pmap.c    (revision 219060)
> > +++ i386/i386/pmap.c    (working copy)
> > @@ -758,7 +758,8 @@ pmap_init(void)
> >          * machine monitor.
> >          */
> >         if (vm_guest == VM_GUEST_VM && cpu_vendor_id == CPU_VENDOR_AMD &&
> > -           CPUID_TO_FAMILY(cpu_id) == 0x10)
> > +           (CPUID_TO_FAMILY(cpu_id) == 0x10 ||
> > +           CPUID_TO_FAMILY(cpu_id) == 0x14))
> >                 workaround_erratum383 = 1;
> > 
> >         /*
> > Index: amd64/amd64/pmap.c
> > ===================================================================
> > --- amd64/amd64/pmap.c  (revision 219060)
> > +++ amd64/amd64/pmap.c  (working copy)
> > @@ -727,7 +727,8 @@ pmap_init(void)
> >          * machine monitor.
> >          */
> >         if (vm_guest == VM_GUEST_VM && cpu_vendor_id == CPU_VENDOR_AMD &&
> > -           CPUID_TO_FAMILY(cpu_id) == 0x10)
> > +           (CPUID_TO_FAMILY(cpu_id) == 0x10 ||
> > +           CPUID_TO_FAMILY(cpu_id) == 0x14))
> >                 workaround_erratum383 = 1;
> > 
> >         /*
> 
> I have applied the patch, but got another one today. I still do not get
> a prompt or dump. :-( It just gets stuck right after #4. If there is anything
> more that I can try, just ask.
> 
> #####################################################################
> MCA: Bank 0, Status 0xb600000000010015
> MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
> MCA: Vendor "AuthenticAMD", ID 0x500f10, APIC ID 0
> MCA: CPU 0 UNCOR PCC DTLB L1 error
> MCA: Address 0x808ace000
> 
> 
> Fatal trap 28: machine check trap while in user mode
> cpuid = 1; apic id = 01
> instruction pointer	= 0x43:0x80af206d5
> stack pointer	        = 0x3b:0x7fffffffb8e8
> frame pointer	        = 0x3b:0x809b92450
> code segment		= base 0x0, limit 0xfffff, type 0x1b
> 			= DPL 3, pres 1, long 1, def32 0, gran 1
> processor eflags	= interrupt enabled, IOPL = 0
> current process		= 22228 (initial thread)
> trap number		= 28
> panic: machine check trap
> cpuid = 1
> KDB: stack backtrace:
> #0 0xffffffff80608f6e at kdb_backtrace+0x5e
> #1 0xffffffff805d6917 at panic+0x187
> #2 0xffffffff808bf7c0 at trap_fatal+0x290
> #3 0xffffffff808bfda9 at trap+0x109
> #4 0xffffffff808a8084 at calltrap+0x8
> #####################################################################
> 

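For reference, here is a rough sketch of how I read the status word
0xb600000000010015 from the panic. It assumes the architectural MCi_STATUS
flag layout (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63..57) and the
standard TLB error code format 0000 0000 0001 TTLL in the low 16 bits. The
macro and function names are my own and are not taken from mca.c.

```c
#include <stdint.h>
#include <stdio.h>

/* Architectural MCi_STATUS flag bits. */
#define MC_STATUS_VAL   (1ULL << 63)    /* register contents are valid */
#define MC_STATUS_OVER  (1ULL << 62)    /* an earlier error was overwritten */
#define MC_STATUS_UC    (1ULL << 61)    /* uncorrected error */
#define MC_STATUS_EN    (1ULL << 60)    /* reporting was enabled */
#define MC_STATUS_MISCV (1ULL << 59)    /* MCi_MISC is valid */
#define MC_STATUS_ADDRV (1ULL << 58)    /* MCi_ADDR is valid */
#define MC_STATUS_PCC   (1ULL << 57)    /* processor context corrupt */

/* Hypothetical helper: TLB errors use the code 0000 0000 0001 TTLL. */
static int
mc_is_tlb_error(uint64_t status)
{
	return ((status & 0xfff0) == 0x0010);
}

/*
 * Hypothetical helper: format a TLB error the way the console log shows
 * it, e.g. "UNCOR PCC DTLB L1 error". Returns -1 if the status is not
 * valid or is not a TLB error.
 */
static int
mc_describe(uint64_t status, char *buf, size_t len)
{
	static const char tt[] = { 'I', 'D', 'G', '?' };        /* TT field */
	static const char *ll[] = { "L0", "L1", "L2", "LG" };  /* LL field */

	if (!(status & MC_STATUS_VAL) || !mc_is_tlb_error(status))
		return (-1);
	return (snprintf(buf, len, "%s %s%cTLB %s error",
	    (status & MC_STATUS_UC) ? "UNCOR" : "COR",
	    (status & MC_STATUS_PCC) ? "PCC " : "",
	    tt[(status >> 2) & 0x3], ll[status & 0x3]));
}
```

The top byte 0xb6 has VAL, UC, EN, ADDRV and PCC set, and the low word
0x0015 has TT = 1 (data) and LL = 1 (L1), which matches the
"UNCOR PCC DTLB L1 error" line above. ADDRV being set is why the
"MCA: Address" line is printed.
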
Some extra info. The machine is my new "always on" machine at home. Most
of the panics have happened while I was not there. My wife just mentioned
that it often happens while she is busy typing a reply in Thunderbird. (I
do not use that machine for my email.) So I tried it: I clicked reply on
one of her emails and, within a few lines, it crashed.

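I also checked that the family test in the patch should match this CPU.
Below is a small sketch of how I decoded the "ID 0x500f10" value from the
MCA output, assuming the usual CPUID layout (base family in bits 11..8,
extended family in bits 27..20, added together when the base family is
0xf). The function name is my own; I have not checked that the kernel's
CPUID_TO_FAMILY() macro is written exactly this way.

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute the display family from the CPUID
 * signature (EAX of CPUID leaf 1).
 */
static unsigned
cpuid_family(uint32_t id)
{
	unsigned base = (id >> 8) & 0xf;        /* bits 11..8 */
	unsigned ext = (id >> 20) & 0xff;       /* bits 27..20 */

	/* The extended family only applies when the base family is 0xf. */
	return (base == 0xf ? base + ext : base);
}
```

For 0x500f10 this gives 0xf + 0x5 = 0x14, so the new family 0x14 test
should be true here. The mca.c check also depends on amd10h_L1TP, which I
have not changed.
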
John
-- 
John Hay -- jhay at meraka.csir.co.za / jhay at FreeBSD.org