Fwd: kernel: MCA: CPU 0 COR (1) internal parity error

Matthias Apitz guru at unixarea.de
Sun Jan 18 06:08:50 UTC 2015



Hello,

For some days I have been running a recent -HEAD (r276659) on an Acer C720 Chromebook,
which works very nicely and fast (I have really never seen such a fast KDE4 desktop).

From time to time (let's say 2-3 times a day) I see messages like this
in /var/log/messages:

Jan 16 12:04:24 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005
Jan 16 12:04:24 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
Jan 16 12:04:24 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0
Jan 16 12:04:24 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error

the kernel is:

# uname -a
FreeBSD c720-r276659 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r276659M: Tue Jan  6 12:55:25 CET 2015
guru at vm-poudriere-r269739:/usr/local/acerC720/obj/usr/local/acerC720/src/sys/GENERIC i386

i.e. the i386 version (because I compile everything, kernel and ports, in a VMbox)

I'm attaching below the complete 'dmesg' lines with the information
details about the CPU. 

I raised questions about these MCA messages in freebsd-current@ and was
pointed to a tool in ports/sysutils/mcelog.  Jeremy Chadwick <jdc at koitsu.org>,
the maintainer of mcelog, gave some hints about the issue (see below) and
asked me to bring this up in freebsd-hackers@.

Do these messages really indicate a hardware problem, or is our kernel
misreporting or mis-decoding some hardware information?

Despite the messages, the system does not show any other faults or
panics.
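
For reference, the Status value from the log lines above can be split into its fields by hand. The following is a minimal sketch, assuming the IA32_MCi_STATUS bit layout described in the machine-check chapter of the Intel SDM; the function and field names are made up for illustration, and the code table lists only a few of the "simple" architectural error codes. The corrected-error count field is assumed to be meaningful here because bit 10 of the Global Cap value (0xc07) is set.

```python
# Sketch only: bit positions are assumed from the Intel SDM description of
# IA32_MCi_STATUS; the names below are illustrative, not kernel identifiers.

STATUS = 0x90000040000F0005  # "MCA: Bank 0, Status ..." from /var/log/messages


def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


def decode_mci_status(status):
    """Split a 64-bit machine-check bank status value into named fields."""
    return {
        "valid": bool(bit(status, 63)),
        "overflow": bool(bit(status, 62)),
        "uncorrected": bool(bit(status, 61)),
        "enabled": bool(bit(status, 60)),
        "misc_valid": bool(bit(status, 59)),
        "addr_valid": bool(bit(status, 58)),
        "context_corrupt": bool(bit(status, 57)),
        "corrected_count": (status >> 38) & 0x7FFF,     # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": status & 0xFFFF,               # bits 15:0
    }


# A few of the "simple" architectural error codes (assumed from the SDM table).
SIMPLE_CODES = {
    0x0000: "no error",
    0x0001: "unclassified",
    0x0002: "microcode ROM parity error",
    0x0003: "external error",
    0x0004: "FRC error",
    0x0005: "internal parity error",
}

fields = decode_mci_status(STATUS)
for name, value in fields.items():
    print(f"{name:20s} {value}")
print("meaning:", SIMPLE_CODES.get(fields["mca_error_code"], "not in this table"))
```

Read this way, the value is a valid, enabled, corrected (not uncorrected) event with a count of 1 and architectural error code 5, which agrees with the kernel's "COR (1) internal parity error" line.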

Thanks

	matthias

----- Forwarded message from Jeremy Chadwick <jdc at koitsu.org> -----

Date: Sat, 17 Jan 2015 13:46:53 -0800
From: Jeremy Chadwick <jdc at koitsu.org>
To: Matthias Apitz <guru at unixarea.de>, Eric van Gyzen <eric at vangyzen.net>,
	freebsd-current at freebsd.org
Subject: Re: kernel: MCA: CPU 0 COR (1) internal parity error

On Sat, Jan 17, 2015 at 06:43:26PM +0100, Matthias Apitz wrote:
> On Friday, January 16, 2015 at 03:04:52PM -0500, Eric van Gyzen wrote:
> 
> > On 01/16/2015 14:45, Matthias Apitz wrote:
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error
> > 
> > Try ports/sysutils/mcelog.
> 
> I have installed that port and launched it as
> 
> # mcelog > mcelog.txt
> ...
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> ...
> 
> (the messages are STDERR);
> 
> in 'mcelog.txt' it has for the last event from /var/log/messages:
> 
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0
> Jan 17 18:23:54 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error
> 
> the following lines (the uptime matches):
> 
> ...
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> MCE 32
> CPU 0 BANK 0 TSC 36eec80fd688 [at 1397 Mhz 0 days 12:0:41 uptime (unreliable)]
> MCG status:
> MCi status:
> Error enabled
> MCA: Unknown Error 5
> STATUS 90000040000f0005 MCGSTATUS 0
> MCGCAP c07 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 69
> 
> Questions:
> a) Is the output of mcelog valid (regardless of the msg on STDERR of
>    'unsupported model')?

It may or may not be reliable.  For MCE decoding to work accurately, the
software (read: kernel) needs to have full support for the processor
model and revision in question.  mcelog simply tries to decode the
output that the kernel spits out and provide a more "user-friendly"
explanation.

That isn't as simple as just modifying some table of supported CPUs; it
involves reading Intel documentation and implementing what can be
figured out through that.  VMware has a small KB about this, to give you
some insight into the complexity:

http://kb.vmware.com/kb/1005184

There are some capabilities of MCA that are "semi-universal" across
series of CPUs, so sometimes those can be decoded (mostly) accurately,
but other times such isn't the case.  Sometimes there are certain MCEs
that have to be ignored by the kernel (i.e. the kernel MCE support has to
be updated to reflect changes in MCEs for that newer model of
processor).
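
As an illustration of the architectural ("semi-universal") part, the "Global Cap" value from the log can be decoded without any model-specific knowledge. This is a minimal sketch, assuming the IA32_MCG_CAP bit layout from the Intel SDM; the function and key names are invented for the example.

```python
# Sketch only: field positions are assumed from the Intel SDM description of
# IA32_MCG_CAP; the dictionary keys are illustrative names.

MCG_CAP = 0x0000000000000C07  # "MCA: Global Cap ..." from the log


def decode_mcg_cap(cap):
    """Split the global machine-check capability value into named fields."""
    return {
        "bank_count": cap & 0xFF,            # bits 7:0, number of reporting banks
        "ctl_p": bool((cap >> 8) & 1),       # IA32_MCG_CTL present
        "ext_p": bool((cap >> 9) & 1),       # extended state registers present
        "cmci_p": bool((cap >> 10) & 1),     # corrected-error interrupt/count support
        "tes_p": bool((cap >> 11) & 1),      # threshold-based error status support
        "ext_count": (cap >> 16) & 0xFF,     # bits 23:16
        "ser_p": bool((cap >> 24) & 1),      # software error recovery support
    }


caps = decode_mcg_cap(MCG_CAP)
for name, value in caps.items():
    print(f"{name:12s} {value}")
```

Under these assumptions, 0xc07 describes 7 banks with corrected-error counting and threshold-based status available, and nothing in it depends on the exact processor model.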

The version of mcelog available in ports is extremely old, and the
amount of work to upgrade it to the latest Linux mcelog (1.08) I imagine
would be quite large:

http://git.kernel.org/cgit/utils/cpu/mce/mcelog.git

The existing FreeBSD port involves a large number of patches written by
John Baldwin, and whether or not those can be correctly backported to
newer mcelog releases is unknown.

I really need to renounce my maintainer flag of that port and let
someone else take care of it.

> b) Is it worth contacting the dealer, or should I wait until it is broken
>    completely?

To me, the above message indicates that one of the CPU cores is
damaged/misbehaving.  I cannot determine if it's referring to L1, L2, or
L3 cache, as I don't see any clear indicator of that (possibly due to
the aforementioned explanation I gave about accuracy).

However, I will point you to this thread, which may indicate that the
model of CPU in question (or a series of models of Intel CPUs) has MCEs
that happen which are considered "normal" and are thus not being decoded
correctly:

https://lists.freebsd.org/pipermail/freebsd-questions/2014-January/255873.html

I would suggest providing relevant dmesg lines about your exact
processor in this system and possibly asking for help from either John
Baldwin or someone on freebsd-hackers@.  I myself cannot help with this.
The dmesg lines I'm referring to, by the way, look like this (all of
them matter, particularly the first two):

CPU: Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz (2833.59-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x10677  Family = 0x6  Model = 0x17  Stepping = 7
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x8e3fd<SSE3,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant, performance statistics

The OP of that freebsd-questions thread should have provided this but
didn't (instead just says "Intel i3-4310" -- this isn't precise enough),
so whether or not you two are using the same CPU is unknown.
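
To make the comparison concrete, the "Id" value in those dmesg lines can be broken down into family, model and stepping. This is a minimal sketch, assuming the standard CPUID leaf 1 EAX signature layout; the function name is invented for the example. It is applied both to the Id of the example CPU above (0x10677) and to the ID reported in the MCA lines (0x40651).

```python
# Sketch only: assumes the usual CPUID leaf 1 EAX layout
# (stepping 3:0, model 7:4, family 11:8, ext. model 19:16, ext. family 27:20).


def decode_cpuid_signature(eax):
    """Return (family, model, stepping) as displayed values."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only prepended for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


for cpu_id in (0x10677, 0x40651):
    family, model, stepping = decode_cpuid_signature(cpu_id)
    print(f"Id=0x{cpu_id:x}  Family=0x{family:x}  "
          f"Model=0x{model:x} ({model} decimal)  Stepping={stepping}")
```

With this reading, 0x40651 is family 6, model 0x45 (69 decimal), stepping 1. That would also account for mcelog printing "Model 45" on STDERR and "Model 69" in its output, assuming the first number is hexadecimal.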

There simply could be "new MCEs" or changes to the MCA that Intel
implemented in some newer models of Core iX that aren't being handled
correctly by the kernel (i.e. misreporting or mis-decoding).

Good luck!

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |


----- End forwarded message -----


Here is the 'dmesg' output:

Copyright (c) 1992-2015 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.0-CURRENT #0 r276659M: Tue Jan  6 12:55:25 CET 2015
    guru at vm-poudriere-r269739:/usr/local/acerC720/obj/usr/local/acerC720/src/sys/GENERIC i386
FreeBSD clang version 3.5.0 (tags/RELEASE_350/final 216957) 20141124
VT: running with driver "vga".
CPU: Intel(R) Celeron(R) 2955U @ 1.40GHz (1396.80-MHz 686-class CPU)
  Origin="GenuineIntel"  Id=0x40651  Family=0x6  Model=0x45  Stepping=1
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4ddaebbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,<b11>,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,MOVBE,POPCNT,TSCDLT,XSAVE,OSXSAVE,RDRAND>
  AMD Features=0x2c100000<NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x21<LAHF,ABM>
  Structured Extended Features=0x2603<FSGSBASE,TSCADJ,ERMS,INVPCID>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: (disabled in BIOS) PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 2079825920 (1983 MB)
avail memory = 2014580736 (1921 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <CORE   COREBOOT>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  2
ioapic0 <Version 2.0> irqs 0-39 on motherboard
Cuse4BSD v0.1.33 @ /dev/cuse
random: entropy device infrastructure driver
random: selecting highest priority adaptor <Dummy>
kbd1 at kbdmux0
module_register_init: MOD_LOAD (vesa, 0xc0fb0310, 0) error 19
random: live provider: "Intel Secure Key RNG"
random: SOFT: yarrow init()
random: selecting highest priority adaptor <Yarrow>
vtvga0: <vt_vga driver> on motherboard
acpi0: <CORE COREBOOT> on motherboard
acpi0: Power Button (fixed)
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 550
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
Event timer "HPET3" frequency 14318180 Hz quality 440
Event timer "HPET4" frequency 14318180 Hz quality 440
Event timer "HPET5" frequency 14318180 Hz quality 440
Event timer "HPET6" frequency 14318180 Hz quality 440
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
atrtc0: <AT realtime clock> port 0x70-0x77 on acpi0
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0
acpi_ec0: <Embedded Controller: GPE 0x24> port 0x62,0x66 on acpi0
acpi_lid0: <Control Method Lid Switch> on acpi0
acpi_button0: <Power Button> on acpi0
acpi_button1: <Sleep Button> irq 37 on acpi0
acpi_button2: <Sleep Button> irq 38 on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
vgapci0: <VGA-compatible display> port 0x1800-0x183f mem 0xe0000000-0xe03fffff,0xd0000000-0xdfffffff at device 2.0 on pci0
vgapci0: Boot video device
hdac0: <Intel Haswell HDA Controller> mem 0xe0510000-0xe0513fff at device 3.0 on pci0
xhci0: <Intel Panther Point USB 3.0 controller> mem 0xe0500000-0xe050ffff at device 20.0 on pci0
xhci0: 32 byte context size.
xhci0: Port routing mask set to 0xffffffff
usbus0 on xhci0
pci0: <base peripheral, DMA controller> at device 21.0 (no driver attached)
ig4iic0: <Intel Lynx Point-LP I2C Controller-1> mem 0xe051a000-0xe051afff,0xe051b000-0xe051bfff at device 21.1 on pci0
ig4iic0: Using MSI
type 44570140 params 001f1fee general 55000000 (updated 55000004) swltr 00000800 autoltr 00000800 version 3131352a
SS_SCL_HCNT=00000190 LCNT=000001d6 FS_SCL_HCNT=0000003c LCNT=00000082
HOLD        00000001
ig4iic1: <Intel Lynx Point-LP I2C Controller-2> mem 0xe051c000-0xe051cfff,0xe051d000-0xe051dfff at device 21.2 on pci0
ig4iic1: Using MSI
type 44570140 params 001f1fee general 55000000 (updated 55000004) swltr 00000800 autoltr 00000800 version 3131352a
SS_SCL_HCNT=00000190 LCNT=000001d6 FS_SCL_HCNT=0000003c LCNT=00000082
HOLD        00000001
hdac1: <Intel Lynx Point-LP HDA Controller> mem 0xe0514000-0xe0517fff at device 27.0 on pci0
pcib1: <ACPI PCI-PCI bridge> at device 28.0 on pci0
pci1: <ACPI PCI bus> on pcib1
ath0: <Atheros AR946x/AR948x> mem 0xe0400000-0xe047ffff at device 0.0 on pci1
ar9300_attach: calling ar9300_hw_attach
ar9300_hw_attach: calling ar9300_eeprom_attach
ar9300_flash_map: unimplemented for now
Restoring Cal data from DRAM
Restoring Cal data from EEPROM
Restoring Cal data from Flash
Restoring Cal data from Flash
Restoring Cal data from OTP
ar9300_hw_attach: ar9300_eeprom_attach returned 0
ath0: [HT] enabling HT modes
ath0: [HT] enabling short-GI in 20MHz mode
ath0: [HT] 1 stream STBC receive enabled
ath0: [HT] 1 stream STBC transmit enabled
ath0: [HT] 2 RX streams; 2 TX streams
ath0: AR9460 mac 640.2 RF5110 phy 1924.13
ath0: 2GHz radio: 0x0000; 5GHz radio: 0x0000
ehci0: <Intel Lynx Point LP USB 2.0 controller USB> mem 0xe051f800-0xe051fbff at device 29.0 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci0
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
ahci0: <Intel Lynx Point-LP AHCI SATA controller> port 0x1860-0x1867,0x1870-0x1873,0x1868-0x186f,0x1874-0x1877,0x1840-0x185f mem 0xe051f000-0xe051f7ff irq 22 at device 31.2 on pci0
ahci0: AHCI v1.30 with 2 6Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
acpi_tz0: <Thermal Zone> on acpi0
acpi_acad0: <AC Adapter> on acpi0
battery0: <ACPI Control Method Battery> on acpi0
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
pmtimer0 on isa0
ata0: <ATA channel> at port 0x1f0-0x1f7,0x3f6 irq 14 on isa0
ata1: <ATA channel> at port 0x170-0x177,0x376 irq 15 on isa0
ppc0: parallel port not found.
coretemp0: <CPU On-Die Thermal Sensors> on cpu0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
coretemp1: <CPU On-Die Thermal Sensors> on cpu1
est1: <Enhanced SpeedStep Frequency Control> on cpu1
Timecounters tick every 1.000 msec
IP Filter: v5.1.2 initialized.  Default = pass all, Logging = enabled
hdacc0: <Intel Haswell HDA CODEC> at cad 0 on hdac0
hdaa0: <Intel Haswell Audio Function Group> at nid 1 on hdacc0
pcm0: <Intel Haswell (HDMI/DP 8ch)> at nid 3 on hdaa0
smbus0: <System Management Bus> on ig4iic0
usbus0: 5.0Gbps Super Speed USB v3.0
usbus1: 480Mbps High Speed USB v2.0
ugen0.1: <0x8086> at usbus0
uhub0: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
ugen1.1: <Intel> at usbus1
uhub1: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
uhub0: 13 ports with 13 removable, self powered
uhub1: 2 ports with 2 removable, self powered
smbus0: Probed address 0x67
No address ptr set, parent smbus
No address ptr set
isl_probe called on unknown I2C device: 103
ugen0.2: <SunplusIT Inc> at usbus0
ugen1.2: <vendor 0x8087> at usbus1
uhub2: <vendor 0x8087 product 0x8000, class 9/0, rev 2.00/0.04, addr 2> on usbus1
uhub2: 8 ports with 8 removable, self powered
cyapa0: <Cypress APA I2C Trackpad> on smbus0
cyapa0: cyapa init status 8f
cyapa0: CYTRA-103006-00 buttons=LM- res=870x470
smbus1: <System Management Bus> on ig4iic1
smbus1: Probed address 0x44
No address ptr set, parent smbus
No address ptr set
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
cyapa_probe called on unknown I2C device: 68
random: unblocking device.
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
isl0: <ISL Digital Ambient Light Sensor> on smbus1
isl0: Sending command 32
isl0: Sending command 64
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <TS128GMTS400 N0815B> ATA-9 SATA 3.x device
ada0: Serial Number B862500493
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 1024bytes)
ada0: Command Queueing enabled
ada0: 122104MB (250069680 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad4
isl0: Sending command 96
hdacc1: <Realtek (0x0283) HDA CODEC> at cad 0 on hdac1
hdaa1: <Realtek (0x0283) Audio Function Group> at nid 1 on hdacc1
pcm1: <Realtek (0x0283) (Analog 2.0+HP/2.0)> at nid 20,33 and 26,25 on hdaa1
SMP: AP CPU #1 Launched!
Timecounter "TSC" frequency 1396798064 Hz quality 1000
Root mount waiting for: usbus0
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
Root mount waiting for: usbus0
Root mount waiting for: usbus0
Root mount waiting for: usbus0
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
Root mount waiting for: usbus0
Root mount waiting for: usbus0
Root mount waiting for: usbus0
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
ugen0.3: <Unknown> at usbus0 (disconnected)
uhub_reattach_port: could not allocate new device
Trying to mount root from ufs:/dev/ada0p2 [rw,noatime]...
wlan0: Ethernet address: 80:56:f2:83:c1:17
wlan0: link state changed to UP
info: [drm] Initialized drm 1.1.0 20060810
MCA: Bank 0, Status 0x90000040000f0005
MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 2
MCA: CPU 1 COR (1) internal parity error
MCA: Bank 0, Status 0x90000040000f0005
MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 2
MCA: CPU 1 COR (1) internal parity error

-- 
Matthias Apitz, guru at unixarea.de, http://www.unixarea.de/ +49-170-4527211
1989-2014: The Wall was torn down so that we go to war together again.
El Muro ha sido derribado para que nos unimos en ir a la guerra otra vez.
Diese Grenze wurde aufgehoben damit wir gemeinsam wieder in den Krieg ziehen.

