Re: MCE: Does this look possibly like a slot issue?

From: Chris <bsd-lists_at_bsdforge.com>
Date: Tue, 21 Jun 2022 18:23:24 UTC
On 2022-06-20 17:23, Larry Rosenman wrote:
> I'm seeing them constantly:
FWIW it looks like a sync(ing) problem between your
RAM && CPU cache. Are are your clocks set correctly
for your CPU && RAM? Is your CPU too hot? Is the CPU
cache ECC?
> 
> root@freenas[~]# mcelog --dmi
> Hardware event. This is not a software error.
> MCE 0
> CPU 22 BANK 8 TSC 20aab486464a
> MISC ac29890200046444 ADDR ee2f6e800
> TIME 1655770989 Mon Jun 20 19:23:09 2022
> MCG status:
> Memory read ECC error
> Memory corrected error count (CORE_ERR_CNT): 1
> Memory transaction Tracker ID (RTId): 44
> Memory DIMM ID of error: 0
> Memory channel ID of error: 1
> Memory ECC syndrome: ac298902
> STATUS 8c0000400001009f MCGSTATUS 0
> MCGCAP 1c09 APICID 34 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44 Step 2
> WARNING: SMBIOS data is often unreliable. Take with a grain of salt!
> DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> Device Locator: P2-DIMM2C
> Bank Locator: BANK14
> Manufacturer: Hyundai
> Serial Number: 40F3C20F
> Asset Tag:
> Part Number: HMT151R7BFR4C-H9
> Hardware event. This is not a software error.
> MCE 1
> CPU 22 BANK 8 TSC 296dfcc82582
> MISC ac29890200041381 ADDR ee2f6e800
> TIME 1655770989 Mon Jun 20 19:23:09 2022
> MCG status:
> Memory read ECC error
> Memory corrected error count (CORE_ERR_CNT): 1
> Memory transaction Tracker ID (RTId): 81
> Memory DIMM ID of error: 0
> Memory channel ID of error: 1
> Memory ECC syndrome: ac298902
> STATUS 8c0000400001009f MCGSTATUS 0
> MCGCAP 1c09 APICID 34 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44 Step 2
> DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> Device Locator: P2-DIMM2C
> Bank Locator: BANK14
> Manufacturer: Hyundai
> Serial Number: 40F3C20F
> Asset Tag:
> Part Number: HMT151R7BFR4C-H9
> Hardware event. This is not a software error.
> MCE 2
> CPU 22 BANK 8 TSC 2a5604a6a070
> MISC ac29890200044281
> TIME 1655770989 Mon Jun 20 19:23:09 2022
> MCG status:
> Memory ECC error occurred during scrub
> Memory corrected error count (CORE_ERR_CNT): 1
> Memory transaction Tracker ID (RTId): 81
> Memory DIMM ID of error: 0
> Memory channel ID of error: 1
> Memory ECC syndrome: ac298902
> STATUS 88000040000200cf MCGSTATUS 0
> MCGCAP 1c09 APICID 34 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44 Step 2
> Hardware event. This is not a software error.
> MCE 3
> CPU 22 BANK 8 TSC 31e141418eb8
> MISC ac29890200046a4a ADDR ee2f6e800
> TIME 1655770989 Mon Jun 20 19:23:09 2022
> MCG status:
> Memory read ECC error
> Memory corrected error count (CORE_ERR_CNT): 1
> Memory transaction Tracker ID (RTId): 4a
> Memory DIMM ID of error: 0
> Memory channel ID of error: 1
> Memory ECC syndrome: ac298902
> STATUS 8c0000400001009f MCGSTATUS 0
> MCGCAP 1c09 APICID 34 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44 Step 2
> DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> Device Locator: P2-DIMM2C
> Bank Locator: BANK14
> Manufacturer: Hyundai
> Serial Number: 40F3C20F
> Asset Tag:
> Part Number: HMT151R7BFR4C-H9
> Hardware event. This is not a software error.
> MCE 4
> CPU 22 BANK 8 TSC 3a014afee106
> MISC ac29890200046646 ADDR ee2f6e800
> TIME 1655770989 Mon Jun 20 19:23:09 2022
> MCG status:
> Memory read ECC error
> Memory corrected error count (CORE_ERR_CNT): 1
> Memory transaction Tracker ID (RTId): 46
> Memory DIMM ID of error: 0
> Memory channel ID of error: 1
> Memory ECC syndrome: ac298902
> STATUS 8c0000400001009f MCGSTATUS 0
> MCGCAP 1c09 APICID 34 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44 Step 2
> DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> Device Locator: P2-DIMM2C
> Bank Locator: BANK14
> Manufacturer: Hyundai
> Serial Number: 40F3C20F
> Asset Tag:
> Part Number: HMT151R7BFR4C-H9
> Hardware event. This is not a software error.
> MCE 5
> CPU 22 BANK 8 TSC 41d1dbef1a6a
> MISC ac29890200046141 ADDR ee2f6e800
> TIME 1655770989 Mon Jun 20 19:23:09 2022
> MCG status:
> Memory read ECC error
> Memory corrected error count (CORE_ERR_CNT): 1
> Memory transaction Tracker ID (RTId): 41
> Memory DIMM ID of error: 0
> Memory channel ID of error: 1
> Memory ECC syndrome: ac298902
> STATUS 8c0000400001009f MCGSTATUS 0
> MCGCAP 1c09 APICID 34 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44 Step 2
> DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> Device Locator: P2-DIMM2C
> Bank Locator: BANK14
> Manufacturer: Hyundai
> Serial Number: 40F3C20F
> Asset Tag:
> Part Number: HMT151R7BFR4C-H9
> Hardware event. This is not a software error.
> MCE 6
> CPU 22 BANK 8 TSC 4a1b1ecef446
> MISC ac29890200046a4a ADDR ee2f6e800
> TIME 1655770989 Mon Jun 20 19:23:09 2022
> MCG status:
> Memory read ECC error
> Memory corrected error count (CORE_ERR_CNT): 1
> Memory transaction Tracker ID (RTId): 4a
> Memory DIMM ID of error: 0
> Memory channel ID of error: 1
> Memory ECC syndrome: ac298902
> STATUS 8c0000400001009f MCGSTATUS 0
> MCGCAP 1c09 APICID 34 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44 Step 2
> DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> Device Locator: P2-DIMM2C
> Bank Locator: BANK14
> Manufacturer: Hyundai
> Serial Number: 40F3C20F
> Asset Tag:
> Part Number: HMT151R7BFR4C-H9
> Hardware event. This is not a software error.
> MCE 7
> CPU 22 BANK 8 TSC 527bc27db776
> MISC ac29890200040386 ADDR ee2f6e800
> TIME 1655770989 Mon Jun 20 19:23:09 2022
> MCG status:
> Memory read ECC error
> Memory corrected error count (CORE_ERR_CNT): 1
> Memory transaction Tracker ID (RTId): 86
> Memory DIMM ID of error: 0
> Memory channel ID of error: 1
> Memory ECC syndrome: ac298902
> STATUS 8c0000400001009f MCGSTATUS 0
> MCGCAP 1c09 APICID 34 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44 Step 2
> DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> Device Locator: P2-DIMM2C
> Bank Locator: BANK14
> Manufacturer: Hyundai
> Serial Number: 40F3C20F
> Asset Tag:
> Part Number: HMT151R7BFR4C-H9
> Hardware event. This is not a software error.
> MCE 8
> CPU 22 BANK 8 TSC 5aa4ecdd795a
> MISC ac29890200046646 ADDR ee2f6e800
> TIME 1655770989 Mon Jun 20 19:23:09 2022
> MCG status:
> Memory read ECC error
> Memory corrected error count (CORE_ERR_CNT): 1
> Memory transaction Tracker ID (RTId): 46
> Memory DIMM ID of error: 0
> Memory channel ID of error: 1
> Memory ECC syndrome: ac298902
> STATUS 8c0000400001009f MCGSTATUS 0
> MCGCAP 1c09 APICID 34 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44 Step 2
> DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> Device Locator: P2-DIMM2C
> Bank Locator: BANK14
> Manufacturer: Hyundai
> Serial Number: 40F3C20F
> Asset Tag:
> Part Number: HMT151R7BFR4C-H9
> root@freenas[~]#
> 
> and I replaced the DIMM yesterday :(
> 
> On 06/20/2022 7:19 pm, Ultima wrote:
> 
>> Hey Larry,
>> 
>> It is possible it's the motherboard itself, but it's rare. The way I
>> would determine this is to swap the DIMM module with another
>> populated slot on the motherboard and see if the error migrated
>> to the new slot or not. Also, this error doesn't necessarily mean
>> there is a problem that needs to be addressed. If you have been
>> running the system for many months and you see ECC errors a
>> handful of times, it can probably be safely ignored.
>> 
>> Best regards,
>> Richard Gallamore
>> 
>> On Mon, Jun 20, 2022 at 3:14 PM Larry Rosenman <ler@lerctr.org> wrote:
>> 
>>> I've gotten a BUNCH of these on my TrueNAS server.  I've replaced this
>>> DIMM a couple of times, and still the MCE's continue.
>>> Is it possible it's Motherboard slot issue?
>>> 
>>> Hardware event. This is not a software error.
>>> MCE 8
>>> CPU 22 BANK 8 TSC 5aa4ecdd795a
>>> MISC ac29890200046646 ADDR ee2f6e800
>>> TIME 1655762472 Mon Jun 20 17:01:12 2022
>>> MCG status:
>>> Memory read ECC error
>>> Memory corrected error count (CORE_ERR_CNT): 1
>>> Memory transaction Tracker ID (RTId): 46
>>> Memory DIMM ID of error: 0
>>> Memory channel ID of error: 1
>>> Memory ECC syndrome: ac298902
>>> STATUS 8c0000400001009f MCGSTATUS 0
>>> MCGCAP 1c09 APICID 34 SOCKETID 0
>>> CPUID Vendor Intel Family 6 Model 44 Step 2
>>> DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
>>> Device Locator: P2-DIMM2C
>>> Bank Locator: BANK14
>>> Manufacturer: Hyundai
>>> Serial Number: 40F3C20F
>>> Asset Tag:
>>> Part Number: HMT151R7BFR4C-H9
>>> 
>>> --
>>> Larry Rosenman                     http://www.lerctr.org/~ler
>>> Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
>>> US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106