Re: MCE: Does this look possibly like a slot issue?

From: Ultima <ultima1252_at_gmail.com>
Date: Tue, 21 Jun 2022 16:52:50 UTC
Completely agree with you, Rodney. The LGA pins on the motherboard
can be bent very easily when moving a CPU, so I wanted to recommend
this last.

Larry, as Rodney mentioned, it's more or less your last option. This
is likely the CPU and not the module itself. There is still a small chance
that it is motherboard/slot related. One way to determine this is by
swapping the CPUs (socket 0 <----> socket 1) and seeing if the error moves.
As I mentioned though, be very cautious. I don't want you to end up in a
worse-off state.
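
For reference, the isolation logic discussed in this thread can be written
down as a small sketch. The function name and the returned strings are
made up for illustration, and it assumes only one component is at fault:

```python
from typing import Optional


def likely_culprit(follows_dimm: bool, follows_cpu: Optional[bool]) -> str:
    """Interpret the swap tests described in this thread.

    follows_dimm: after swapping DIMMs, the error moved with the module.
    follows_cpu:  after swapping CPUs, the error moved with the CPU chip;
                  None means the CPU swap has not been tried yet.
    """
    if follows_dimm:
        # The error travelled with the module, so the module is suspect.
        return "DIMM module"
    if follows_cpu is None:
        # The error stayed with the slot; the CPU side is next to check.
        return "reseat CPU and inspect socket, then swap CPUs"
    if follows_cpu:
        # The memory controller lives in the CPU, so the chip is suspect.
        return "CPU"
    # The error stayed put through both swaps.
    return "CPU socket or motherboard"


print(likely_culprit(follows_dimm=False, follows_cpu=None))
```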

I would reseat the CPU in the problem socket before swapping the CPUs.
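
To judge whether a reseat or swap changed anything, it may help to count
the mcelog records by location before and after. The following is a rough
sketch only: it assumes raw (unquoted) mcelog text in the same layout as
the output posted earlier in this thread, and the names tally_locations
and SAMPLE are invented here.

```python
import re
from collections import Counter

# Two abbreviated records, using field lines as they appear in this thread.
SAMPLE = """\
CPU 22 BANK 8 TSC 20aab486464a
Memory DIMM ID of error: 0
Memory channel ID of error: 1
MCGCAP 1c09 APICID 34 SOCKETID 0
CPU 22 BANK 8 TSC 296dfcc82582
Memory DIMM ID of error: 0
Memory channel ID of error: 1
MCGCAP 1c09 APICID 34 SOCKETID 0
"""

# A record is assumed to start at a "CPU <n> BANK <n>" line.
RECORD_START = re.compile(r"^CPU \d+ BANK \d+", re.MULTILINE)
FIELD_PATTERNS = (
    re.compile(r"SOCKETID (\d+)"),
    re.compile(r"Memory channel ID of error: (\d+)"),
    re.compile(r"Memory DIMM ID of error: (\d+)"),
)


def _field(pattern, record):
    """Return the captured integer, or None if the field is absent."""
    match = pattern.search(record)
    return int(match.group(1)) if match else None


def tally_locations(text):
    """Count records per (socket, channel, dimm) tuple."""
    starts = [m.start() for m in RECORD_START.finditer(text)]
    counts = Counter()
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        record = text[begin:end]
        key = tuple(_field(p, record) for p in FIELD_PATTERNS)
        counts[key] += 1
    return counts


for location, count in tally_locations(SAMPLE).items():
    print(location, count)
```

If the dominant (socket, channel, dimm) tuple is unchanged after a DIMM
swap, that points away from the module, which is the situation described
above.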

Best regards,
Richard Gallamore

On Tue, Jun 21, 2022 at 9:06 AM Rodney W. Grimes <freebsd-rwg@gndrsh.dnsmgr.net> wrote:

> >
> >
> > Swapped 2 DIMMS, now we wait for the ZFS ARC to fill and start using all
> > the memory.
>
> Depending on the results of that, one thing that is often overlooked
> when trying to troubleshoot memory systems in modern Intel systems
> is the fact that the DIMM now talks directly to the CPU chip that
> has the memory controller built into it.  THUS these "slot" related
> ECC/Parity/blowup errors can actually be the CPU and/or the CPU
> socket and/or the seating of the CPU in the socket.
>
> So if the error sticks with the DIMM slot and not the DIMM
> module the next thing I would try would be a CPU chip reseat,
> including a good inspection of the socket for a damaged
> pin.  Also look at the lands on the CPU chip itself, and you
> can even try swapping CPU chips to see if it follows the
> CPU or the socket, much as you do with a DIMM.
>
>
> >
> > On 06/20/2022 7:59 pm, Larry Rosenman wrote:
> >
> > > SuperMicro X8DTN+
> > >
> > > 2 Processors, 6-core/12-Thread. CPU: Intel(R) Xeon(R) CPU
> > > E5645  @ 2.40GHz (2400.20-MHz K8-class CPU)
> > >
> > > I'll bring it down and swap DIMMS around
> > >
> > > On 06/20/2022 7:57 pm, Ultima wrote:
> > >
> > > Hey Larry,
> > >
> > > One red flag I am seeing is that the error is being produced on
> > > the same CPU/bank with each error you have provided so far.
> > >
> > > > Can you try and follow my original recommendation and swap
> > > > a currently installed DIMM into the problem DIMM slot and see
> > > > if anything changes?
> > >
> > > Can you also provide the motherboard model? Also, do you
> > > have multiple CPUs installed in this system?
> > >
> > > Best regards,
> > > Richard Gallamore
> > >
> > > On Mon, Jun 20, 2022 at 5:41 PM Larry Rosenman <ler@lerctr.org> wrote:
> > >
> > > Yes and Yes.
> > >
> > > On 06/20/2022 7:37 pm, Ultima wrote:
> > >
> > > Are you sure that the module you replaced it with was good?
> > > Are you sure you replaced the correct module?
> > >
> > > Best regards,
> > > Richard Gallamore
> > >
> > > On Mon, Jun 20, 2022 at 5:23 PM Larry Rosenman <ler@lerctr.org> wrote:
> > >
> > > I'm seeing them constantly:
> > >
> > > root@freenas[~]# mcelog --dmi
> > > Hardware event. This is not a software error.
> > > MCE 0
> > > CPU 22 BANK 8 TSC 20aab486464a
> > > MISC ac29890200046444 ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 44
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > WARNING: SMBIOS data is often unreliable. Take with a grain of salt!
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 1
> > > CPU 22 BANK 8 TSC 296dfcc82582
> > > MISC ac29890200041381 ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 81
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 2
> > > CPU 22 BANK 8 TSC 2a5604a6a070
> > > MISC ac29890200044281
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory ECC error occurred during scrub
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 81
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 88000040000200cf MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > Hardware event. This is not a software error.
> > > MCE 3
> > > CPU 22 BANK 8 TSC 31e141418eb8
> > > MISC ac29890200046a4a ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 4a
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 4
> > > CPU 22 BANK 8 TSC 3a014afee106
> > > MISC ac29890200046646 ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 46
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 5
> > > CPU 22 BANK 8 TSC 41d1dbef1a6a
> > > MISC ac29890200046141 ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 41
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 6
> > > CPU 22 BANK 8 TSC 4a1b1ecef446
> > > MISC ac29890200046a4a ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 4a
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 7
> > > CPU 22 BANK 8 TSC 527bc27db776
> > > MISC ac29890200040386 ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 86
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 8
> > > CPU 22 BANK 8 TSC 5aa4ecdd795a
> > > MISC ac29890200046646 ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 46
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > root@freenas[~]#
> > >
> > > and I replaced the DIMM yesterday :(
> > >
> > > On 06/20/2022 7:19 pm, Ultima wrote:
> > >
> > > Hey Larry,
> > >
> > > It is possible it's the motherboard itself, but it's rare. The way I
> > > would determine this is to swap the DIMM module with another
> > > > populated slot on the motherboard and see if the error migrates
> > > to the new slot or not. Also, this error doesn't necessarily mean
> > > there is a problem that needs to be addressed. If you have been
> > > running the system for many months and you see ECC errors a
> > > handful of times, it can probably be safely ignored.
> > >
> > > Best regards,
> > > Richard Gallamore
> > >
> > > > On Mon, Jun 20, 2022 at 3:14 PM Larry Rosenman <ler@lerctr.org> wrote:
> > > > I've gotten a BUNCH of these on my TrueNAS server.  I've replaced this
> > > > DIMM a couple of times, and still the MCEs continue.
> > > > Is it possible it's a motherboard slot issue?
> > >
> > > Hardware event. This is not a software error.
> > > MCE 8
> > > CPU 22 BANK 8 TSC 5aa4ecdd795a
> > > MISC ac29890200046646 ADDR ee2f6e800
> > > TIME 1655762472 Mon Jun 20 17:01:12 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 46
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > >
> > > --
> > > Larry Rosenman                     http://www.lerctr.org/~ler
> > > Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
> > > US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106
> >
>
> --
> Rod Grimes
> rgrimes@freebsd.org
>