From nobody Tue Jun 21 16:52:50 2022
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
Sender: owner-freebsd-current@freebsd.org
MIME-Version: 1.0
References: <202206211606.25LG6Out053747@gndrsh.dnsmgr.net>
In-Reply-To: <202206211606.25LG6Out053747@gndrsh.dnsmgr.net>
From: Ultima
Date: Tue, 21 Jun 2022 09:52:50 -0700
Subject: Re: MCE: Does this look possibly like a slot issue?
To: "Rodney W. Grimes"
Cc: Larry Rosenman, Freebsd current
Content-Type: multipart/alternative; boundary="000000000000b43f7505e1f80db6"

--000000000000b43f7505e1f80db6
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Completely agree with you, Rodney. The LGA socket pins on the motherboard
can be bent very easily when moving CPUs, so I wanted to recommend this
last.

Larry, as Rodney mentioned, it's more or less your last option. This
is likely the CPU and not the module itself. There is still a small chance
that it is motherboard/slot related. One way to determine this is
by swapping the CPUs (slot 0 <----> slot 1) and seeing if the error moves.
As I mentioned though, be very cautious. I don't want you to end up in a worse-off
state.

I would reseat the CPU in the problem socket before swapping the CPUs.

Best regards,
Richard Gallamore

On Tue, Jun 21, 2022 at 9:06 AM Rodney W. Grimes <freebsd-rwg@gndrsh.dnsmgr.net> wrote:
>
>
> Swapped 2 DIMMS, now we wait for the ZFS ARC to fill and start using all
> the memory.

Depending on the results of that, one thing that is often overlooked
when trying to troubleshoot memory systems in modern Intel systems
is the fact that the DIMM now talks directly to the CPU chip that
has the memory controller built into it.  THUS these "slot" related
ECC/Parity/blowup errors can actually be the CPU and/or the CPU
socket and/or the seating of the CPU in the socket.

So if the error sticks with the DIMM slot and not the DIMM
module, the next thing I would try would be a CPU chip reseat,
including a good inspection of the socket for a damaged
pin.  Also look at the lands on the CPU chip itself, and you
can even try swapping CPU chips to see if it follows the
CPU or the socket, much as you do with a DIMM.
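
To make the "does it follow the slot or the module" comparison less
error-prone, the mcelog output can be tallied before and after each swap.
The following is only a rough sketch in Python, not part of mcelog: the
function names parse_mcelog and tally are made up for illustration, and
the field patterns are taken from the `mcelog --dmi` text posted in this
thread rather than from any documented format. As mcelog itself warns,
the SMBIOS locator and serial data may be unreliable.

```python
import re
from collections import Counter

# Patterns copied from the mcelog text in this thread (assumed format).
FIELDS = {
    "cpu": re.compile(r"^CPU (\d+) BANK \d+"),
    "bank": re.compile(r"^CPU \d+ BANK (\d+)"),
    "socket": re.compile(r"SOCKETID (\d+)"),
    "channel": re.compile(r"^Memory channel ID of error: (\d+)"),
    "dimm": re.compile(r"^Memory DIMM ID of error: (\d+)"),
    "locator": re.compile(r"^Device Locator: (\S+)"),
    "serial": re.compile(r"^Serial Number: (\S+)"),
}


def parse_mcelog(text):
    """Return one dict per 'MCE n' record; email quote prefixes are ignored."""
    records, current = [], None
    for raw in text.splitlines():
        line = raw.lstrip("> \t").rstrip()
        if re.match(r"^MCE \d+$", line):
            current = {}
            records.append(current)
            continue
        if current is None:
            continue  # text before the first record
        for name, pattern in FIELDS.items():
            match = pattern.search(line)
            if match:
                current[name] = match.group(1)
    return records


def tally(records):
    """Count records per (socket, bank, channel, dimm, locator, serial)."""
    return Counter(
        tuple(r.get(k) for k in
              ("socket", "bank", "channel", "dimm", "locator", "serial"))
        for r in records
    )


SAMPLE = """\
Hardware event. This is not a software error.
MCE 0
CPU 22 BANK 8 TSC 20aab486464a
MISC ac29890200046444 ADDR ee2f6e800
Memory DIMM ID of error: 0
Memory channel ID of error: 1
STATUS 8c0000400001009f MCGSTATUS 0
MCGCAP 1c09 APICID 34 SOCKETID 0
Device Locator: P2-DIMM2C
Serial Number: 40F3C20F
Hardware event. This is not a software error.
MCE 2
CPU 22 BANK 8 TSC 2a5604a6a070
MISC ac29890200044281
Memory DIMM ID of error: 0
Memory channel ID of error: 1
STATUS 88000040000200cf MCGSTATUS 0
MCGCAP 1c09 APICID 34 SOCKETID 0
"""

if __name__ == "__main__":
    for key, count in tally(parse_mcelog(SAMPLE)).items():
        print(count, key)
```

Run it on the output captured before and after a swap. If the locator
stays the same while the serial number changes, the error stayed with the
slot/channel; if the locator moves along with the serial number, it
followed the module. The same comparison on the socket field applies to a
CPU swap.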


>
> On 06/20/2022 7:59 pm, Larry Rosenman wrote:
>
> > SuperMicro X8DTN+
> >
> > 2 Processors, 6-core/12-Thread. CPU: Intel(R) Xeon(R) CPU
> > E5645 @ 2.40GHz (2400.20-MHz K8-class CPU)
> >
> > I'll bring it down and swap DIMMS around
> >
> > On 06/20/2022 7:57 pm, Ultima wrote:
> >
> > Hey Larry,
> >
> > One red flag I am seeing is that the error is being produced on
> > the same CPU/bank with each error you have provided so far.
> >
> > Can you try and follow my original recommendation and swap a
> > currently installed DIMM with the one in the problem DIMM slot and see
> > if anything changes?
> >
> > Can you also provide the motherboard model? Also, do you
> > have multiple CPUs installed in this system?
> >
> > Best regards,
> > Richard Gallamore
> >
> > On Mon, Jun 20, 2022 at 5:41 PM Larry Rosenman <ler@lerctr.org> wrote:
> >
> > Yes and Yes.
> >
> > On 06/20/2022 7:37 pm, Ultima wrote:
> >
> > Are you sure that the module you replaced it with was good?
> > Are you sure you replaced the correct module?
> >
> > Best regards,
> > Richard Gallamore
> >
> > On Mon, Jun 20, 2022 at 5:23 PM Larry Rosenman <ler@lerctr.org> wrote:
> >
> > I'm seeing them constantly:
> >
> > root@freenas[~]# mcelog --dmi
> > Hardware event. This is not a software error.
> > MCE 0
> > CPU 22 BANK 8 TSC 20aab486464a
> > MISC ac29890200046444 ADDR ee2f6e800
> > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > MCG status:
> > Memory read ECC error
> > Memory corrected error count (CORE_ERR_CNT): 1
> > Memory transaction Tracker ID (RTId): 44
> > Memory DIMM ID of error: 0
> > Memory channel ID of error: 1
> > Memory ECC syndrome: ac298902
> > STATUS 8c0000400001009f MCGSTATUS 0
> > MCGCAP 1c09 APICID 34 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44 Step 2
> > WARNING: SMBIOS data is often unreliable. Take with a grain of salt!
> > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > Device Locator: P2-DIMM2C
> > Bank Locator: BANK14
> > Manufacturer: Hyundai
> > Serial Number: 40F3C20F
> > Asset Tag:
> > Part Number: HMT151R7BFR4C-H9
> > Hardware event. This is not a software error.
> > MCE 1
> > CPU 22 BANK 8 TSC 296dfcc82582
> > MISC ac29890200041381 ADDR ee2f6e800
> > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > MCG status:
> > Memory read ECC error
> > Memory corrected error count (CORE_ERR_CNT): 1
> > Memory transaction Tracker ID (RTId): 81
> > Memory DIMM ID of error: 0
> > Memory channel ID of error: 1
> > Memory ECC syndrome: ac298902
> > STATUS 8c0000400001009f MCGSTATUS 0
> > MCGCAP 1c09 APICID 34 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44 Step 2
> > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > Device Locator: P2-DIMM2C
> > Bank Locator: BANK14
> > Manufacturer: Hyundai
> > Serial Number: 40F3C20F
> > Asset Tag:
> > Part Number: HMT151R7BFR4C-H9
> > Hardware event. This is not a software error.
> > MCE 2
> > CPU 22 BANK 8 TSC 2a5604a6a070
> > MISC ac29890200044281
> > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > MCG status:
> > Memory ECC error occurred during scrub
> > Memory corrected error count (CORE_ERR_CNT): 1
> > Memory transaction Tracker ID (RTId): 81
> > Memory DIMM ID of error: 0
> > Memory channel ID of error: 1
> > Memory ECC syndrome: ac298902
> > STATUS 88000040000200cf MCGSTATUS 0
> > MCGCAP 1c09 APICID 34 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44 Step 2
> > Hardware event. This is not a software error.
> > MCE 3
> > CPU 22 BANK 8 TSC 31e141418eb8
> > MISC ac29890200046a4a ADDR ee2f6e800
> > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > MCG status:
> > Memory read ECC error
> > Memory corrected error count (CORE_ERR_CNT): 1
> > Memory transaction Tracker ID (RTId): 4a
> > Memory DIMM ID of error: 0
> > Memory channel ID of error: 1
> > Memory ECC syndrome: ac298902
> > STATUS 8c0000400001009f MCGSTATUS 0
> > MCGCAP 1c09 APICID 34 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44 Step 2
> > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > Device Locator: P2-DIMM2C
> > Bank Locator: BANK14
> > Manufacturer: Hyundai
> > Serial Number: 40F3C20F
> > Asset Tag:
> > Part Number: HMT151R7BFR4C-H9
> > Hardware event. This is not a software error.
> > MCE 4
> > CPU 22 BANK 8 TSC 3a014afee106
> > MISC ac29890200046646 ADDR ee2f6e800
> > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > MCG status:
> > Memory read ECC error
> > Memory corrected error count (CORE_ERR_CNT): 1
> > Memory transaction Tracker ID (RTId): 46
> > Memory DIMM ID of error: 0
> > Memory channel ID of error: 1
> > Memory ECC syndrome: ac298902
> > STATUS 8c0000400001009f MCGSTATUS 0
> > MCGCAP 1c09 APICID 34 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44 Step 2
> > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > Device Locator: P2-DIMM2C
> > Bank Locator: BANK14
> > Manufacturer: Hyundai
> > Serial Number: 40F3C20F
> > Asset Tag:
> > Part Number: HMT151R7BFR4C-H9
> > Hardware event. This is not a software error.
> > MCE 5
> > CPU 22 BANK 8 TSC 41d1dbef1a6a
> > MISC ac29890200046141 ADDR ee2f6e800
> > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > MCG status:
> > Memory read ECC error
> > Memory corrected error count (CORE_ERR_CNT): 1
> > Memory transaction Tracker ID (RTId): 41
> > Memory DIMM ID of error: 0
> > Memory channel ID of error: 1
> > Memory ECC syndrome: ac298902
> > STATUS 8c0000400001009f MCGSTATUS 0
> > MCGCAP 1c09 APICID 34 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44 Step 2
> > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > Device Locator: P2-DIMM2C
> > Bank Locator: BANK14
> > Manufacturer: Hyundai
> > Serial Number: 40F3C20F
> > Asset Tag:
> > Part Number: HMT151R7BFR4C-H9
> > Hardware event. This is not a software error.
> > MCE 6
> > CPU 22 BANK 8 TSC 4a1b1ecef446
> > MISC ac29890200046a4a ADDR ee2f6e800
> > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > MCG status:
> > Memory read ECC error
> > Memory corrected error count (CORE_ERR_CNT): 1
> > Memory transaction Tracker ID (RTId): 4a
> > Memory DIMM ID of error: 0
> > Memory channel ID of error: 1
> > Memory ECC syndrome: ac298902
> > STATUS 8c0000400001009f MCGSTATUS 0
> > MCGCAP 1c09 APICID 34 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44 Step 2
> > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > Device Locator: P2-DIMM2C
> > Bank Locator: BANK14
> > Manufacturer: Hyundai
> > Serial Number: 40F3C20F
> > Asset Tag:
> > Part Number: HMT151R7BFR4C-H9
> > Hardware event. This is not a software error.
> > MCE 7
> > CPU 22 BANK 8 TSC 527bc27db776
> > MISC ac29890200040386 ADDR ee2f6e800
> > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > MCG status:
> > Memory read ECC error
> > Memory corrected error count (CORE_ERR_CNT): 1
> > Memory transaction Tracker ID (RTId): 86
> > Memory DIMM ID of error: 0
> > Memory channel ID of error: 1
> > Memory ECC syndrome: ac298902
> > STATUS 8c0000400001009f MCGSTATUS 0
> > MCGCAP 1c09 APICID 34 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44 Step 2
> > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > Device Locator: P2-DIMM2C
> > Bank Locator: BANK14
> > Manufacturer: Hyundai
> > Serial Number: 40F3C20F
> > Asset Tag:
> > Part Number: HMT151R7BFR4C-H9
> > Hardware event. This is not a software error.
> > MCE 8
> > CPU 22 BANK 8 TSC 5aa4ecdd795a
> > MISC ac29890200046646 ADDR ee2f6e800
> > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > MCG status:
> > Memory read ECC error
> > Memory corrected error count (CORE_ERR_CNT): 1
> > Memory transaction Tracker ID (RTId): 46
> > Memory DIMM ID of error: 0
> > Memory channel ID of error: 1
> > Memory ECC syndrome: ac298902
> > STATUS 8c0000400001009f MCGSTATUS 0
> > MCGCAP 1c09 APICID 34 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44 Step 2
> > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > Device Locator: P2-DIMM2C
> > Bank Locator: BANK14
> > Manufacturer: Hyundai
> > Serial Number: 40F3C20F
> > Asset Tag:
> > Part Number: HMT151R7BFR4C-H9
> > root@freenas[~]#
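
The STATUS words in the output above can also be decoded by hand. This is
a small Python sketch, assuming the architectural IA32_MCi_STATUS layout
as I understand it from the Intel SDM (bit 63 VAL, 62 OVER, 61 UC, 60 EN,
59 MISCV, 58 ADDRV, 57 PCC, bits 52:38 corrected error count, bits 31:16
model-specific code, bits 15:0 MCA error code). decode_mci_status is a
made-up name, and the corrected-count field is only meaningful on CPUs
that report it, so treat the result as a cross-check of what mcelog
prints, not as an authoritative decoder.

```python
# MMM sub-field of a memory controller error code (0000 0000 1MMM CCCC);
# values as assumed from the Intel SDM encoding table.
MEMORY_TRANSACTIONS = {
    0: "generic",
    1: "read",
    2: "write",
    3: "address/command",
    4: "scrub",
}


def decode_mci_status(value):
    """Decode the architectural bits of a 64-bit IA32_MCi_STATUS value."""
    mca_code = value & 0xFFFF
    fields = {
        "valid": bool((value >> 63) & 1),
        "overflow": bool((value >> 62) & 1),
        "uncorrected": bool((value >> 61) & 1),
        "enabled": bool((value >> 60) & 1),
        "misc_valid": bool((value >> 59) & 1),
        "addr_valid": bool((value >> 58) & 1),
        "context_corrupt": bool((value >> 57) & 1),
        "corrected_count": (value >> 38) & 0x7FFF,
        "model_specific_code": (value >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory controller errors have bit 7 set and bits 15:8 clear
    # (bit 12 is the filtering bit and is ignored here).
    if (mca_code & 0xEF80) == 0x0080:
        mmm = (mca_code >> 4) & 0x7
        cccc = mca_code & 0xF
        fields["memory_transaction"] = MEMORY_TRANSACTIONS.get(mmm, "reserved")
        # 0xF means the channel is not specified in this field.
        fields["memory_channel"] = None if cccc == 0xF else cccc
    return fields


if __name__ == "__main__":
    for status in (0x8C0000400001009F, 0x88000040000200CF):
        print(hex(status), decode_mci_status(status))
```

For the two STATUS values in this thread the sketch agrees with mcelog:
8c0000400001009f decodes as a corrected memory read error with a valid
ADDR, and 88000040000200cf as a corrected scrub error without ADDR, which
matches the missing ADDR in MCE 2.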
> >
> > and I replaced the DIMM yesterday :(
> >
> > On 06/20/2022 7:19 pm, Ultima wrote:
> >
> > Hey Larry,
> >
> > It is possible it's the motherboard itself, but it's rare. The way I
> > would determine this is to swap the DIMM module with one in another
> > populated slot on the motherboard and see if the error migrates
> > to the new slot or not. Also, this error doesn't necessarily mean
> > there is a problem that needs to be addressed. If you have been
> > running the system for many months and you see ECC errors a
> > handful of times, it can probably be safely ignored.
> >
> > Best regards,
> > Richard Gallamore
> >
> > On Mon, Jun 20, 2022 at 3:14 PM Larry Rosenman <ler@lerctr.org> wrote:
> > I've gotten a BUNCH of these on my TrueNAS server.  I've replaced this
> > DIMM a couple of times, and still the MCEs continue.
> > Is it possible it's a motherboard slot issue?
> >
> > Hardware event. This is not a software error.
> > MCE 8
> > CPU 22 BANK 8 TSC 5aa4ecdd795a
> > MISC ac29890200046646 ADDR ee2f6e800
> > TIME 1655762472 Mon Jun 20 17:01:12 2022
> > MCG status:
> > Memory read ECC error
> > Memory corrected error count (CORE_ERR_CNT): 1
> > Memory transaction Tracker ID (RTId): 46
> > Memory DIMM ID of error: 0
> > Memory channel ID of error: 1
> > Memory ECC syndrome: ac298902
> > STATUS 8c0000400001009f MCGSTATUS 0
> > MCGCAP 1c09 APICID 34 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44 Step 2
> > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > Device Locator: P2-DIMM2C
> > Bank Locator: BANK14
> > Manufacturer: Hyundai
> > Serial Number: 40F3C20F
> > Asset Tag:
> > Part Number: HMT151R7BFR4C-H9
> >
> > --
> > Larry Rosenman                     http://www.lerctr.org/~ler
> > Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
> > US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106

--
Rod Grimes                                                 rgrimes@freebsd.org
--000000000000b43f7505e1f80db6--