Re: BINIT and BERR signals in MCA

From: Lee MATTHEWS <Lee.MATTHEWS.external_at_stormshield.eu>
Date: Tue, 11 Apr 2023 14:28:35 UTC
Thanks for getting back to me, Eugene.


Both of the cores that I've received seem to die at the same point:


#4  0xffffffff8049a9e3 in panic (fmt=<unavailable>) at ../../../kern/kern_shutdown.c:714
#5  0xffffffff80780a2b in mca_intr () at ../../../x86/x86/mca.c:1193
#6  <signal handler called>
#7  smp_rendezvous_action () at ../../../kern/subr_smp.c:417
#8  0xffffffff804e5f79 in smp_rendezvous_cpus (map=...,
    setup_func=0xffffffff804e5e40 <smp_no_rendezvous_barrier>,
    action_func=0xffffffff80496730 <rm_cleanIPI>,
    teardown_func=0xffffffff804e5e40 <smp_no_rendezvous_barrier>, arg=0xffffffff80cb5048 <g_conf_lock>)
    at ../../../kern/subr_smp.c:554
#9  0xffffffff80496639 in _rm_wlock (rm=0xffffffff80cb5048 <g_conf_lock>)
    at ../../../kern/kern_rmlock.c:551


Do you think the temperature could still be an issue? If it were temperature related, could one not expect the MCA interrupt to occur during other function calls?


I've asked for a log of the CPU temperatures; I'll write back when I get it.


Lee

________________________________
From: Eugene Grosbein <eugen@grosbein.net>
Sent: 11 April 2023 13:59:08
To: Lee MATTHEWS; freebsd-hackers@FreeBSD.org
Subject: Re: BINIT and BERR signals in MCA

11.04.2023 18:45, Lee MATTHEWS wrote:

> Hello,
>
> One of our clients is experiencing problems using one of our products. It runs FreeBSD 11.3 on an Intel Atom Apollo Lake E3930 two core SoC processor.
>
> Occasionally, under very light load, the kernel will panic. I've managed to get a couple of vmcores and I notice via the backtrace that the MCA interrupt is called.
>
> In both vmcores, I also notice that the Inter-Processor Interrupts are not being transferred from one CPU to the other. In addition, the structure mca_internal contains information concerning the state of the MCA status register (value: 0x9000000020000003) for bank 0.
>
> From Intel's software architecture document, the MCA Error Code is 0x0003 "The BINIT# from another processor caused this processor to enter machine check." and the Model Specific Error Code is 0x2000 "1 if BERR is driven."
>
> The Intel document is not clear; could anyone please explain what the BINIT and BERR signals mean? They appear to be related to a bus, but I'm not sure which one. A bus external to the Atom SoC or one of the internal buses within the Atom SoC?
>
> Do you have any ideas of what could generate this type of error? Is it likely a hardware or a software issue?
>
> Thanks in advance.
>
> Best wishes,
> Lee Matthews

I believe this is some hardware issue, probably over-heating. Did you check for thermal sensor values?
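
For reference, the bank 0 status value quoted above (0x9000000020000003) can be split into its fields with a few shifts and masks. The following is only a minimal sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 to 57, model-specific error code in bits 31:16, MCA error code in bits 15:0). The function and dictionary names are hypothetical and are not taken from FreeBSD's mca.c.

```python
# Sketch: split an IA32_MCi_STATUS value into its architectural fields.
# Bit positions follow the Intel SDM layout; the names FLAG_BITS and
# decode_mci_status are illustrative and do not come from FreeBSD.

FLAG_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # a previous error was overwritten
    "UC": 61,     # error was not corrected
    "EN": 60,     # error reporting was enabled for this error
    "MISCV": 59,  # the MISC register holds valid data
    "ADDRV": 58,  # the ADDR register holds valid data
    "PCC": 57,    # processor context may be corrupted
}


def decode_mci_status(status: int) -> dict:
    """Return the flag bits and the two error-code fields of a status value."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["MSCOD"] = (status >> 16) & 0xFFFF  # model-specific error code
    fields["MCACOD"] = status & 0xFFFF         # architectural MCA error code
    return fields


if __name__ == "__main__":
    # Value reported in mca_internal for bank 0 in this thread.
    decoded = decode_mci_status(0x9000000020000003)
    for name, value in decoded.items():
        if isinstance(value, bool):
            print(f"{name:7s} {int(value)}")
        else:
            print(f"{name:7s} 0x{value:04x}")
```

For the value in this thread, the sketch reports VAL and EN set, the other flags clear, MSCOD 0x2000 and MCACOD 0x0003, which matches the two codes quoted from the Intel document.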