Page fault in _mca_init during startup

Konstantin Belousov kostikbel at gmail.com
Thu Feb 4 23:16:44 UTC 2021


On Thu, Feb 04, 2021 at 04:05:42PM -0700, Alan Somers wrote:
> On Thu, Feb 4, 2021 at 3:58 PM Konstantin Belousov <kostikbel at gmail.com>
> wrote:
> 
> > On Thu, Feb 04, 2021 at 01:34:13PM -0800, Matthew Macy wrote:
> > > On Thu, Feb 4, 2021 at 1:31 PM Alan Somers <asomers at freebsd.org> wrote:
> > > >
> > > > After upgrading a machine to FreeBSD 12.2, it hit the following panic
> > on
> > > > its first reboot.  I suspect that a few other servers have hit this
> > too,
> > > > but since it happens before swap is mounted there are no core dumps,
> > and
> > > > they usually reboot immediately.  The code in question hasn't changed
> > since
> > > > 2018.  The panic happened in cmci_monitor at line 930.  Does anybody
> > have
> > > > any suggestions for how I could debug further?  I can't readily
> > reproduce
> > > > it, and I can't dump core, but I'd like to investigate it any way I
> > can.
> > > > The server in question has dual Xeon Gold 6142 CPUs.
> > > >
> > >
> > > I can't actually help :( but I can add a +1  with similar hardware or
> > > equivalent specs. It's not frequent, but it's often enough to be
> > > annoying.
> > > -M
> > >
> > > >     if (!(ctl & MC_CTL2_CMCI_EN))
> > > >         /* This bank does not support CMCI. */
> > > >         return;
> > > >
> > > >     cc = &cmc_state[PCPU_GET(cpuid)][i];    // <- panic here
> > > >
> > > >     /* Determine maximum threshold. */
> > > >
> > > >
> > > > Fatal trap 12: page fault while in kernel mode
> > > > cpuid = 26; apic id = 34
> > > > fault virtual address = 0xd0
> > > > fault code = supervisor read data, page not present
> > > > instruction pointer = 0x20:0xffffffff8125a009
> > > > stack pointer        = 0x28:0xfffffe0000b65f20
> > > > frame pointer        = 0x28:0xfffffe0000b65f50
> > > > code segment = base 0x0, limit 0xfffff, type 0x1b
> > > > = DPL 0, pres 1, long 1, def32 0, gran 1
> > > > processor eflags = resume, IOPL = 0
> > > > current process = 11 (idle: cpu26)
> > > > trap number = 12
> > > > panic: page fault
> > > > cpuid = 26
> > > > time = 1
> > > > KDB: stack backtrace:
> > > > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
> > > > 0xfffffe0000b65be0
> > > > vpanic() at vpanic+0x17b/frame 0xfffffe0000b65c30
> > > > panic() at panic+0x43/frame 0xfffffe0000b65c90
> > > > trap_fatal() at trap_fatal+0x391/frame 0xfffffe0000b65cf0
> > > > trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0000b65d40
> > > > trap() at trap+0x286/frame 0xfffffe0000b65e50
> > > > calltrap() at calltrap+0x8/frame 0xfffffe0000b65e50
> > > > --- trap 0xc, rip = 0xffffffff8125a009, rsp = 0xfffffe0000b65f20, rbp =
> > > > 0xfffffe0000b65f50 ---
> > > > _mca_init() at _mca_init+0x5d9/frame 0xfffffe0000b65f50
> > > > init_secondary_tail() at init_secondary_tail+0xfd/frame
> > 0xfffffe0000b65f80
> > > > init_secondary() at init_secondary+0x2d1/frame 0xfffffe0000b65ff0
> > > > KDB: enter: panic
> > > > [ thread pid 11 tid 100029 ]
> > > > Stopped at      kdb_enter+0x37: movq    $0,0x12bc1f6(%rip)
> >
> > Try this.
> >
> > I think there are no other dependencies in the startup order, but
> > I cannot know that for sure.
> >
> > commit 19584e3d3e9606d591fa30999b370ed758960e8c
> > Author: Konstantin Belousov <kib at FreeBSD.org>
> > Date:   Fri Feb 5 00:56:09 2021 +0200
> >
> >     x86: init mca before APs are started
> >
> > diff --git a/sys/x86/x86/mca.c b/sys/x86/x86/mca.c
> > index 03100e77d455..e2bf2673cf69 100644
> > --- a/sys/x86/x86/mca.c
> > +++ b/sys/x86/x86/mca.c
> > @@ -1371,7 +1371,7 @@ mca_init_bsp(void *arg __unused)
> >
> >         mca_init();
> >  }
> > -SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_ANY, mca_init_bsp, NULL);
> > +SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_SECOND, mca_init_bsp, NULL);
> >
> >  /* Called when a machine check exception fires. */
> >  void
> >
> 
> I can test this patch on development servers, but so far I've only seen the
> crash on production servers.  Do you have any suggestions for how to force
> the crash, or how to test this patch besides simply making sure that my dev
> servers can boot?

The race, as I see it, is that we call mca_init() on the BSP too late,
so the malloc() that provides the storage for the cmc_state array could
run too late, after one of the APs was already IPIed for startup.
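
As a rough cross-check of that reading, the fault address in the trace
fits a cmc_state pointer that is still NULL. This is only a sketch: it
assumes cmc_state is an array of per-CPU pointers and that pointers are
8 bytes on amd64. AMD64_PTR_SIZE and both helper names are made up for
illustration and are not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: pointer size on amd64. Not taken from the kernel sources. */
#define AMD64_PTR_SIZE 8u

/*
 * Address touched when loading cmc_state[cpuid], given that the outer
 * array starts at 'base' and holds one pointer per CPU (an assumption).
 */
static uint64_t
outer_slot_address(uint64_t base, unsigned int cpuid)
{
	return (base + (uint64_t)cpuid * AMD64_PTR_SIZE);
}

/* Inverse: which CPU index would produce 'fault_va' for a given base. */
static unsigned int
implied_cpuid(uint64_t base, uint64_t fault_va)
{
	assert(fault_va >= base);
	return ((unsigned int)((fault_va - base) / AMD64_PTR_SIZE));
}
```

With base 0 and cpuid 26, the slot address is 26 * 8 = 0xd0, which is
the fault virtual address reported for cpu26.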

The patch ensures that the mca_init_bsp() SYSINIT has finished before we
go on to start the APs.
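
To illustrate why the order argument matters: SYSINIT entries are sorted
by subsystem and then by order before they run. The toy model below is a
userspace sketch, not kernel code. The numeric values are illustrative,
and start_aps_sketch is a hypothetical stand-in for whatever SI_SUB_CPU
entry kicks the APs, assumed here to sit in the third order slot.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative values only; the real constants live in sys/kernel.h. */
enum { SUB_CPU = 1 };
enum { ORDER_SECOND = 1, ORDER_THIRD = 2, ORDER_ANY = 0xfffffff };

struct init_entry {
	const char	*name;
	unsigned int	 subsystem;
	unsigned int	 order;
};

/* Sort by subsystem first, then by order within the subsystem. */
static int
init_cmp(const void *a, const void *b)
{
	const struct init_entry *x = a, *y = b;

	if (x->subsystem != y->subsystem)
		return (x->subsystem < y->subsystem ? -1 : 1);
	if (x->order != y->order)
		return (x->order < y->order ? -1 : 1);
	return (0);
}

/*
 * Returns 1 if mca_init_bsp sorts ahead of the hypothetical AP-start
 * entry for the given order value, 0 otherwise.
 */
static int
mca_runs_before_ap_start(unsigned int mca_order)
{
	struct init_entry tab[] = {
		{ "start_aps_sketch", SUB_CPU, ORDER_THIRD },
		{ "mca_init_bsp", SUB_CPU, mca_order },
	};

	/* A tie would leave the relative order unspecified. */
	assert(mca_order != ORDER_THIRD);
	qsort(tab, 2, sizeof(tab[0]), init_cmp);
	return (strcmp(tab[0].name, "mca_init_bsp") == 0);
}
```

Under these assumptions, mca_runs_before_ap_start(ORDER_ANY) is 0 and
mca_runs_before_ap_start(ORDER_SECOND) is 1, which is the effect the
one-line change is after.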

I do not think there is any reliable way to trigger the panic while keeping
the patch usable, except to observe enough successful boots.


More information about the freebsd-stable mailing list