Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

From: bob prohaska <fbsd_at_www.zefox.net>
Date: Tue, 08 Mar 2022 15:42:04 UTC

On Mon, Mar 07, 2022 at 11:45:02AM -0500, Mark Johnston wrote:
> On Mon, Mar 07, 2022 at 04:25:22PM +0000, Andrew Turner wrote:
> > 
> > > On 7 Mar 2022, at 15:13, Mark Johnston <markj@freebsd.org> wrote:
> > > ...
> > > A (the?) problem is that the compiler is treating "pc" as an alias
> > > for x18, but the rmlock code assumes that the pcpu pointer is loaded
> > > once, as it dereferences "pc" outside of the critical section.  On
> > > arm64, if a context switch occurs between the store at _rm_rlock+144 and
> > > the load at +152, and the thread is migrated to another CPU, then we'll
> > > end up using the wrong CPU ID in the rm->rm_writecpus test.
> > > 
> > > I suspect the problem is unique to arm64 as its get_pcpu()
> > > implementation is different from the others in that it doesn't use
> > > volatile-qualified inline assembly.  This has been the case since
> > > https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
> > > .
> > > 
> > > I haven't been able to reproduce any crashes running poudriere in an
> > > arm64 AWS instance, though.  Could you please try the patch below and
> > > confirm whether it fixes your panics?  I verified that the apparent
> > > problem described above is gone with the patch.
> > 
> > Alternatively (or additionally) we could do something like the following. There are only a few MI users of get_pcpu with the main place being in rm locks.
> > 
> > diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
> > index 09f6361c651c..59b890e5c2ea 100644
> > --- a/sys/arm64/include/pcpu.h
> > +++ b/sys/arm64/include/pcpu.h
> > @@ -58,7 +58,14 @@ struct pcpu;
> > 
> >  register struct pcpu *pcpup __asm ("x18");
> > 
> > -#define        get_pcpu()      pcpup
> > +static inline struct pcpu *
> > +get_pcpu(void)
> > +{
> > +       struct pcpu *pcpu;
> > +
> > +       __asm __volatile("mov   %0, x18" : "=&r"(pcpu));
> > +       return (pcpu);
> > +}
> > 
> >  static inline struct thread *
> >  get_curthread(void)
> 
> Indeed, I think this is probably the best solution.
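
For anyone following along, here is a rough user-space sketch of the hazard
Mark describes above: if "pc" is just another name for the register, each use
re-reads it, so a migration between two uses yields two different CPUs. It is
not kernel code. struct fake_pcpu, fake_x18, fake_migrate() and the two
check functions are made-up names for illustration, and a volatile global
stands in for x18 on the assumption that the compiler may re-read the
register at every use.

```c
#include <assert.h>
#include <stdbool.h>

#define	NCPU	2

/* Hypothetical stand-in for struct pcpu; only the CPU ID matters here. */
struct fake_pcpu {
	int cpuid;
};

static struct fake_pcpu cpus[NCPU] = { { 0 }, { 1 } };

/* Stand-in for x18: points at the pcpu of the CPU the thread runs on. */
static struct fake_pcpu *volatile fake_x18 = &cpus[0];

/* Simulates the scheduler moving the thread to another CPU. */
static void
fake_migrate(int cpu)
{
	assert(cpu >= 0 && cpu < NCPU);
	fake_x18 = &cpus[cpu];
}

/* Like "#define get_pcpu() pcpup": every use re-reads the register. */
#define	GET_PCPU_ALIAS()	(fake_x18)

/* Like the proposed inline function: the value is copied once. */
static inline struct fake_pcpu *
get_pcpu_snapshot(void)
{
	struct fake_pcpu *pc = fake_x18;

	return (pc);
}

/* True if both uses of "pc" saw the same CPU, using the aliasing form. */
static bool
consistent_with_alias(void)
{
	int first, second;

	fake_migrate(0);
	first = GET_PCPU_ALIAS()->cpuid;	/* first use, on CPU 0 */
	fake_migrate(1);			/* preempted and migrated */
	second = GET_PCPU_ALIAS()->cpuid;	/* second use re-reads: CPU 1 */
	return (first == second);
}

/* True if both uses of "pc" saw the same CPU, using a one-time copy. */
static bool
consistent_with_snapshot(void)
{
	struct fake_pcpu *pc;
	int first, second;

	fake_migrate(0);
	pc = get_pcpu_snapshot();		/* loaded exactly once */
	first = pc->cpuid;
	fake_migrate(1);			/* migration no longer matters */
	second = pc->cpuid;
	return (first == second);
}
```

In this model consistent_with_alias() returns false and
consistent_with_snapshot() returns true, which is the difference the
volatile inline asm in the patch is meant to enforce.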

Just for fun I tried the patch on a Pi3 running -current, updated a day or two
prior. The patch applied and compiled, and the system seemed to run acceptably,
but when I left a -j2 -DWITH_META_MODE buildworld running it crashed overnight,
reporting:


login: panic: rm_rlock: recursed on non-recursive rmlock sysctl lock @ /usr/src/sys/kern/kern_sysctl.c:193

cpuid = 0
time = 1646720264
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x174
panic() at panic+0x44
_rm_rlock_debug() at _rm_rlock_debug+0x214
sysctl_root_handler_locked() at sysctl_root_handler_locked+0x140
sysctl_root() at sysctl_root+0x1ac
userland_sysctl() at userland_sysctl+0x140
sys___sysctl() at sys___sysctl+0x68
do_el0_sync() at do_el0_sync+0x520
handle_el0_sync() at handle_el0_sync+0x40
--- exception, esr 0x56000000
KDB: enter: panic
[ thread pid 869 tid 100091 ]
Stopped at      kdb_enter+0x44: undefined       f902011f


I tried typing bt at the debugger prompt but got no further output.

I've put the buildworld log file at
http://www.zefox.net/~fbsd/rpi3/crashes/20220307/

Hope this is of some use....

bob prohaska