Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

From: Andrew Turner <andrew_at_fubar.geek.nz>
Date: Mon, 07 Mar 2022 16:25:22 UTC
> On 7 Mar 2022, at 15:13, Mark Johnston <markj@freebsd.org> wrote:
> ...
> A (the?) problem is that the compiler is treating "pc" as an alias
> for x18, but the rmlock code assumes that the pcpu pointer is loaded
> once, as it dereferences "pc" outside of the critical section.  On
> arm64, if a context switch occurs between the store at _rm_rlock+144 and
> the load at +152, and the thread is migrated to another CPU, then we'll
> end up using the wrong CPU ID in the rm->rm_writecpus test.
> 
> I suspect the problem is unique to arm64 as its get_pcpu()
> implementation is different from the others in that it doesn't use
> volatile-qualified inline assembly.  This has been the case since
> <https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762>.
> 
> I haven't been able to reproduce any crashes running poudriere in an
> arm64 AWS instance, though.  Could you please try the patch below and
> confirm whether it fixes your panics?  I verified that the apparent
> problem described above is gone with the patch.

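Alternatively (or additionally), we could do something like the following, so that get_pcpu() copies x18 into an ordinary register and the compiler can no longer treat the result as an alias for x18. There are only a few MI (machine-independent) users of get_pcpu(), the main one being the rm lock code.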

diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
index 09f6361c651c..59b890e5c2ea 100644
--- a/sys/arm64/include/pcpu.h
+++ b/sys/arm64/include/pcpu.h
@@ -58,7 +58,14 @@ struct pcpu;

 register struct pcpu *pcpup __asm ("x18");

-#define        get_pcpu()      pcpup
+static inline struct pcpu *
+get_pcpu(void)
+{
+       struct pcpu *pcpu;
+
+       __asm __volatile("mov   %0, x18" : "=&r"(pcpu));
+       return (pcpu);
+}

 static inline struct thread *
 get_curthread(void)
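
To illustrate the difference in plain userland C, here is a rough sketch rather than kernel code. Everything in it is a hypothetical stand-in: the cut-down struct pcpu, cur_pcpu (playing the role of x18), migrate() (playing the role of a context switch onto another CPU), and the two cpuid_*() helpers. It only models the aliasing behaviour Mark described; it is not the rmlock code.

```c
#include <assert.h>

/* Hypothetical, cut-down stand-in for the kernel's struct pcpu. */
struct pcpu {
	int pc_cpuid;
};

static struct pcpu cpus[2] = { { 0 }, { 1 } };

/* Stand-in for x18: always names the pcpu of the CPU we are running on. */
static struct pcpu *volatile cur_pcpu = &cpus[0];

/* Simulate the scheduler migrating the thread to another CPU. */
static void
migrate(int cpu)
{
	assert(cpu >= 0 && cpu < 2);
	cur_pcpu = &cpus[cpu];
}

/*
 * Models the old macro: "pc" is just another name for the register, so
 * each use observes whichever CPU the thread is on at that moment.
 * Returns 1 if both reads saw the same CPU ID, 0 otherwise.
 */
static int
cpuid_aliased(void)
{
#define	pc	cur_pcpu
	int first = pc->pc_cpuid;
	migrate(1);			/* "context switch" between the uses */
	int second = pc->pc_cpuid;
#undef pc
	return (first == second);
}

/*
 * Models the patched get_pcpu(): the pointer is copied once into a
 * local, so later uses keep referring to the same pcpu.
 * Returns 1 if both reads saw the same CPU ID, 0 otherwise.
 */
static int
cpuid_snapshot(void)
{
	struct pcpu *pc = cur_pcpu;	/* single read of the "register" */
	int first = pc->pc_cpuid;
	migrate(1);			/* "context switch" between the uses */
	int second = pc->pc_cpuid;
	return (first == second);
}
```

Starting from CPU 0, cpuid_aliased() sees two different CPU IDs across the simulated migration, while cpuid_snapshot() sees the same one both times.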

Andrew