Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))
Date: Tue, 08 Mar 2022 12:26:05 UTC
> On 7 Mar 2022, at 19:04, Mark Johnston <markj@freebsd.org> wrote:
>
> On Mon, Mar 07, 2022 at 10:03:51AM -0800, Mark Millard wrote:
>>
>>
>> On 2022-Mar-7, at 08:45, Mark Johnston <markj@FreeBSD.org> wrote:
>>
>>> On Mon, Mar 07, 2022 at 04:25:22PM +0000, Andrew Turner wrote:
>>>>
>>>>> On 7 Mar 2022, at 15:13, Mark Johnston <markj@freebsd.org> wrote:
>>>>> ...
>>>>> A (the?) problem is that the compiler is treating "pc" as an alias
>>>>> for x18, but the rmlock code assumes that the pcpu pointer is loaded
>>>>> once, as it dereferences "pc" outside of the critical section. On
>>>>> arm64, if a context switch occurs between the store at _rm_rlock+144 and
>>>>> the load at +152, and the thread is migrated to another CPU, then we'll
>>>>> end up using the wrong CPU ID in the rm->rm_writecpus test.
>>>>>
>>>>> I suspect the problem is unique to arm64 as its get_pcpu()
>>>>> implementation is different from the others in that it doesn't use
>>>>> volatile-qualified inline assembly. This has been the case since
>>>>> https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
>>>>> .
>>>>>
>>>>> I haven't been able to reproduce any crashes running poudriere in an
>>>>> arm64 AWS instance, though. Could you please try the patch below and
>>>>> confirm whether it fixes your panics? I verified that the apparent
>>>>> problem described above is gone with the patch.
>>>>
>>>> Alternatively (or additionally) we could do something like the following. There are only a few MI users of get_pcpu(), the main one being the rm lock code.
>>>>
>>>> diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
>>>> index 09f6361c651c..59b890e5c2ea 100644
>>>> --- a/sys/arm64/include/pcpu.h
>>>> +++ b/sys/arm64/include/pcpu.h
>>>> @@ -58,7 +58,14 @@ struct pcpu;
>>>>
>>>> register struct pcpu *pcpup __asm ("x18");
>>>>
>>>> -#define get_pcpu() pcpup
>>>> +static inline struct pcpu *
>>>> +get_pcpu(void)
>>>> +{
>>>> + struct pcpu *pcpu;
>>>> +
>>>> + __asm __volatile("mov %0, x18" : "=&r"(pcpu));
>>>> + return (pcpu);
>>>> +}
>>>>
>>>> static inline struct thread *
>>>> get_curthread(void)
>>>
>>> Indeed, I think this is probably the best solution.
I’ve pushed the above to git in ed3066342660 & will MFC in a few days.
>
> Thinking a bit more, even with that patch, code like this may not behave
> the same on arm64 as on other platforms:
>
> critical_enter();
> ptr = &PCPU_GET(foo);
> critical_exit();
> bar = *ptr;
>
> since as far as I can see the compiler may translate it to
>
> critical_enter();
> critical_exit();
> bar = PCPU_GET(foo);
If we think this will be a problem, we could change the PCPU_PTR macro to use get_pcpu() again. However, I only see two places where it’s used in MI code: subr_witness.c and kern_clock.c. Neither appears to be problematic from a quick look, as neither uses it inside a critical section, although I’m not familiar enough with the code to know for certain.
Andrew
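
For readers following along, the difference between the alias-style get_pcpu() and the snapshot-style one can be sketched in plain userspace C. This is only a model under stated assumptions, not kernel code: model_x18 stands in for the x18 register, model_migrate() stands in for a context switch that moves the thread to another CPU, and every name prefixed with model_ or MODEL_ is hypothetical. The alias case re-reads the register stand-in at the point of use, which models what the compiler may do when it treats "pc" as just another name for x18.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only; none of these names exist in the kernel. */
struct model_pcpu {
	int pc_cpuid;
	int pc_foo;
};

static struct model_pcpu model_cpus[2] = {
	{ .pc_cpuid = 0, .pc_foo = 100 },
	{ .pc_cpuid = 1, .pc_foo = 200 },
};

/* Stand-in for x18: its value changes when the "thread" migrates. */
static struct model_pcpu *volatile model_x18 = &model_cpus[0];

/* Alias style: every use re-reads the register stand-in. */
#define	MODEL_ALIAS_PCPU()	(model_x18)

/* Snapshot style: read the stand-in once and return a plain pointer. */
static inline struct model_pcpu *
model_get_pcpu(void)
{
	struct model_pcpu *pc = model_x18;

	return (pc);
}

/* Stand-in for a context switch that lands on another CPU. */
static void
model_migrate(int cpu)
{
	model_x18 = &model_cpus[cpu];
}

/* The use after migration goes through the alias, so CPU 1's data is seen. */
static int
model_read_via_alias(void)
{
	model_migrate(0);
	/* A critical section would end here; the thread may now migrate. */
	model_migrate(1);
	return (MODEL_ALIAS_PCPU()->pc_foo);
}

/* The pointer was captured before migration, so CPU 0's data is seen. */
static int
model_read_via_snapshot(void)
{
	struct model_pcpu *pc;

	model_migrate(0);
	pc = model_get_pcpu();
	model_migrate(1);
	return (pc->pc_foo);
}
```

In this model, model_read_via_alias() returns CPU 1's pc_foo (200) while model_read_via_snapshot() returns CPU 0's (100). The second is the behaviour the rmlock code relies on: it assumes the pcpu pointer is loaded once.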