From nobody Tue Mar 08 12:26:05 2022 X-Original-To: freebsd-current@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 27C6F1A0E302; Tue, 8 Mar 2022 12:26:14 +0000 (UTC) (envelope-from andrew@fubar.geek.nz) Received: from fry.fubar.geek.nz (fry.fubar.geek.nz [139.59.165.16]) by mx1.freebsd.org (Postfix) with ESMTP id 4KCZLX1LlVz3R6t; Tue, 8 Mar 2022 12:26:11 +0000 (UTC) (envelope-from andrew@fubar.geek.nz) Received: from smtpclient.apple (cpc91232-cmbg18-2-0-cust554.5-4.cable.virginm.net [82.2.126.43]) by fry.fubar.geek.nz (Postfix) with ESMTPSA id B05B04E6E9; Tue, 8 Mar 2022 12:26:10 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fubar.geek.nz; s=mail; t=1646742370; bh=QlmT9QpM8WqrTLMpNA3WqsHqQTGYbmLrfGhj6giEddI=; h=From:Subject:Date:In-Reply-To:Cc:To:References; b=S69YN5Qr/GurddKIFUXN3AyhqA08qFEDoAhibVnsTlyrvTikcm2FL0lS8xwSRJ05W Kq9rpVAhkNJi4jybZIA0z18zP9T2N3Bg4Jfhr1v8YcUdT9z1WzVNcBt2JvsNdYB/1v OD4iMtdscr4E4UfMB7YuoqWQYG9/9DcuLOS4zwW4zorUQS8ZkhcMZfyPbvlUY/SWGJ JDzhvkIHR5JDB0DIujgrwicZdW5rxolwORGq3j8TZ0LFrIXECkxNGLHEUfL04GIOfA 8wf/fvnhEZeHThURC9AptRmzQFup9LWoNEJYuJ3l5IrkOdkWirS3s+SBpxkTjUojwW kyqn9Xu7qZbdFZBHTMHTFi273qedG3eYneN4NHQeyRRXB3i/XLplTjZaooVB61/3vt /RyPofZt7GWUGk30YVd2XjVJ55F+WcTZc5fMg9de8v5M7ZnTZKhBd+gA/H66eR95qk xCpD8GOm7yfd3Mz734oRIEvzSNPx+abAJ4rgvsYGGBkWWvXvv3Cr/FjKw71pV0x+cz 514JNI+8TL9rwn/F/kbCpU4qU18HWvt53mkYgrCCGKT6Ar3sIdG9BSN7Sk6AsFQmV0 z57I3Khh/mmQcl8fqHfGdrDbK7rS6JZZFBHqJlbz6kLzRjX9kKq7BtmG8qLNLt0nkq tq74XaluCbqW7stcxtWtDnHc= From: Andrew Turner Message-Id: Content-Type: multipart/alternative; boundary="Apple-Mail=_27C8038E-6248-4772-9061-7457E0DFDA9B" List-Id: Discussions about the use of FreeBSD-current List-Archive: https://lists.freebsd.org/archives/freebsd-current List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-current@freebsd.org Mime-Version: 1.0 (Mac OS X Mail 15.0 \(3693.60.0.1.1\)) Subject: Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28)) Date: Tue, 8 Mar 2022 12:26:05 +0000 In-Reply-To: Cc: Mark Millard , FreeBSD-STABLE Mailing List , Ronald Klop , bob prohaska , Free BSD , freebsd-current To: Mark Johnston References: <1800459695.1.1646649539521@mailrelay> <132978150.92.1646660769467@mailrelay> <3374E0F8-D712-4ED0-A62B-B6924FC8A5E2@fubar.geek.nz> X-Mailer: Apple Mail (2.3693.60.0.1.1) X-Rspamd-Queue-Id: 4KCZLX1LlVz3R6t X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=pass header.d=fubar.geek.nz header.s=mail header.b=S69YN5Qr; dmarc=pass (policy=none) header.from=fubar.geek.nz; spf=pass (mx1.freebsd.org: domain of andrew@fubar.geek.nz designates 139.59.165.16 as permitted sender) smtp.mailfrom=andrew@fubar.geek.nz X-Spamd-Result: default: False [-3.40 / 15.00]; RCVD_VIA_SMTP_AUTH(0.00)[]; ARC_NA(0.00)[]; R_DKIM_ALLOW(-0.20)[fubar.geek.nz:s=mail]; FREEFALL_USER(0.00)[andrew]; FROM_HAS_DN(0.00)[]; MV_CASE(0.50)[]; R_SPF_ALLOW(-0.20)[+mx]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; NEURAL_HAM_LONG(-1.00)[-0.999]; TO_MATCH_ENVRCPT_SOME(0.00)[]; TO_DN_ALL(0.00)[]; DKIM_TRACE(0.00)[fubar.geek.nz:+]; DMARC_POLICY_ALLOW(-0.50)[fubar.geek.nz,none]; RCPT_COUNT_SEVEN(0.00)[7]; NEURAL_HAM_SHORT(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-0.998]; MLMMJ_DEST(0.00)[freebsd-arm,freebsd-current,freebsd-stable]; RCVD_NO_TLS_LAST(0.10)[]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:14061, ipnet:139.59.160.0/20, country:US]; FREEMAIL_CC(0.00)[yahoo.com,freebsd.org,klop.ws,www.zefox.net]; MID_RHS_MATCH_FROM(0.00)[]; RCVD_COUNT_TWO(0.00)[2] X-ThisMailContainsUnwantedMimeParts: N --Apple-Mail=_27C8038E-6248-4772-9061-7457E0DFDA9B Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 > On 7 Mar 2022, at 19:04, Mark Johnston wrote: >=20 > On Mon, Mar 07, 2022 at 10:03:51AM -0800, Mark Millard wrote: >>=20 >>=20 >> On 2022-Mar-7, at 08:45, Mark Johnston wrote: >>=20 >>> On Mon, Mar 07, 2022 at 04:25:22PM +0000, Andrew Turner wrote: >>>>=20 >>>>> On 7 Mar 2022, at 15:13, Mark Johnston wrote: >>>>> ... >>>>> A (the?) problem is that the compiler is treating "pc" as an alias >>>>> for x18, but the rmlock code assumes that the pcpu pointer is = loaded >>>>> once, as it dereferences "pc" outside of the critical section. On >>>>> arm64, if a context switch occurs between the store at = _rm_rlock+144 and >>>>> the load at +152, and the thread is migrated to another CPU, then = we'll >>>>> end up using the wrong CPU ID in the rm->rm_writecpus test. >>>>>=20 >>>>> I suspect the problem is unique to arm64 as its get_pcpu() >>>>> implementation is different from the others in that it doesn't use >>>>> volatile-qualified inline assembly. This has been the case since >>>>> = https://cgit.freebsd.org/src/commit/?id=3D63c858a04d56529eddbddf85ad04fc8e= 99e73762 = >>>>> . >>>>>=20 >>>>> I haven't been able to reproduce any crashes running poudriere in = an >>>>> arm64 AWS instance, though. Could you please try the patch below = and >>>>> confirm whether it fixes your panics? I verified that the = apparent >>>>> problem described above is gone with the patch. >>>>=20 >>>> Alternatively (or additionally) we could do something like the = following. There are only a few MI users of get_pcpu with the main place = being in rm locks. >>>>=20 >>>> diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h >>>> index 09f6361c651c..59b890e5c2ea 100644 >>>> --- a/sys/arm64/include/pcpu.h >>>> +++ b/sys/arm64/include/pcpu.h >>>> @@ -58,7 +58,14 @@ struct pcpu; >>>>=20 >>>> register struct pcpu *pcpup __asm ("x18"); >>>>=20 >>>> -#define get_pcpu() pcpup >>>> +static inline struct pcpu * >>>> +get_pcpu(void) >>>> +{ >>>> + struct pcpu *pcpu; >>>> + >>>> + __asm __volatile("mov %0, x18" : "=3D&r"(pcpu)); >>>> + return (pcpu); >>>> +} >>>>=20 >>>> static inline struct thread * >>>> get_curthread(void) >>>=20 >>> Indeed, I think this is probably the best solution. I=E2=80=99ve pushed the above to git in ed3066342660 & will MFC in a few = days. >=20 > Thinking a bit more, even with that patch, code like this may not = behave > the same on arm64 as on other platforms: >=20 > critical_enter(); > ptr =3D &PCPU_GET(foo); > critical_exit(); > bar =3D *ptr; >=20 > since as far as I can see the compiler may translate it to >=20 > critical_enter(); > critical_exit(); > bar =3D PCPU_GET(foo); If we think this will be a problem we could change the PCPU_PTR macro to = use get_pcpu again, however I only see two places it=E2=80=99s used in = the MI code in subr_witness.c and kern_clock.c. Neither of these appear = to be problematic from a quick look as there are no critical sections, = although I=E2=80=99m not familiar enough with the code to know for = certain. Andrew= --Apple-Mail=_27C8038E-6248-4772-9061-7457E0DFDA9B Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8

On 7 Mar 2022, at 19:04, Mark Johnston <markj@freebsd.org> = wrote:

On Mon, Mar 07, 2022 at 10:03:51AM -0800, Mark Millard = wrote:


On 2022-Mar-7, at 08:45, Mark Johnston <markj@FreeBSD.org> = wrote:

On Mon, Mar 07, 2022 at 04:25:22PM +0000, Andrew Turner = wrote:

On 7 Mar 2022, at 15:13, = Mark Johnston <markj@freebsd.org> wrote:
...
A (the?) problem is that the compiler is treating "pc" as an = alias
for x18, but the rmlock code assumes that the pcpu = pointer is loaded
once, as it dereferences "pc" outside of = the critical section.  On
arm64, if a context switch = occurs between the store at _rm_rlock+144 and
the load at = +152, and the thread is migrated to another CPU, then we'll
end up using the wrong CPU ID in the rm->rm_writecpus = test.

I suspect the problem is unique to = arm64 as its get_pcpu()
implementation is different from = the others in that it doesn't use
volatile-qualified = inline assembly.  This has been the case since
https://cgit.freebsd.org/src/commit/?id=3D63c858a04d56529eddbdd= f85ad04fc8e99e73762 <https://cgit.freebsd.org/src/commit/?id=3D63c858a04d56529eddbdd= f85ad04fc8e99e73762>
.

I= haven't been able to reproduce any crashes running poudriere in an
arm64 AWS instance, though.  Could you please try the = patch below and
confirm whether it fixes your panics? =  I verified that the apparent
problem described above = is gone with the patch.

Alternatively (or additionally) we could do something like = the following. There are only a few MI users of get_pcpu with the main = place being in rm locks.

diff --git = a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
index = 09f6361c651c..59b890e5c2ea 100644
--- = a/sys/arm64/include/pcpu.h
+++ = b/sys/arm64/include/pcpu.h
@@ -58,7 +58,14 @@ struct = pcpu;

register struct pcpu *pcpup __asm = ("x18");

-#define =        get_pcpu() =      pcpup
+static inline struct = pcpu *
+get_pcpu(void)
+{
+ =       struct pcpu *pcpu;
+
+       __asm __volatile("mov =   %0, x18" : "=3D&r"(pcpu));
+ =       return (pcpu);
+}

static inline struct thread *
get_curthread(void)

Indeed, I think this is probably the best solution.

I=E2=80=99ve pushed the above to git = in ed3066342660 & will MFC in a few days.


Thinking a bit more, even with = that patch, code like this may not behave
the same on arm64 as on other platforms:

critical_enter();
ptr =3D = &PCPU_GET(foo);
critical_exit();
bar =3D *ptr;

since as far as I can see the compiler may translate it = to

critical_enter();
critical_exit();
bar =3D PCPU_GET(foo);

If we think this will be a problem we could change = the PCPU_PTR macro to use get_pcpu again, however I only see two = places it=E2=80=99s used in the MI code in subr_witness.c = and kern_clock.c. Neither of these appear to be problematic from a = quick look as there are no critical sections, although I=E2=80=99m not = familiar enough with the code to know for certain.

Andrew
= --Apple-Mail=_27C8038E-6248-4772-9061-7457E0DFDA9B--