From nobody Sun May 23 05:16:02 2021
X-Original-To: freebsd-current@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 8BDD29E2C30
	for <freebsd-current@mlmmj.nyi.freebsd.org>; Sun, 23 May 2021 05:16:14 +0000 (UTC)
	(envelope-from wlosh@bsdimp.com)
Received: from mail-qt1-x834.google.com (mail-qt1-x834.google.com [IPv6:2607:f8b0:4864:20::834])
	(using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
	 client-signature RSA-PSS (2048 bits) client-digest SHA256)
	(Client CN "smtp.gmail.com", Issuer "GTS CA 1O1" (verified OK))
	by mx1.freebsd.org (Postfix) with ESMTPS id 4FnpTp3NgYz4nM7
	for <freebsd-current@freebsd.org>; Sun, 23 May 2021 05:16:14 +0000 (UTC)
	(envelope-from wlosh@bsdimp.com)
Received: by mail-qt1-x834.google.com with SMTP id g8so10858054qtp.4
        for <freebsd-current@freebsd.org>; Sat, 22 May 2021 22:16:14 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=bsdimp-com.20150623.gappssmtp.com; s=20150623;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=SvAMoHMEhRAJ8ppLYI+d4H+hpNUZbR1nIyYxsviDHYw=;
        b=u0QacXdc47LNhYULjhq03U4Sk5T8dBvpsSSgDN0rG3xUy6ofvZsB2RfGwc2nmHOVT0
         yV8D4YCM+H6ToUyPLCI/eoiV/rVKKCHwRAqQ+FjjEMfzBiR7nk1CL4wP0QQCmJgJPJyI
         iqq9/ucN8AFtcvBHFmIH5OFDse6ZROqf69ssIKo2yUirvdoRsFcxHKNvwnYT882BoM2g
         17tJehrN80JagRHUkt4hwMImlo8WhF8OjphqcBHcL7as/KbLBWOu8IB7Xss23uUiZNOU
         afTdz9z2lsfGK8u+S3/zYfPQ74CI2afQur/S/aG0FEagCaWgmHK7VFgponhvMzruNNd6
         1IBg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=SvAMoHMEhRAJ8ppLYI+d4H+hpNUZbR1nIyYxsviDHYw=;
        b=iuvhDwhyrKO2lrq6TohEV0mhb/qgVNXwZ4A2yBv585cvYGXXciumu16s8kTuQkfOQp
         vqb8YweGLMSrW0x/Zdc4MC84HbOJVEtFynskXvmWC5JpLm6ZhIeBASouFK0skkN6xEPF
         Id8mtOFo3Fz+jiVrJ5FGaZ0epBQ3cFaD8Jn5Bkk93trB8GRA5SHbUH5ZgfqxjLIImz82
         wlySQiapAdzbjw6nVtRnHsD3HYHwDqss31lfMPulsUDuW0i5+8XlGFmvl1z49p5Y5eRB
         Elomh7PNAeHiCsDL41F5l1AMWCIQCSTel+xQoNF26MC0IiwYHYEi7eK6G0g1Us0m17lz
         TxPg==
X-Gm-Message-State: AOAM531kzoLJQnKFo3wQs5hdCEFs0iDLu4iewrb4tYDOcudL5jap5zah
	HpHymk/DQwvO91VElgLBhquMXuVRKcHk3G8448mi5g==
X-Google-Smtp-Source: ABdhPJwBiXieBiGdGe56bKJR7j0tD3HXlMl7talNfyMNUVHpYnJV7ehZUnbJ7ls/3RZ2bXsFFAZEvKAImpqMjQ7NRCE=
X-Received: by 2002:ac8:580b:: with SMTP id g11mr20068335qtg.244.1621746973337;
 Sat, 22 May 2021 22:16:13 -0700 (PDT)
List-Id: Discussions about the use of FreeBSD-current <freebsd-current.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-current
List-Help: <mailto:freebsd-current+help@freebsd.org>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Subscribe: <mailto:freebsd-current+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-current+unsubscribe@freebsd.org>
Sender: owner-freebsd-current@freebsd.org
MIME-Version: 1.0
References: <063D5E36-126F-497C-97AF-827BADC1ED2F.ref@yahoo.com> <063D5E36-126F-497C-97AF-827BADC1ED2F@yahoo.com>
In-Reply-To: <063D5E36-126F-497C-97AF-827BADC1ED2F@yahoo.com>
From: Warner Losh <imp@bsdimp.com>
Date: Sat, 22 May 2021 23:16:02 -0600
Message-ID: <CANCZdfqcrkC=VAuvB_i=sR=7rp_e8KeVX=y2apXj7E5De9eG8g@mail.gmail.com>
Subject: Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on
 a MACHIATObin Double Shot
To: Mark Millard <marklmi@yahoo.com>
Cc: freebsd-current <freebsd-current@freebsd.org>, freebsd-arm <freebsd-arm@freebsd.org>
Content-Type: multipart/alternative; boundary="00000000000042ff8605c2f864e8"
X-Rspamd-Queue-Id: 4FnpTp3NgYz4nM7
X-Spamd-Bar: ----
Authentication-Results: mx1.freebsd.org;
	none
X-Spamd-Result: default: False [-4.00 / 15.00];
	 REPLY(-4.00)[]

--00000000000042ff8605c2f864e8
Content-Type: text/plain; charset="UTF-8"

On Sat, May 22, 2021 at 10:44 PM Mark Millard via freebsd-arm <
freebsd-arm@freebsd.org> wrote:

> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
> # diff -r /usr/ports/ /mnt/ | more
> nvme0: cpl does not map to outstanding cmd
> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0
> panic: received completion for unknown cmd
>

cid 0x7e has no currently active command. The cid is used by the driver
to map completions back to requests.

So, there's usually 3 possibilities that I've seen this with.

(1) There's a missing cache flush so you get a bogus cpl back because
something stale
was read. It's unlikely to be this one because the rest of this look like a
successful
command completed: sc = 0 is successful completion and sct is a generic
command queued.

(2) We're looking at the completion record twice because we failed to
properly update the
head pointer and we've already completed the command. I've only ever seen
this in a
panic situation where we interrupt the completion routine because something
else
paniced.

(3) There's something that's corrupting the act_tr array in the qpair. I've
not seen this,
but if something else smashes that area (zeroing it in this case), then
that could cause
an error like this.

Or it could be something new I've not seen nor thought about before.


> cpuid = 3
> time = 1621743752
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x188
> panic() at panic+0x44
> nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc
> nvme_timeout() at nvme_timeout+0x3c
> softclock_call_cc() at softclock_call_cc+0x124
> softclock() at softclock+0x60
> ithread_loop() at ithread_loop+0x2a8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 12 tid 100028 ]
> Stopped at      kdb_enter+0x48: undefined       f904411f
> db>
>
> Based on the "nvme" references, I expect this is tied to
> handling the Optane 480 GiByte that is in the PCIe slot
> and is the boot/only media for the machine doing the diff.
>
> "db> dump" seems to have worked.
>
> After the reboot, zpool scrub found no errors.
>
> For reference:
>
> # uname -apKU
> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1
> main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021
>  root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72
> arm64 aarch64 1400013 1400013
>

If you have the dump, I suggest starting to make sure that the act_tr array
looks sane. Make
sure all the live pointers point to a sane looking tr. Make sure that tr is
on the active list, etc

It will take a fair amount of driver reading, though, to see how we got
here. I'd also check to
make sure that qpair->num_enttries > cpl.cid (0x3e in this case).

Warner

--00000000000042ff8605c2f864e8
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div dir=3D"ltr"><br></div><br><div class=3D"gmail_quote">=
<div dir=3D"ltr" class=3D"gmail_attr">On Sat, May 22, 2021 at 10:44 PM Mark=
 Millard via freebsd-arm &lt;<a href=3D"mailto:freebsd-arm@freebsd.org">fre=
ebsd-arm@freebsd.org</a>&gt; wrote:<br></div><blockquote class=3D"gmail_quo=
te" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204=
);padding-left:1ex"># mount -onoatime 192.168.1.187:/usr/ports/ /mnt/<br>
# diff -r /usr/ports/ /mnt/ | more<br>
nvme0: cpl does not map to outstanding cmd<br>
cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0<br>
panic: received completion for unknown cmd<br></blockquote><div><br></div><=
div>cid 0x7e has no currently active command. The cid is used by the driver=
</div><div>to map completions=C2=A0back to requests.</div><div><br></div><d=
iv>So, there&#39;s usually 3 possibilities that I&#39;ve seen this with.</d=
iv><div><br></div><div>(1) There&#39;s a missing cache flush so you get a b=
ogus cpl back because something=C2=A0stale</div><div>was read. It&#39;s unl=
ikely=C2=A0to be this one because the rest of this look like a successful</=
div><div>command completed: sc =3D 0 is successful completion and sct is a =
generic command queued.</div><div><br></div><div>(2) We&#39;re looking at t=
he completion record twice because=C2=A0we failed to properly update the</d=
iv><div>head pointer and we&#39;ve already completed the command. I&#39;ve =
only ever seen this in a</div><div>panic situation where we interrupt the c=
ompletion routine because something else</div><div>paniced.</div><div><br><=
/div><div>(3) There&#39;s something that&#39;s corrupting the act_tr array =
in the qpair. I&#39;ve not seen this,</div><div>but if something else smash=
es that area (zeroing it in this case), then that could cause</div><div>an =
error like this.</div><div><br></div><div>Or it could be something new I=
9;ve not seen nor thought about before.</div><div>=C2=A0<br></div><blockquo=
te class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px =
solid rgb(204,204,204);padding-left:1ex">
cpuid =3D 3<br>
time =3D 1621743752<br>
KDB: stack backtrace:<br>
db_trace_self() at db_trace_self<br>
db_trace_self_wrapper() at db_trace_self_wrapper+0x30<br>
vpanic() at vpanic+0x188<br>
panic() at panic+0x44<br>
nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc<br=
>
nvme_timeout() at nvme_timeout+0x3c<br>
softclock_call_cc() at softclock_call_cc+0x124<br>
softclock() at softclock+0x60<br>
ithread_loop() at ithread_loop+0x2a8<br>
fork_exit() at fork_exit+0x74<br>
fork_trampoline() at fork_trampoline+0x14<br>
KDB: enter: panic<br>
[ thread pid 12 tid 100028 ]<br>
Stopped at=C2=A0 =C2=A0 =C2=A0 kdb_enter+0x48: undefined=C2=A0 =C2=A0 =C2=
=A0 =C2=A0f904411f<br>
db&gt; <br>
<br>
Based on the &quot;nvme&quot; references, I expect this is tied to<br>
handling the Optane 480 GiByte that is in the PCIe slot<br>
and is the boot/only media for the machine doing the diff.<br>
<br>
&quot;db&gt; dump&quot; seems to have worked.<br>
<br>
After the reboot, zpool scrub found no errors.<br>
<br>
For reference:<br>
<br>
# uname -apKU<br>
FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 main-n246854-03b=
0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021=C2=A0 =C2=A0 =C2=A0root@CA72_=
16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys=
/GENERIC-DBG-CA72=C2=A0 arm64 aarch64 1400013 1400013<br></blockquote><div>=
<br></div><div>If you have the dump, I suggest starting to make sure that t=
he act_tr array looks sane. Make</div><div>sure all the live pointers point=
 to a sane looking tr. Make sure that tr is on the active list, etc</div><d=
iv><br></div><div>It will take a fair amount of driver reading, though, to =
see how we got here. I&#39;d also check to</div><div>make sure that qpair-&=
gt;num_enttries &gt; cpl.cid (0x3e in this case).</div><div><br></div><div>=
Warner</div></div></div>

--00000000000042ff8605c2f864e8--