From nobody Sun May 23 05:16:02 2021 X-Original-To: freebsd-current@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 8BDD29E2C30 for ; Sun, 23 May 2021 05:16:14 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: from mail-qt1-x834.google.com (mail-qt1-x834.google.com [IPv6:2607:f8b0:4864:20::834]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1O1" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4FnpTp3NgYz4nM7 for ; Sun, 23 May 2021 05:16:14 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: by mail-qt1-x834.google.com with SMTP id g8so10858054qtp.4 for ; Sat, 22 May 2021 22:16:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bsdimp-com.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=SvAMoHMEhRAJ8ppLYI+d4H+hpNUZbR1nIyYxsviDHYw=; b=u0QacXdc47LNhYULjhq03U4Sk5T8dBvpsSSgDN0rG3xUy6ofvZsB2RfGwc2nmHOVT0 yV8D4YCM+H6ToUyPLCI/eoiV/rVKKCHwRAqQ+FjjEMfzBiR7nk1CL4wP0QQCmJgJPJyI iqq9/ucN8AFtcvBHFmIH5OFDse6ZROqf69ssIKo2yUirvdoRsFcxHKNvwnYT882BoM2g 17tJehrN80JagRHUkt4hwMImlo8WhF8OjphqcBHcL7as/KbLBWOu8IB7Xss23uUiZNOU afTdz9z2lsfGK8u+S3/zYfPQ74CI2afQur/S/aG0FEagCaWgmHK7VFgponhvMzruNNd6 1IBg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=SvAMoHMEhRAJ8ppLYI+d4H+hpNUZbR1nIyYxsviDHYw=; b=iuvhDwhyrKO2lrq6TohEV0mhb/qgVNXwZ4A2yBv585cvYGXXciumu16s8kTuQkfOQp vqb8YweGLMSrW0x/Zdc4MC84HbOJVEtFynskXvmWC5JpLm6ZhIeBASouFK0skkN6xEPF Id8mtOFo3Fz+jiVrJ5FGaZ0epBQ3cFaD8Jn5Bkk93trB8GRA5SHbUH5ZgfqxjLIImz82 wlySQiapAdzbjw6nVtRnHsD3HYHwDqss31lfMPulsUDuW0i5+8XlGFmvl1z49p5Y5eRB Elomh7PNAeHiCsDL41F5l1AMWCIQCSTel+xQoNF26MC0IiwYHYEi7eK6G0g1Us0m17lz TxPg== X-Gm-Message-State: AOAM531kzoLJQnKFo3wQs5hdCEFs0iDLu4iewrb4tYDOcudL5jap5zah HpHymk/DQwvO91VElgLBhquMXuVRKcHk3G8448mi5g== X-Google-Smtp-Source: ABdhPJwBiXieBiGdGe56bKJR7j0tD3HXlMl7talNfyMNUVHpYnJV7ehZUnbJ7ls/3RZ2bXsFFAZEvKAImpqMjQ7NRCE= X-Received: by 2002:ac8:580b:: with SMTP id g11mr20068335qtg.244.1621746973337; Sat, 22 May 2021 22:16:13 -0700 (PDT) List-Id: Discussions about the use of FreeBSD-current List-Archive: https://lists.freebsd.org/archives/freebsd-current List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-current@freebsd.org MIME-Version: 1.0 References: <063D5E36-126F-497C-97AF-827BADC1ED2F.ref@yahoo.com> <063D5E36-126F-497C-97AF-827BADC1ED2F@yahoo.com> In-Reply-To: <063D5E36-126F-497C-97AF-827BADC1ED2F@yahoo.com> From: Warner Losh Date: Sat, 22 May 2021 23:16:02 -0600 Message-ID: Subject: Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot To: Mark Millard Cc: freebsd-current , freebsd-arm Content-Type: multipart/alternative; boundary="00000000000042ff8605c2f864e8" X-Rspamd-Queue-Id: 4FnpTp3NgYz4nM7 X-Spamd-Bar: ---- Authentication-Results: mx1.freebsd.org; none X-Spamd-Result: default: False [-4.00 / 15.00]; REPLY(-4.00)[] --00000000000042ff8605c2f864e8 Content-Type: text/plain; charset="UTF-8" On Sat, May 22, 2021 at 10:44 PM Mark Millard via freebsd-arm < freebsd-arm@freebsd.org> wrote: > # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/ > # diff -r /usr/ports/ /mnt/ | more > nvme0: cpl does not map to outstanding cmd > cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0 > panic: received completion for unknown cmd > cid 0x7e has no currently active command. The cid is used by the driver to map completions back to requests. So, there's usually 3 possibilities that I've seen this with. (1) There's a missing cache flush so you get a bogus cpl back because something stale was read. It's unlikely to be this one because the rest of this look like a successful command completed: sc = 0 is successful completion and sct is a generic command queued. (2) We're looking at the completion record twice because we failed to properly update the head pointer and we've already completed the command. I've only ever seen this in a panic situation where we interrupt the completion routine because something else paniced. (3) There's something that's corrupting the act_tr array in the qpair. I've not seen this, but if something else smashes that area (zeroing it in this case), then that could cause an error like this. Or it could be something new I've not seen nor thought about before. > cpuid = 3 > time = 1621743752 > KDB: stack backtrace: > db_trace_self() at db_trace_self > db_trace_self_wrapper() at db_trace_self_wrapper+0x30 > vpanic() at vpanic+0x188 > panic() at panic+0x44 > nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc > nvme_timeout() at nvme_timeout+0x3c > softclock_call_cc() at softclock_call_cc+0x124 > softclock() at softclock+0x60 > ithread_loop() at ithread_loop+0x2a8 > fork_exit() at fork_exit+0x74 > fork_trampoline() at fork_trampoline+0x14 > KDB: enter: panic > [ thread pid 12 tid 100028 ] > Stopped at kdb_enter+0x48: undefined f904411f > db> > > Based on the "nvme" references, I expect this is tied to > handling the Optane 480 GiByte that is in the PCIe slot > and is the boot/only media for the machine doing the diff. > > "db> dump" seems to have worked. > > After the reboot, zpool scrub found no errors. > > For reference: > > # uname -apKU > FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 > main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021 > root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72 > arm64 aarch64 1400013 1400013 > If you have the dump, I suggest starting to make sure that the act_tr array looks sane. Make sure all the live pointers point to a sane looking tr. Make sure that tr is on the active list, etc It will take a fair amount of driver reading, though, to see how we got here. I'd also check to make sure that qpair->num_enttries > cpl.cid (0x3e in this case). Warner --00000000000042ff8605c2f864e8 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


=
On Sat, May 22, 2021 at 10:44 PM Mark= Millard via freebsd-arm <fre= ebsd-arm@freebsd.org> wrote:
# mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
# diff -r /usr/ports/ /mnt/ | more
nvme0: cpl does not map to outstanding cmd
cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0
panic: received completion for unknown cmd

<= div>cid 0x7e has no currently active command. The cid is used by the driver=
to map completions=C2=A0back to requests.

So, there's usually 3 possibilities that I've seen this with.

(1) There's a missing cache flush so you get a b= ogus cpl back because something=C2=A0stale
was read. It's unl= ikely=C2=A0to be this one because the rest of this look like a successful
command completed: sc =3D 0 is successful completion and sct is a = generic command queued.

(2) We're looking at t= he completion record twice because=C2=A0we failed to properly update the
head pointer and we've already completed the command. I've = only ever seen this in a
panic situation where we interrupt the c= ompletion routine because something else
paniced.

<= /div>
(3) There's something that's corrupting the act_tr array = in the qpair. I've not seen this,
but if something else smash= es that area (zeroing it in this case), then that could cause
an = error like this.

Or it could be something new I= 9;ve not seen nor thought about before.
=C2=A0
cpuid =3D 3
time =3D 1621743752
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x188
panic() at panic+0x44
nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc nvme_timeout() at nvme_timeout+0x3c
softclock_call_cc() at softclock_call_cc+0x124
softclock() at softclock+0x60
ithread_loop() at ithread_loop+0x2a8
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 12 tid 100028 ]
Stopped at=C2=A0 =C2=A0 =C2=A0 kdb_enter+0x48: undefined=C2=A0 =C2=A0 =C2= =A0 =C2=A0f904411f
db>

Based on the "nvme" references, I expect this is tied to
handling the Optane 480 GiByte that is in the PCIe slot
and is the boot/only media for the machine doing the diff.

"db> dump" seems to have worked.

After the reboot, zpool scrub found no errors.

For reference:

# uname -apKU
FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 main-n246854-03b= 0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021=C2=A0 =C2=A0 =C2=A0root@CA72_= 16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys= /GENERIC-DBG-CA72=C2=A0 arm64 aarch64 1400013 1400013
=
If you have the dump, I suggest starting to make sure that t= he act_tr array looks sane. Make
sure all the live pointers point= to a sane looking tr. Make sure that tr is on the active list, etc

It will take a fair amount of driver reading, though, to = see how we got here. I'd also check to
make sure that qpair-&= gt;num_enttries > cpl.cid (0x3e in this case).

= Warner
--00000000000042ff8605c2f864e8--