Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot

From: Warner Losh <imp_at_bsdimp.com>
Date: Sun, 23 May 2021 05:16:02 UTC
On Sat, May 22, 2021 at 10:44 PM Mark Millard via freebsd-arm <
freebsd-arm@freebsd.org> wrote:

> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
> # diff -r /usr/ports/ /mnt/ | more
> nvme0: cpl does not map to outstanding cmd
> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0
> panic: received completion for unknown cmd
>

cid 0x7e has no currently active command. The cid is used by the driver
to map completions back to requests.

So, there's usually 3 possibilities that I've seen this with.

(1) There's a missing cache flush so you get a bogus cpl back because
something stale
was read. It's unlikely to be this one because the rest of this look like a
successful
command completed: sc = 0 is successful completion and sct is a generic
command queued.

(2) We're looking at the completion record twice because we failed to
properly update the
head pointer and we've already completed the command. I've only ever seen
this in a
panic situation where we interrupt the completion routine because something
else
paniced.

(3) There's something that's corrupting the act_tr array in the qpair. I've
not seen this,
but if something else smashes that area (zeroing it in this case), then
that could cause
an error like this.

Or it could be something new I've not seen nor thought about before.


> cpuid = 3
> time = 1621743752
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x188
> panic() at panic+0x44
> nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc
> nvme_timeout() at nvme_timeout+0x3c
> softclock_call_cc() at softclock_call_cc+0x124
> softclock() at softclock+0x60
> ithread_loop() at ithread_loop+0x2a8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 12 tid 100028 ]
> Stopped at      kdb_enter+0x48: undefined       f904411f
> db>
>
> Based on the "nvme" references, I expect this is tied to
> handling the Optane 480 GiByte that is in the PCIe slot
> and is the boot/only media for the machine doing the diff.
>
> "db> dump" seems to have worked.
>
> After the reboot, zpool scrub found no errors.
>
> For reference:
>
> # uname -apKU
> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1
> main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021
>  root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72
> arm64 aarch64 1400013 1400013
>

If you have the dump, I suggest starting to make sure that the act_tr array
looks sane. Make
sure all the live pointers point to a sane looking tr. Make sure that tr is
on the active list, etc

It will take a fair amount of driver reading, though, to see how we got
here. I'd also check to
make sure that qpair->num_enttries > cpl.cid (0x3e in this case).

Warner