Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot

From: Mark Millard via freebsd-current <freebsd-current_at_freebsd.org>
Date: Sun, 23 May 2021 07:08:20 UTC
On 2021-May-22, at 22:16, Warner Losh <imp at bsdimp.com> wrote:

> On Sat, May 22, 2021 at 10:44 PM Mark Millard via freebsd-arm <freebsd-arm@freebsd.org> wrote:
> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
> # diff -r /usr/ports/ /mnt/ | more
> nvme0: cpl does not map to outstanding cmd
> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0
> panic: received completion for unknown cmd
> 
> cid 0x7e has no currently active command. The cid is used by the driver
> to map completions back to requests.
> 
> So, there's usually 3 possibilities that I've seen this with.
> 
> (1) There's a missing cache flush so you get a bogus cpl back because something stale
> was read. It's unlikely to be this one because the rest of this look like a successful
> command completed: sc = 0 is successful completion and sct is a generic command queued.
> 
> (2) We're looking at the completion record twice because we failed to properly update the
> head pointer and we've already completed the command. I've only ever seen this in a
> panic situation where we interrupt the completion routine because something else
> paniced.
> 
> (3) There's something that's corrupting the act_tr array in the qpair. I've not seen this,
> but if something else smashes that area (zeroing it in this case), then that could cause
> an error like this.

Of note may be that I buildworld and buildkernel with extra
tuning enabled, targeting the cortex-a72. In one past example
this lead to finding a missing synchronization related to XHCI
handling that was fixed. (The fix was not aarch64 specific at
all.) For that: A cortex-a53 did not show the problem with or
without that tuning. A cortex-a72 showed the problem only with
the cortex-a72 tuning, not with targeting a cortex-a53 tuning
or generic armv7, for example.

Not that I've any evidence specifically suggesting such would
be involved here. But it might be good to keep in mind as a
possaibility.

> Or it could be something new I've not seen nor thought about before.
>  
> cpuid = 3
> time = 1621743752
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x188
> panic() at panic+0x44
> nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc
> nvme_timeout() at nvme_timeout+0x3c
> softclock_call_cc() at softclock_call_cc+0x124
> softclock() at softclock+0x60
> ithread_loop() at ithread_loop+0x2a8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 12 tid 100028 ]
> Stopped at      kdb_enter+0x48: undefined       f904411f
> db> 
> 
> Based on the "nvme" references, I expect this is tied to
> handling the Optane 480 GiByte that is in the PCIe slot
> and is the boot/only media for the machine doing the diff.
> 
> "db> dump" seems to have worked.
> 
> After the reboot, zpool scrub found no errors.
> 
> For reference:
> 
> # uname -apKU
> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72  arm64 aarch64 1400013 1400013
> 
> If you have the dump, I suggest starting to make sure that the act_tr array looks sane. Make
> sure all the live pointers point to a sane looking tr. Make sure that tr is on the active list, etc
> 
> It will take a fair amount of driver reading, though, to see how we got here. I'd also check to
> make sure that qpair->num_enttries > cpl.cid (0x3e in this case).
> 

Okay. I got this while trying to test an odd diff -r over NFS
issue with the more recent software. So the two will potentially
compete for time.

As investigation will be exploratory for me, not familiar, I'll
probably publish periodic notes on things as I go along looking
at stuff.

My first is that the /var/crash/core.txt.0 has a gdb backtrace:

. . .
#10 0xffff00000047900c in panic (
    fmt=0x12 <error: Cannot access memory at address 0x12>)
    at /usr/main-src/sys/kern/kern_shutdown.c:843
#11 0xffff0000002226b4 in nvme_qpair_process_completions (
    qpair=qpair@entry=0xffffa00008724300)
    at /usr/main-src/sys/dev/nvme/nvme_qpair.c:617
#12 0xffff000000223354 in nvme_timeout (arg=arg@entry=0xffffa0000b053980)
    at /usr/main-src/sys/dev/nvme/nvme_qpair.c:938
#13 0xffff000000495bf8 in softclock_call_cc (c=0xffffa0000b0539a0, 
    cc=cc@entry=0xffff000000de3500 <cc_cpu+768>, direct=0)
    at /usr/main-src/sys/kern/kern_timeout.c:696
#14 0xffff000000495fb0 in softclock (arg=0xffff000000de3500 <cc_cpu+768>)
    at /usr/main-src/sys/kern/kern_timeout.c:816
#15 0xffff0000004356dc in intr_event_execute_handlers (p=<optimized out>, 
    ie=0xffffa000058bc700) at /usr/main-src/sys/kern/kern_intr.c:1168
#16 ithread_execute_handlers (p=<optimized out>, ie=0xffffa000058bc700)
    at /usr/main-src/sys/kern/kern_intr.c:1181
#17 ithread_loop (arg=<optimized out>, arg@entry=0xffffa000058aef60)
    at /usr/main-src/sys/kern/kern_intr.c:1269
#18 0xffff000000431f6c in fork_exit (
    callout=0xffff000000435430 <ithread_loop>, arg=0xffffa000058aef60, 
    frame=0xffff0000eb7cc990) at /usr/main-src/sys/kern/kern_fork.c:1083
#19 <signal handler called>

So via kgdb . . .

(kgdb) up 11
#11 0xffff0000002226b4 in nvme_qpair_process_completions (qpair=qpair@entry=0xffffa00008724300) at /usr/main-src/sys/dev/nvme/nvme_qpair.c:617
617				KASSERT(0, ("received completion for unknown cmd"));

(kgdb) print/x cpl.cid
$4 = 0x7e
(kgdb) print/x qpair->num_entries
$5 = 0x100

Based on also seeing the code:

        qpair->act_tr = malloc_domainset(sizeof(struct nvme_tracker *) *
            qpair->num_entries, M_NVME, DOMAINSET_PREF(qpair->domain),
            M_ZERO | M_WAITOK);

(kgdb) print qpair->act_tr
$6 = (struct nvme_tracker **) 0xffffa00008725800
(kgdb) x/256g 0xffffa00008725800
0xffffa00008725800:	0x0000000000000000	0x0000000000000000
0xffffa00008725810:	0x0000000000000000	0x0000000000000000
. . .
0xffffa00008725fe0:	0x0000000000000000	0x0000000000000000
0xffffa00008725ff0:	0x0000000000000000	0x0000000000000000

It was all zeros (null pointers). No "live" pointers and, so,
no tr's to inspect.

As none of this is familiar context beyond general programming
concepts, it may be some time before I find anything else
potentially of interest to report. If you have other specific
things you would like me to look at, let me know.


===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)