Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot

From: Mark Millard via freebsd-current <freebsd-current_at_freebsd.org>
Date: Sat, 29 May 2021 08:15:34 UTC
On 2021-May-23, at 00:46, Mark Millard via freebsd-current <freebsd-current at freebsd.org> wrote:

> On 2021-May-23, at 00:08, Mark Millard via freebsd-current <freebsd-current at freebsd.org> wrote:
> 
>> On 2021-May-22, at 22:16, Warner Losh <imp at bsdimp.com> wrote:
>> 
>>> On Sat, May 22, 2021 at 10:44 PM Mark Millard via freebsd-arm <freebsd-arm@freebsd.org> wrote:
>>> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
>>> # diff -r /usr/ports/ /mnt/ | more
>>> nvme0: cpl does not map to outstanding cmd
>>> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0
>>> panic: received completion for unknown cmd
>>> 
>>> cid 0x7e has no currently active command. The cid is used by the driver
>>> to map completions back to requests.
>>> 
>>> So, there's usually 3 possibilities that I've seen this with.
>>> 
>>> (1) There's a missing cache flush so you get a bogus cpl back because something stale
>>> was read. It's unlikely to be this one because the rest of this look like a successful
>>> command completed: sc = 0 is successful completion and sct is a generic command queued.
>>> 
>>> (2) We're looking at the completion record twice because we failed to properly update the
>>> head pointer and we've already completed the command. I've only ever seen this in a
>>> panic situation where we interrupt the completion routine because something else
>>> paniced.
>>> 
>>> (3) There's something that's corrupting the act_tr array in the qpair. I've not seen this,
>>> but if something else smashes that area (zeroing it in this case), then that could cause
>>> an error like this.
>> 
>> Of note may be that I buildworld and buildkernel with extra
>> tuning enabled, targeting the cortex-a72. In one past example
>> this lead to finding a missing synchronization related to XHCI
>> handling that was fixed. (The fix was not aarch64 specific at
>> all.) For that: A cortex-a53 did not show the problem with or
>> without that tuning. A cortex-a72 showed the problem only with
>> the cortex-a72 tuning, not with targeting a cortex-a53 tuning
>> or generic armv7, for example.
>> 
>> Not that I've any evidence specifically suggesting such would
>> be involved here. But it might be good to keep in mind as a
>> possaibility.
>> 
>>> Or it could be something new I've not seen nor thought about before.
>>> 
>>> cpuid = 3
>>> time = 1621743752
>>> KDB: stack backtrace:
>>> db_trace_self() at db_trace_self
>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>>> vpanic() at vpanic+0x188
>>> panic() at panic+0x44
>>> nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc
>>> nvme_timeout() at nvme_timeout+0x3c
>>> softclock_call_cc() at softclock_call_cc+0x124
>>> softclock() at softclock+0x60
>>> ithread_loop() at ithread_loop+0x2a8
>>> fork_exit() at fork_exit+0x74
>>> fork_trampoline() at fork_trampoline+0x14
>>> KDB: enter: panic
>>> [ thread pid 12 tid 100028 ]
>>> Stopped at      kdb_enter+0x48: undefined       f904411f
>>> db> 
>>> 
>>> Based on the "nvme" references, I expect this is tied to
>>> handling the Optane 480 GiByte that is in the PCIe slot
>>> and is the boot/only media for the machine doing the diff.
>>> 
>>> "db> dump" seems to have worked.
>>> 
>>> After the reboot, zpool scrub found no errors.
>>> 
>>> For reference:
>>> 
>>> # uname -apKU
>>> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72  arm64 aarch64 1400013 1400013
>>> 
>>> If you have the dump, I suggest starting to make sure that the act_tr array looks sane. Make
>>> sure all the live pointers point to a sane looking tr. Make sure that tr is on the active list, etc
>>> 
>>> It will take a fair amount of driver reading, though, to see how we got here. I'd also check to
>>> make sure that qpair->num_enttries > cpl.cid (0x3e in this case).
>>> 
>> 
>> Okay. I got this while trying to test an odd diff -r over NFS
>> issue with the more recent software. So the two will potentially
>> compete for time.
>> 
>> As investigation will be exploratory for me, not familiar, I'll
>> probably publish periodic notes on things as I go along looking
>> at stuff.
>> 
>> My first is that the /var/crash/core.txt.0 has a gdb backtrace:
>> 
>> . . .
>> #10 0xffff00000047900c in panic (
>>   fmt=0x12 <error: Cannot access memory at address 0x12>)
>>   at /usr/main-src/sys/kern/kern_shutdown.c:843
>> #11 0xffff0000002226b4 in nvme_qpair_process_completions (
>>   qpair=qpair@entry=0xffffa00008724300)
>>   at /usr/main-src/sys/dev/nvme/nvme_qpair.c:617
>> #12 0xffff000000223354 in nvme_timeout (arg=arg@entry=0xffffa0000b053980)
>>   at /usr/main-src/sys/dev/nvme/nvme_qpair.c:938
>> #13 0xffff000000495bf8 in softclock_call_cc (c=0xffffa0000b0539a0, 
>>   cc=cc@entry=0xffff000000de3500 <cc_cpu+768>, direct=0)
>>   at /usr/main-src/sys/kern/kern_timeout.c:696
>> #14 0xffff000000495fb0 in softclock (arg=0xffff000000de3500 <cc_cpu+768>)
>>   at /usr/main-src/sys/kern/kern_timeout.c:816
>> #15 0xffff0000004356dc in intr_event_execute_handlers (p=<optimized out>, 
>>   ie=0xffffa000058bc700) at /usr/main-src/sys/kern/kern_intr.c:1168
>> #16 ithread_execute_handlers (p=<optimized out>, ie=0xffffa000058bc700)
>>   at /usr/main-src/sys/kern/kern_intr.c:1181
>> #17 ithread_loop (arg=<optimized out>, arg@entry=0xffffa000058aef60)
>>   at /usr/main-src/sys/kern/kern_intr.c:1269
>> #18 0xffff000000431f6c in fork_exit (
>>   callout=0xffff000000435430 <ithread_loop>, arg=0xffffa000058aef60, 
>>   frame=0xffff0000eb7cc990) at /usr/main-src/sys/kern/kern_fork.c:1083
>> #19 <signal handler called>
>> 
>> So via kgdb . . .
>> 
>> (kgdb) up 11
>> #11 0xffff0000002226b4 in nvme_qpair_process_completions (qpair=qpair@entry=0xffffa00008724300) at /usr/main-src/sys/dev/nvme/nvme_qpair.c:617
>> 617				KASSERT(0, ("received completion for unknown cmd"));
>> 
>> (kgdb) print/x cpl.cid
>> $4 = 0x7e
>> (kgdb) print/x qpair->num_entries
>> $5 = 0x100
>> 
>> Based on also seeing the code:
>> 
>>       qpair->act_tr = malloc_domainset(sizeof(struct nvme_tracker *) *
>>           qpair->num_entries, M_NVME, DOMAINSET_PREF(qpair->domain),
>>           M_ZERO | M_WAITOK);
>> 
>> (kgdb) print qpair->act_tr
>> $6 = (struct nvme_tracker **) 0xffffa00008725800
>> (kgdb) x/256g 0xffffa00008725800
>> 0xffffa00008725800:	0x0000000000000000	0x0000000000000000
>> 0xffffa00008725810:	0x0000000000000000	0x0000000000000000
>> . . .
>> 0xffffa00008725fe0:	0x0000000000000000	0x0000000000000000
>> 0xffffa00008725ff0:	0x0000000000000000	0x0000000000000000
>> 
>> It was all zeros (null pointers). No "live" pointers and, so,
>> no tr's to inspect.
>> 
>> As none of this is familiar context beyond general programming
>> concepts, it may be some time before I find anything else
>> potentially of interest to report. If you have other specific
>> things you would like me to look at, let me know.
>> 
> 
> A fairly obvious thing I should have provided:
> 
> (kgdb) print/x *qpair
> $15 = {ctrlr = 0xffff0000fe154000, id = 0x3, domain = 0x0, cpu = 0x2, vector = 0x3, rid = 0x4, res = 0xffffa000086ded80, tag = 0xffffa0000877b780, num_entries = 0x100, num_trackers = 0x80, 
>  sq_tdbl_off = 0x1018, cq_hdbl_off = 0x101c, phase = 0x1, sq_head = 0x1f, sq_tail = 0x20, cq_head = 0x20, num_cmds = 0x420, num_intr_handler_calls = 0xe66c, num_retries = 0x0, num_failures = 0x0, 
>  cmd = 0xffff000100ebb000, cpl = 0xffff000100ebf000, dma_tag = 0xffffa0000b093e00, dma_tag_payload = 0xffffa000059ef000, queuemem_map = 0xffffa00005a07700, cmd_bus_addr = 0xacbb000, 
>  cpl_bus_addr = 0xacbf000, free_tr = {tqh_first = 0xffffa0000b053a80, tqh_last = 0xffffa0000869da80}, outstanding_tr = {tqh_first = 0xffffa0000b053980, tqh_last = 0xffffa0000b053980}, queued_req = {
>    stqh_first = 0x0, stqh_last = 0xffffa000087243c8}, act_tr = 0xffffa00008725800, is_enabled = 0x1, lock = {lock_object = {lo_name = 0xffff00000090321f, lo_flags = 0x1030000, lo_data = 0x0, 
>      lo_witness = 0xffffa0043fd96080}, mtx_lock = 0x0}}
> 
> Looks like I need to boot into the non-debug builds for the
> other problem I'm testing for repeatability after a commit.


I've no figured out anything interesting so far.

But I've run into something that looks odd to me
(not that I've any evidence it is related to the
panic, more likely my ignorance):

There is a use of atomic_store_rel_int(&qpair->cq_head, 0)
for which I do not find any matching atomic_load_acq_int
use (or other explicit _acq), so so there is no "synchronizes
with" status in the code to establish an ordering across
threads that involves the atomic_store_rel_int that I've
found, just implicit/default relaxed-load-operations. A grep
for "cq_head" under dev/nvme/ shows only:

./dev/nvme/nvme_private.h:      uint32_t                cq_head;
./dev/nvme/nvme_sysctl.c:       SYSCTL_ADD_UINT(ctrlr_ctx, que_list, OID_AUTO, "cq_head",
./dev/nvme/nvme_sysctl.c:           CTLFLAG_RD, &qpair->cq_head, 0,
./dev/nvme/nvme_qpair.c:         * below, but before we can reset cq_head to zero at 2. Also cope with
./dev/nvme/nvme_qpair.c:                if (qpair->cq_head == qpair->num_entries) {
./dev/nvme/nvme_qpair.c:                         * Here we know that we need to zero cq_head and then negate
./dev/nvme/nvme_qpair.c:                         * the phase, which hasn't been assigned if cq_head isn't
./dev/nvme/nvme_qpair.c:                        qpair->cq_head = 0;
./dev/nvme/nvme_qpair.c:                } else if (qpair->cq_head == 0) {
./dev/nvme/nvme_qpair.c:                cpl = qpair->cpl[qpair->cq_head];
./dev/nvme/nvme_qpair.c:                         * qpair->cq_head at 1 below.  Later, we re-enter this
./dev/nvme/nvme_qpair.c:                         * won't have updated cq_head. Rather than panic again,
./dev/nvme/nvme_qpair.c:                        nvme_dump_completion(&qpair->cpl[qpair->cq_head]);
./dev/nvme/nvme_qpair.c:                if (++qpair->cq_head == qpair->num_entries) {           /* 1 */
./dev/nvme/nvme_qpair.c:                        atomic_store_rel_int(&qpair->cq_head, 0);       /* 2 */
./dev/nvme/nvme_qpair.c:                    qpair->cq_hdbl_off, qpair->cq_head);
./dev/nvme/nvme_qpair.c:        qpair->sq_head = qpair->sq_tail = qpair->cq_head = 0;

(2 lines above the last has the atomic_store_rel_int use.)

atomic_thread_fence_rel use would have "synchronizes
with" based on ordinary loads reading something stored
after the atomic_thread_fence_rel. Such is documented
in "man atomic". But that is not what this code is doing.
"man atomic" does not mention ordinary loads getting such
a status by reading what an atomic_store_rel_int wrote.
It does reference the atomic_thread_fence_rel related
status for ordinary loads.

So I'm clueless about what is intended to be going on
relative to that "atomic_store_rel_int(&qpair->cq_head, 0)".
Overall, the code does not appear to me to match up with the
aarch64, powerpc64, or powerpc code generation requirements
to have any matching "synchronizes with" relationships.
(I'll not list machine instruction sequences here. I'd have
to look up the detailed sequences. But, as I remember, more
than a load is involved in the code sequences for the
acquire side on these types of processors. Nothing in the
source indicates to generate the additional code as far as
I can tell.)

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)