Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot
- Reply: Mark Millard via freebsd-current : "Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot"
- In reply to: Mark Millard via freebsd-arm : "I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot"
Date: Sun, 23 May 2021 05:16:02 UTC
On Sat, May 22, 2021 at 10:44 PM Mark Millard via freebsd-arm
<firstname.lastname@example.org> wrote:

> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
> # diff -r /usr/ports/ /mnt/ | more
> nvme0: cpl does not map to outstanding cmd
> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0
> panic: received completion for unknown cmd

cid 0x7e has no currently active command. The cid is used by the driver to
map completions back to requests. There are usually three possibilities
I've seen this with:

(1) There's a missing cache flush, so you get a bogus cpl back because
something stale was read. It's unlikely to be this one, because the rest of
this looks like a successful command completed: sc = 0 is a successful
completion and sct = 0 is the generic command status type.

(2) We're looking at the completion record twice because we failed to
properly update the head pointer and we've already completed the command.
I've only ever seen this in a panic situation, where we interrupt the
completion routine because something else panicked.

(3) Something is corrupting the act_tr array in the qpair. I've not seen
this, but if something else smashes that area (zeroing it in this case),
then that could cause an error like this.

Or it could be something new I've not seen nor thought about before.
> cpuid = 3
> time = 1621743752
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x188
> panic() at panic+0x44
> nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc
> nvme_timeout() at nvme_timeout+0x3c
> softclock_call_cc() at softclock_call_cc+0x124
> softclock() at softclock+0x60
> ithread_loop() at ithread_loop+0x2a8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 12 tid 100028 ]
> Stopped at kdb_enter+0x48: undefined f904411f
> db>
>
> Based on the "nvme" references, I expect this is tied to
> handling the Optane 480 GiByte that is in the PCIe slot
> and is the boot/only media for the machine doing the diff.
>
> "db> dump" seems to have worked.
>
> After the reboot, zpool scrub found no errors.
>
> For reference:
>
> # uname -apKU
> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1
> main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021
> root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72
> arm64 aarch64 1400013 1400013

If you have the dump, I suggest starting by making sure that the act_tr
array looks sane: check that all the live pointers point to a sane-looking
tr, that each such tr is on the active list, etc. It will take a fair
amount of driver reading, though, to see how we got here. I'd also check
that qpair->num_entries > cpl.cid (0x7e in this case).

Warner