Subject: Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot
From: Mark Millard via freebsd-current
Reply-To: marklmi@yahoo.com
Date: Sun, 23 May 2021 00:08:20 -0700
To: Warner Losh
Cc: freebsd-current, freebsd-arm
List-Id: Porting FreeBSD to ARM processors
List-Archive: https://lists.freebsd.org/archives/freebsd-arm

On 2021-May-22, at 22:16, Warner Losh wrote:

> On Sat, May 22, 2021 at 10:44 PM Mark Millard via freebsd-arm wrote:
> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
> # diff -r /usr/ports/ /mnt/ | more
> nvme0: cpl does not map to outstanding cmd
> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0
> panic: received completion for unknown cmd
>
> cid 0x7e has no currently active command. The cid is used by the
> driver to map completions back to requests.
>
> So, there's usually 3 possibilities that I've seen this with.
>
> (1) There's a missing cache flush so you get a bogus cpl back
> because something stale was read. It's unlikely to be this one
> because the rest of this looks like a successful command completed:
> sc = 0 is successful completion and sct is a generic command queued.
>
> (2) We're looking at the completion record twice because we failed
> to properly update the head pointer and we've already completed the
> command. I've only ever seen this in a panic situation where we
> interrupt the completion routine because something else paniced.
>
> (3) There's something that's corrupting the act_tr array in the
> qpair. I've not seen this, but if something else smashes that area
> (zeroing it in this case), then that could cause an error like this.
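As a note mostly for my own understanding: the picture I take from
the above is a per-qpair array of tracker pointers indexed by cid,
where a NULL slot means no command with that cid is outstanding. A
toy user-space model of just that mapping (only act_tr, num_entries,
and the cid value are taken from the driver and the panic; struct
tracker and the rest are my own simplification, not the driver's
actual code):

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_ENTRIES 0x100       /* qpair->num_entries, see below */

    struct tracker { unsigned cid; };  /* stand-in for struct nvme_tracker */

    int
    main(void)
    {
            /* act_tr model: allocated zeroed (M_ZERO in the driver),
               one slot per possible cid */
            struct tracker **act_tr = calloc(NUM_ENTRIES, sizeof(*act_tr));
            unsigned cid = 0x7e;        /* the cid from the panic */

            if (act_tr == NULL)
                    return (1);
            assert(cid < NUM_ENTRIES);  /* the num_entries > cpl.cid check */
            if (act_tr[cid] == NULL)    /* no tracker: the panic condition */
                    printf("cpl does not map to outstanding cmd "
                        "(cid 0x%x)\n", cid);
            free(act_tr);
            return (0);
    }

On that model, (2) would be acting on a stale cpl after the slot was
already set back to NULL, and (3) would be the slots getting zeroed
out from under a still-outstanding command.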
Of note may be that I buildworld and buildkernel with extra
tuning enabled, targeting the cortex-a72. In one past example
this led to finding a missing synchronization related to
XHCI handling that was fixed. (The fix was not aarch64
specific at all.) For that:

A cortex-a53 did not show the problem with or without that
tuning. A cortex-a72 showed the problem only with the
cortex-a72 tuning, not with targeting a cortex-a53 tuning
or generic armv7, for example.

Not that I've any evidence specifically suggesting such
would be involved here. But it might be good to keep in
mind as a possibility.

> Or it could be something new I've not seen nor thought about before.
>
> cpuid = 3
> time = 1621743752
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x188
> panic() at panic+0x44
> nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc
> nvme_timeout() at nvme_timeout+0x3c
> softclock_call_cc() at softclock_call_cc+0x124
> softclock() at softclock+0x60
> ithread_loop() at ithread_loop+0x2a8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 12 tid 100028 ]
> Stopped at kdb_enter+0x48: undefined f904411f
> db>
>
> Based on the "nvme" references, I expect this is tied to
> handling the Optane 480 GiByte that is in the PCIe slot
> and is the boot/only media for the machine doing the diff.
>
> "db> dump" seems to have worked.
>
> After the reboot, zpool scrub found no errors.
>
> For reference:
>
> # uname -apKU
> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1
> main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021
> root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72
> arm64 aarch64 1400013 1400013
>
> If you have the dump, I suggest starting to make sure that the
> act_tr array looks sane. Make sure all the live pointers point to a
> sane looking tr. Make sure that tr is on the active list, etc
>
> It will take a fair amount of driver reading, though, to see how we
> got here. I'd also check to make sure that qpair->num_entries >
> cpl.cid (0x7e in this case).

Okay. I got this while trying to test an odd diff -r
over NFS issue with the more recent software. So the
two will potentially compete for time. As investigation
will be exploratory for me, not familiar, I'll probably
publish periodic notes on things as I go along looking
at stuff.

My first is that the /var/crash/core.txt.0 has a gdb
backtrace:

. . .
#10 0xffff00000047900c in panic (fmt=0x12)
    at /usr/main-src/sys/kern/kern_shutdown.c:843
#11 0xffff0000002226b4 in nvme_qpair_process_completions (
    qpair=qpair@entry=0xffffa00008724300)
    at /usr/main-src/sys/dev/nvme/nvme_qpair.c:617
#12 0xffff000000223354 in nvme_timeout (arg=arg@entry=0xffffa0000b053980)
    at /usr/main-src/sys/dev/nvme/nvme_qpair.c:938
#13 0xffff000000495bf8 in softclock_call_cc (c=0xffffa0000b0539a0,
    cc=cc@entry=0xffff000000de3500, direct=0)
    at /usr/main-src/sys/kern/kern_timeout.c:696
#14 0xffff000000495fb0 in softclock (arg=0xffff000000de3500)
    at /usr/main-src/sys/kern/kern_timeout.c:816
#15 0xffff0000004356dc in intr_event_execute_handlers (p=<optimized out>,
    ie=0xffffa000058bc700) at /usr/main-src/sys/kern/kern_intr.c:1168
#16 ithread_execute_handlers (p=<optimized out>, ie=0xffffa000058bc700)
    at /usr/main-src/sys/kern/kern_intr.c:1181
#17 ithread_loop (arg=<optimized out>, arg@entry=0xffffa000058aef60)
    at /usr/main-src/sys/kern/kern_intr.c:1269
#18 0xffff000000431f6c in fork_exit (callout=0xffff000000435430,
    arg=0xffffa000058aef60, frame=0xffff0000eb7cc990)
    at /usr/main-src/sys/kern/kern_fork.c:1083
#19

So via kgdb . . .

(kgdb) up 11
#11 0xffff0000002226b4 in nvme_qpair_process_completions (
    qpair=qpair@entry=0xffffa00008724300)
    at /usr/main-src/sys/dev/nvme/nvme_qpair.c:617
617                     KASSERT(0, ("received completion for unknown cmd"));
(kgdb) print/x cpl.cid
$4 = 0x7e
(kgdb) print/x qpair->num_entries
$5 = 0x100
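So Warner's bounds condition holds: cid 0x7e is well inside the
0x100 entries. The figures also size the array for inspection:
0x100 tracker pointers at 8 bytes each is 0x800 bytes, which is
exactly what x/256g (256 8-byte giant words) covers below.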
Based on also seeing the code:

    qpair->act_tr = malloc_domainset(sizeof(struct nvme_tracker *) *
        qpair->num_entries, M_NVME, DOMAINSET_PREF(qpair->domain),
        M_ZERO | M_WAITOK);

(kgdb) print qpair->act_tr
$6 = (struct nvme_tracker **) 0xffffa00008725800
(kgdb) x/256g 0xffffa00008725800
0xffffa00008725800:     0x0000000000000000      0x0000000000000000
0xffffa00008725810:     0x0000000000000000      0x0000000000000000
. . .
0xffffa00008725fe0:     0x0000000000000000      0x0000000000000000
0xffffa00008725ff0:     0x0000000000000000      0x0000000000000000

It was all zeros (null pointers). No "live" pointers and,
so, no tr's to inspect.

As none of this is familiar context beyond general programming
concepts, it may be some time before I find anything else
potentially of interest to report. If you have other specific
things you would like me to look at, let me know.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)
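P.S. In case it is of use: rather than eyeballing the x/256g output,
kgdb can count the non-NULL act_tr slots with a plain gdb
convenience-variable loop (typed from memory and untested, and run
in frame 11 so that qpair is in scope):

    (kgdb) set $i = 0
    (kgdb) set $n = 0
    (kgdb) while $i < 256
     >if qpair->act_tr[$i] != 0
      >set $n = $n + 1
      >end
     >set $i = $i + 1
     >end
    (kgdb) print $n

That should print 0 here, matching the all-zeros dump above.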