From nobody Sun May 23 07:46:28 2021 X-Original-To: freebsd-current@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 5A66D9EFA7D for ; Sun, 23 May 2021 07:46:34 +0000 (UTC) (envelope-from marklmi@yahoo.com) Received: from sonic303-23.consmr.mail.gq1.yahoo.com (sonic303-23.consmr.mail.gq1.yahoo.com [98.137.64.204]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4FnsqF20d0z3Lql for ; Sun, 23 May 2021 07:46:33 +0000 (UTC) (envelope-from marklmi@yahoo.com) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s2048; t=1621755991; bh=jpUAxwPUKV4YRVE0U49BwtDYeq4DS9TcdHFLKPAKesk=; h=Subject:From:In-Reply-To:Date:Cc:References:To:From:Subject:Reply-To; b=O/e+9MhOI/b8U8L5XEeZv5jvwD/KUvzkjwLe6xFDnqGOfUknaoadrtohZR/HwIhMiDMir0Fp35FTNbQWzErMlA2NJRRYJ4MEBQvt0mi7z3igcCAvF/F0CjLXj0xGU9LJmioV9unMp901vAZd8iA0XEYvBI4/cHrlsvKmezhy0OXjuChnbKNi9GNALaiPRrkj51mEXvXKM4b8PUDhTqD0SgcyqzVv530yIZdHuOmWqjWEUodXL2Tgu0MJUO0C85X35HK/XxoBtKlQd2dFMvWcLyuhR0/gFfUg5cOeU7+C/nSgVeb7Z+kzR84ccuQkV+Zm3GCWt4AWN0f3i0bGov5hOQ== X-SONIC-DKIM-SIGN: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s2048; t=1621755991; bh=tRQ7LHWu2ZNkmjYX7h2hfbGSXVFlPGZ+t/+IbA+MPvQ=; h=X-Sonic-MF:Subject:From:Date:To:From:Subject; b=MYUqKJ+gJ372GzRfpNeolpKfT8gmApa4Ypt8Ia5iHMqcnr1Wm4PLgC4InOlVrYU/72E6NcAqqujZ3CkoALGaykZVWGMRiNI6wILe1rbu7iMbw9f7kkKKlJ7mEKfgR2R0F52aZxpYo2ER2SQ1TJPo8Q2wXWK54NYR1nIP2BgKQesoxd8RwzkLw5ynU7BWmG5/7zPymJo6KH9biwojj4Y2NbutiyQifi5SYqrwxy+00gB4j5/ct/W9b1Y9j3KLd1htz1JkBEdixfC+TrAUXaJ046w9Rm0MbP0jpaWtHMiBLgLTXDgCVdkCsSyO9iiM3DI9PEwysK6THVBC3qW796NbDw== X-YMail-OSG: OO.S77gVM1k1eVFVErNFy7UFSFZPII9OzbZgAJ6MXYI4ZNF51ulTg2zuDncEkIA tQs3PxAwX5nfL5EmB4rytdMu8bUV6qwPs0qpwiwZS1sCmprzYAWM75MeWydAl1_tFgUoWI6_bP3j 7z2WUZyZBFTM686.kZ8sIkryRhG2R4CdynlHRQIzNREkfmMsGAV4U0bBH6wMkp2VfnG3Ft3_7e_e OSkLNpv.u6Zvrh9HEg6Cnuy331jQM_YydwQX_B..OAIm1sigNtXdJL3zifCP1.7vQg_TseCDfDfj YGplvWuUcVbchs5P6aCDo8SzAyQYIQRb9t2o2xRD.jVgud83Mjt1JcEy0lDN6Z_sG9nnNmgjC3NB 24abP.k_xamk_B1lIbXcw2hzXyrRJtrsrqCsz8Rh89tGTC3vUti6zHhpsAssnnN.816VecNlQCjx mJP5uoBvPWRTgqMSRLz2J0.VOiVsLQbG5YZE6U6AwKGBHgdqAUFQnrUgQ_n48HenO6geEYAJPV6u JXeBe01S2QSw.juPyAXh8KXhhEs2tvpJM.BmTlxtrYHIDnIb6EkKO26fMKgtW2hB.8azTGsP159v UzF3yEeibmwrLp3XBR.TdHC5yVCv4.FSEfkavvcVMqb1Erdx5zaWhChWvcxdtBFdBZwHon3tDJLM Hyhd9mr8_.rJ9rjLI2PVeAlAR5UPfeXsvTvh2R71dpDS2tturrdpUU_z5M2TzTfqq6rS81MuId0g wOnMj.KAjtfpiyOXLHJ9Cn.qH9.fkZCI.Gc5IFXNPhj672xjGLu2DhvsEHgsRizwE68.1s2TRSVo xOl2SCeOQCxV3J.WilT9Kma2phMmgRek0vOCvAxyulQY.w3MTHlqE49P9B.cl9xfqy2RkiKi3Ypn BJ8aQmMAVHVBPF4.u_pgHJEpCfrBo6yrTIFt_Cc73aSbApvddSL4Qp9GSXksdw1lUrvdsYHACFLT 4gHTcQ1i5skuHYDwkD55r0uHZMoATWvdFUcYqFQ4gO7mzDIDYl3W9XYFquuxd8sKOtvYVIGalt5a yEycWbVTIEuTXWqwtoA4IMtUSKXWDBNI1brXv5IvOnIuVasSRKJkgmW89C3RtiZaxnzoNycFMMkU TYGSSGKA9NIhmSZ6K5aOVaCqT7GCFMT54Yj8cuiRUubR6XjTTxewzIICvF3q7UtR9W2PTmz.5QLB n_.Mu34aCgWmyApA.uZBpFi8mPUNN5.Vj8k5krE3Ee4Imip8KrIlQ6zJV63Ph4lxZRsaqNDJg4BX oCxFU6y1n0dnySLtbBPrWn.adSyPviN.RgIUQ4Ri_bXSu2aAjXLVV6k7aCLrpSO7Y.Ps6UiXYIFE aY.COP8AFq17YCOqtXvsWCsreLhvrXS7LrgK3jewshYvNgA81OcjVEMlTn2imnz0Zfo3aVj7NqW9 ZNhoY6U8T1_U963g4nBKR0G6LK5zPgYrkocmPSKszAjN5JwgBImvf2p8OK93m7JaTslGnFI6ex3V 9rt1RwywM7fHS7RkOwIUO10np6t.bp4uy059QXXONG1zIokDp85zYY4ohMFEVlK4b6f6J87XCA0B 14ifCuowhBEHjcEWEzjtqXxfSDcGAiEIIDta3DJ9iF.406HkQUFx6VnACbunXcW2wYHGT_oq8j6M plcs.eK1UdUSlLneNID0n1oSgnDi2OQPi7jGkGRLeEW9vSAhxpOCCZvfl7zhT88Nb8Fy.ixiYjkz adQXu5qfaDyn32PcWX7Psf0s_50EG1r2AP1p2dC96r0AxQjs.EYMYlY8b88ejF6bvuuk5eKER9UT 5lLVC_zahhSepFRnpfWMYeZcjUxuTGOZrllp5b043RrClfCFlNj2t11pybY9ZRWJrbTSnAGAASBh PZbHlungGSj9RNFOlQ4mB2OcX_3_tPcln05P4lmHTQOWT2xaJgMqqngOh1Gp1Dsz0sIW.wzBIMKG R5o5sSNcEagYMvAEDbI4iRL5Cnr2UfEMGo10kZd8HIP3ASV9CgdC7VZIrU9I1xDFMF3F0X4lXD2d nz69gtJ8F.odpm7udJhzVhr4XuUylrFnGvBd1CZi5oWe5HTK_0dfrcrHjcZNqUfhxs0PNDXNRIqx FgK2R93TTC26woyi7TNzb08g5as_hySn44sn5P1_NiPlxpwYdlJFz.nMjkNCjOXBlgRDZyqF2kZG bHZJARAYpzWio8EoC7wFlyBkTSsPFAd17vUKjSAHNxfA7QUolMqCKy0E2XNXIuEafnAKjHGmy_x6 aSafPYrHoAcfoIeh0Hc25jO0O1SMDZ0mss0zlCNuN8lT9YLVaVnErZvkCX5xAzSw9SnyZhc9CeFI blvIvQ_O5MJdAY50K5qKQt18fXfdHKcuxiSzABgS5mxNjuR5ADWM8V1ypPOeJAxN5irh1vS6WXWb OG7MutxqGNfJrqk1eOfQ6kYLYYHM1t0Z4760cDYUAR88OqSYY1nPFr_n4BnCWwLYzZCWIg15WQd1 ni69K39MLap31HeVvRqzRc3fmen2Rh6PsoHO6.c1SLHX0_xBXuMtdi62fmpTIef4OpXnnPxGXDDZ gMFV38X4_iY.d2cCFEmFn8lioevpLrROEdpUNMPiT_uoF9Eo2x5_KyvCv6P5Byh0TtSoR7NFmxka WzEQuuEcGv71MisPmHqUd3pbkRKvR90ALd0xIn2tZ1NpWQAS.AqQ79qcxlVOgbWSHcsRlGVIOXlw - X-Sonic-MF: Received: from sonic.gate.mail.ne1.yahoo.com by sonic303.consmr.mail.gq1.yahoo.com with HTTP; Sun, 23 May 2021 07:46:31 +0000 Received: by kubenode559.mail-prod1.omega.gq1.yahoo.com (VZM Hermes SMTP Server) with ESMTPA ID 27ea3b9f74aa48c6aeada5591ab1e273; Sun, 23 May 2021 07:46:29 +0000 (UTC) Content-Type: text/plain; charset=us-ascii List-Id: Discussions about the use of FreeBSD-current List-Archive: https://lists.freebsd.org/archives/freebsd-current List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-current@freebsd.org Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.80.0.2.43\)) Subject: Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot In-Reply-To: Date: Sun, 23 May 2021 00:46:28 -0700 Cc: freebsd-current , freebsd-arm Content-Transfer-Encoding: quoted-printable Message-Id: <90A23579-61B7-4619-AFD7-68439FCC16F3@yahoo.com> References: <063D5E36-126F-497C-97AF-827BADC1ED2F.ref@yahoo.com> <063D5E36-126F-497C-97AF-827BADC1ED2F@yahoo.com> To: Warner Losh X-Mailer: Apple Mail (2.3654.80.0.2.43) X-Rspamd-Queue-Id: 4FnsqF20d0z3Lql X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=pass header.d=yahoo.com header.s=s2048 header.b=O/e+9MhO; dmarc=pass (policy=reject) header.from=yahoo.com; spf=pass (mx1.freebsd.org: domain of marklmi@yahoo.com designates 98.137.64.204 as permitted sender) smtp.mailfrom=marklmi@yahoo.com X-Spamd-Result: default: False [-3.50 / 15.00]; FREEMAIL_FROM(0.00)[yahoo.com]; MV_CASE(0.50)[]; R_SPF_ALLOW(-0.20)[+ptr:yahoo.com]; TO_DN_ALL(0.00)[]; DKIM_TRACE(0.00)[yahoo.com:+]; DMARC_POLICY_ALLOW(-0.50)[yahoo.com,reject]; NEURAL_HAM_SHORT(-1.00)[-1.000]; FROM_EQ_ENVFROM(0.00)[]; RCVD_TLS_LAST(0.00)[]; MIME_TRACE(0.00)[0:+]; FREEMAIL_ENVFROM(0.00)[yahoo.com]; ASN(0.00)[asn:36647, ipnet:98.137.64.0/20, country:US]; MID_RHS_MATCH_FROM(0.00)[]; DWL_DNSWL_NONE(0.00)[yahoo.com:dkim]; ARC_NA(0.00)[]; RBL_DBL_DONT_QUERY_IPS(0.00)[98.137.64.204:from]; R_DKIM_ALLOW(-0.20)[yahoo.com:s=s2048]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; FROM_HAS_DN(0.00)[]; RCPT_COUNT_THREE(0.00)[3]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[text/plain]; SPAMHAUS_ZRD(0.00)[98.137.64.204:from:127.0.2.255]; TO_MATCH_ENVRCPT_SOME(0.00)[]; RCVD_IN_DNSWL_NONE(0.00)[98.137.64.204:from]; RWL_MAILSPIKE_POSSIBLE(0.00)[98.137.64.204:from]; RCVD_COUNT_TWO(0.00)[2]; MAILMAN_DEST(0.00)[freebsd-current] Reply-To: marklmi@yahoo.com From: Mark Millard via freebsd-current X-Original-From: Mark Millard On 2021-May-23, at 00:08, Mark Millard via freebsd-current = wrote: > On 2021-May-22, at 22:16, Warner Losh wrote: >=20 >> On Sat, May 22, 2021 at 10:44 PM Mark Millard via freebsd-arm = wrote: >> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/ >> # diff -r /usr/ports/ /mnt/ | more >> nvme0: cpl does not map to outstanding cmd >> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0 >> panic: received completion for unknown cmd >>=20 >> cid 0x7e has no currently active command. The cid is used by the = driver >> to map completions back to requests. >>=20 >> So, there's usually 3 possibilities that I've seen this with. >>=20 >> (1) There's a missing cache flush so you get a bogus cpl back because = something stale >> was read. It's unlikely to be this one because the rest of this look = like a successful >> command completed: sc =3D 0 is successful completion and sct is a = generic command queued. >>=20 >> (2) We're looking at the completion record twice because we failed to = properly update the >> head pointer and we've already completed the command. I've only ever = seen this in a >> panic situation where we interrupt the completion routine because = something else >> paniced. >>=20 >> (3) There's something that's corrupting the act_tr array in the = qpair. I've not seen this, >> but if something else smashes that area (zeroing it in this case), = then that could cause >> an error like this. >=20 > Of note may be that I buildworld and buildkernel with extra > tuning enabled, targeting the cortex-a72. In one past example > this lead to finding a missing synchronization related to XHCI > handling that was fixed. (The fix was not aarch64 specific at > all.) For that: A cortex-a53 did not show the problem with or > without that tuning. A cortex-a72 showed the problem only with > the cortex-a72 tuning, not with targeting a cortex-a53 tuning > or generic armv7, for example. >=20 > Not that I've any evidence specifically suggesting such would > be involved here. But it might be good to keep in mind as a > possaibility. >=20 >> Or it could be something new I've not seen nor thought about before. >>=20 >> cpuid =3D 3 >> time =3D 1621743752 >> KDB: stack backtrace: >> db_trace_self() at db_trace_self >> db_trace_self_wrapper() at db_trace_self_wrapper+0x30 >> vpanic() at vpanic+0x188 >> panic() at panic+0x44 >> nvme_qpair_process_completions() at = nvme_qpair_process_completions+0x1fc >> nvme_timeout() at nvme_timeout+0x3c >> softclock_call_cc() at softclock_call_cc+0x124 >> softclock() at softclock+0x60 >> ithread_loop() at ithread_loop+0x2a8 >> fork_exit() at fork_exit+0x74 >> fork_trampoline() at fork_trampoline+0x14 >> KDB: enter: panic >> [ thread pid 12 tid 100028 ] >> Stopped at kdb_enter+0x48: undefined f904411f >> db>=20 >>=20 >> Based on the "nvme" references, I expect this is tied to >> handling the Optane 480 GiByte that is in the PCIe slot >> and is the boot/only media for the machine doing the diff. >>=20 >> "db> dump" seems to have worked. >>=20 >> After the reboot, zpool scrub found no errors. >>=20 >> For reference: >>=20 >> # uname -apKU >> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 = main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021 = root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.= aarch64/sys/GENERIC-DBG-CA72 arm64 aarch64 1400013 1400013 >>=20 >> If you have the dump, I suggest starting to make sure that the act_tr = array looks sane. Make >> sure all the live pointers point to a sane looking tr. Make sure that = tr is on the active list, etc >>=20 >> It will take a fair amount of driver reading, though, to see how we = got here. I'd also check to >> make sure that qpair->num_enttries > cpl.cid (0x3e in this case). >>=20 >=20 > Okay. I got this while trying to test an odd diff -r over NFS > issue with the more recent software. So the two will potentially > compete for time. >=20 > As investigation will be exploratory for me, not familiar, I'll > probably publish periodic notes on things as I go along looking > at stuff. >=20 > My first is that the /var/crash/core.txt.0 has a gdb backtrace: >=20 > . . . > #10 0xffff00000047900c in panic ( > fmt=3D0x12 ) > at /usr/main-src/sys/kern/kern_shutdown.c:843 > #11 0xffff0000002226b4 in nvme_qpair_process_completions ( > qpair=3Dqpair@entry=3D0xffffa00008724300) > at /usr/main-src/sys/dev/nvme/nvme_qpair.c:617 > #12 0xffff000000223354 in nvme_timeout = (arg=3Darg@entry=3D0xffffa0000b053980) > at /usr/main-src/sys/dev/nvme/nvme_qpair.c:938 > #13 0xffff000000495bf8 in softclock_call_cc (c=3D0xffffa0000b0539a0,=20= > cc=3Dcc@entry=3D0xffff000000de3500 , direct=3D0) > at /usr/main-src/sys/kern/kern_timeout.c:696 > #14 0xffff000000495fb0 in softclock (arg=3D0xffff000000de3500 = ) > at /usr/main-src/sys/kern/kern_timeout.c:816 > #15 0xffff0000004356dc in intr_event_execute_handlers (p=3D,=20 > ie=3D0xffffa000058bc700) at /usr/main-src/sys/kern/kern_intr.c:1168 > #16 ithread_execute_handlers (p=3D, = ie=3D0xffffa000058bc700) > at /usr/main-src/sys/kern/kern_intr.c:1181 > #17 ithread_loop (arg=3D, arg@entry=3D0xffffa000058aef60)= > at /usr/main-src/sys/kern/kern_intr.c:1269 > #18 0xffff000000431f6c in fork_exit ( > callout=3D0xffff000000435430 , = arg=3D0xffffa000058aef60,=20 > frame=3D0xffff0000eb7cc990) at = /usr/main-src/sys/kern/kern_fork.c:1083 > #19 >=20 > So via kgdb . . . >=20 > (kgdb) up 11 > #11 0xffff0000002226b4 in nvme_qpair_process_completions = (qpair=3Dqpair@entry=3D0xffffa00008724300) at = /usr/main-src/sys/dev/nvme/nvme_qpair.c:617 > 617 KASSERT(0, ("received completion for = unknown cmd")); >=20 > (kgdb) print/x cpl.cid > $4 =3D 0x7e > (kgdb) print/x qpair->num_entries > $5 =3D 0x100 >=20 > Based on also seeing the code: >=20 > qpair->act_tr =3D malloc_domainset(sizeof(struct nvme_tracker = *) * > qpair->num_entries, M_NVME, DOMAINSET_PREF(qpair->domain), > M_ZERO | M_WAITOK); >=20 > (kgdb) print qpair->act_tr > $6 =3D (struct nvme_tracker **) 0xffffa00008725800 > (kgdb) x/256g 0xffffa00008725800 > 0xffffa00008725800: 0x0000000000000000 0x0000000000000000 > 0xffffa00008725810: 0x0000000000000000 0x0000000000000000 > . . . > 0xffffa00008725fe0: 0x0000000000000000 0x0000000000000000 > 0xffffa00008725ff0: 0x0000000000000000 0x0000000000000000 >=20 > It was all zeros (null pointers). No "live" pointers and, so, > no tr's to inspect. >=20 > As none of this is familiar context beyond general programming > concepts, it may be some time before I find anything else > potentially of interest to report. If you have other specific > things you would like me to look at, let me know. >=20 A fairly obvious thing I should have provided: (kgdb) print/x *qpair $15 =3D {ctrlr =3D 0xffff0000fe154000, id =3D 0x3, domain =3D 0x0, cpu =3D= 0x2, vector =3D 0x3, rid =3D 0x4, res =3D 0xffffa000086ded80, tag =3D = 0xffffa0000877b780, num_entries =3D 0x100, num_trackers =3D 0x80,=20 sq_tdbl_off =3D 0x1018, cq_hdbl_off =3D 0x101c, phase =3D 0x1, sq_head = =3D 0x1f, sq_tail =3D 0x20, cq_head =3D 0x20, num_cmds =3D 0x420, = num_intr_handler_calls =3D 0xe66c, num_retries =3D 0x0, num_failures =3D = 0x0,=20 cmd =3D 0xffff000100ebb000, cpl =3D 0xffff000100ebf000, dma_tag =3D = 0xffffa0000b093e00, dma_tag_payload =3D 0xffffa000059ef000, queuemem_map = =3D 0xffffa00005a07700, cmd_bus_addr =3D 0xacbb000,=20 cpl_bus_addr =3D 0xacbf000, free_tr =3D {tqh_first =3D = 0xffffa0000b053a80, tqh_last =3D 0xffffa0000869da80}, outstanding_tr =3D = {tqh_first =3D 0xffffa0000b053980, tqh_last =3D 0xffffa0000b053980}, = queued_req =3D { stqh_first =3D 0x0, stqh_last =3D 0xffffa000087243c8}, act_tr =3D = 0xffffa00008725800, is_enabled =3D 0x1, lock =3D {lock_object =3D = {lo_name =3D 0xffff00000090321f, lo_flags =3D 0x1030000, lo_data =3D = 0x0,=20 lo_witness =3D 0xffffa0043fd96080}, mtx_lock =3D 0x0}} Looks like I need to boot into the non-debug builds for the other problem I'm testing for repeatability after a commit. =3D=3D=3D Mark Millard marklmi at yahoo.com ( dsl-only.net went away in early 2018-Mar)