From nobody Sat May 29 08:15:34 2021
Subject: Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot
Date: Sat, 29 May 2021 01:15:34 -0700
To: Warner Losh
Cc: freebsd-current, freebsd-arm
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
From: Mark Millard via freebsd-current
Reply-To: marklmi@yahoo.com

On 2021-May-23, at 00:46, Mark Millard via freebsd-current wrote:

> On 2021-May-23, at 00:08, Mark Millard via freebsd-current wrote:
> 
>> On 2021-May-22, at 22:16, Warner Losh wrote:
>> 
>>> On Sat, May 22, 2021 at 10:44 PM Mark Millard via freebsd-arm wrote:
>>> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
>>> # diff -r /usr/ports/ /mnt/ | more
>>> nvme0: cpl does not map to outstanding cmd
>>> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0
>>> panic: received completion for unknown cmd
>>> 
>>> cid 0x7e has no currently active command. The cid is used by the driver
>>> to map completions back to requests.
>>> 
>>> So, there's usually 3 possibilities that I've seen this with.
>>> 
>>> (1) There's a missing cache flush, so you get a bogus cpl back because something stale
>>> was read. It's unlikely to be this one because the rest of this looks like a successful
>>> command completed: sc = 0 is successful completion and sct is a generic command queued.
>>> 
>>> (2) We're looking at the completion record twice because we failed to properly update the
>>> head pointer and we've already completed the command. I've only ever seen this in a
>>> panic situation where we interrupt the completion routine because something else
>>> panicked.
>>> 
>>> (3) There's something that's corrupting the act_tr array in the qpair. I've not seen this,
>>> but if something else smashes that area (zeroing it in this case), then that could cause
>>> an error like this.
>> 
>> Of note may be that I buildworld and buildkernel with extra
>> tuning enabled, targeting the cortex-a72. In one past example
>> this led to finding a missing synchronization related to XHCI
>> handling that was then fixed. (The fix was not aarch64 specific
>> at all.) For that case: a cortex-a53 did not show the problem
>> with or without that tuning. A cortex-a72 showed the problem
>> only with the cortex-a72 tuning, not when targeting a
>> cortex-a53 tuning or generic armv7, for example.
>> 
>> Not that I have any evidence specifically suggesting such would
>> be involved here. But it might be good to keep in mind as a
>> possibility.
>> 
>>> Or it could be something new I've not seen nor thought about before.
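For anyone not familiar with the driver, the cid-to-tracker mapping
described above works roughly like the sketch below. This is
hypothetical, simplified code (the qpair_sketch and lookup_tracker
names are made up for illustration), not the actual
nvme_qpair_process_completions() logic:

#include <stddef.h>

struct nvme_tracker;                     /* opaque here; holds per-request state */

struct qpair_sketch {                    /* made-up stand-in for struct nvme_qpair */
        unsigned int             num_entries;  /* 0x100 on this system */
        struct nvme_tracker    **act_tr;       /* active trackers, indexed by cid */
};

/*
 * Return the tracker for a completion's cid, or NULL when the cid does
 * not map to an outstanding command (the NULL case is what the panic
 * message above reports).
 */
static struct nvme_tracker *
lookup_tracker(struct qpair_sketch *qpair, unsigned int cid)
{
        if (cid >= qpair->num_entries)
                return (NULL);
        return (qpair->act_tr[cid]);
}

With the act_tr slots all NULL (as the kgdb inspection below shows),
any completion at all would hit the "unknown cmd" case.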
>>> 
>>> cpuid = 3
>>> time = 1621743752
>>> KDB: stack backtrace:
>>> db_trace_self() at db_trace_self
>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>>> vpanic() at vpanic+0x188
>>> panic() at panic+0x44
>>> nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc
>>> nvme_timeout() at nvme_timeout+0x3c
>>> softclock_call_cc() at softclock_call_cc+0x124
>>> softclock() at softclock+0x60
>>> ithread_loop() at ithread_loop+0x2a8
>>> fork_exit() at fork_exit+0x74
>>> fork_trampoline() at fork_trampoline+0x14
>>> KDB: enter: panic
>>> [ thread pid 12 tid 100028 ]
>>> Stopped at      kdb_enter+0x48: undefined       f904411f
>>> db>
>>> 
>>> Based on the "nvme" references, I expect this is tied to
>>> handling the Optane 480 GiByte that is in the PCIe slot
>>> and is the boot/only media for the machine doing the diff.
>>> 
>>> "db> dump" seems to have worked.
>>> 
>>> After the reboot, zpool scrub found no errors.
>>> 
>>> For reference:
>>> 
>>> # uname -apKU
>>> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72 arm64 aarch64 1400013 1400013
>>> 
>>> If you have the dump, I suggest starting by making sure that the act_tr array looks sane. Make
>>> sure all the live pointers point to a sane looking tr. Make sure that tr is on the active list, etc.
>>> 
>>> It will take a fair amount of driver reading, though, to see how we got here. I'd also check to
>>> make sure that qpair->num_entries > cpl.cid (0x7e in this case).
>>> 
>> 
>> Okay. I got this while trying to test an odd diff -r over NFS
>> issue with the more recent software. So the two will potentially
>> compete for time.
>> 
>> As the investigation will be exploratory for me (this is not
>> familiar code), I'll probably publish periodic notes on things
>> as I go along looking at stuff.
>> 
>> My first note is that /var/crash/core.txt.0 has a gdb backtrace:
>> 
>> . . .
>> #10 0xffff00000047900c in panic (
>>     fmt=0x12 )
>>     at /usr/main-src/sys/kern/kern_shutdown.c:843
>> #11 0xffff0000002226b4 in nvme_qpair_process_completions (
>>     qpair=qpair@entry=0xffffa00008724300)
>>     at /usr/main-src/sys/dev/nvme/nvme_qpair.c:617
>> #12 0xffff000000223354 in nvme_timeout (arg=arg@entry=0xffffa0000b053980)
>>     at /usr/main-src/sys/dev/nvme/nvme_qpair.c:938
>> #13 0xffff000000495bf8 in softclock_call_cc (c=0xffffa0000b0539a0,
>>     cc=cc@entry=0xffff000000de3500 , direct=0)
>>     at /usr/main-src/sys/kern/kern_timeout.c:696
>> #14 0xffff000000495fb0 in softclock (arg=0xffff000000de3500 )
>>     at /usr/main-src/sys/kern/kern_timeout.c:816
>> #15 0xffff0000004356dc in intr_event_execute_handlers (p=,
>>     ie=0xffffa000058bc700) at /usr/main-src/sys/kern/kern_intr.c:1168
>> #16 ithread_execute_handlers (p=, ie=0xffffa000058bc700)
>>     at /usr/main-src/sys/kern/kern_intr.c:1181
>> #17 ithread_loop (arg=, arg@entry=0xffffa000058aef60)
>>     at /usr/main-src/sys/kern/kern_intr.c:1269
>> #18 0xffff000000431f6c in fork_exit (
>>     callout=0xffff000000435430 , arg=0xffffa000058aef60,
>>     frame=0xffff0000eb7cc990) at /usr/main-src/sys/kern/kern_fork.c:1083
>> #19
>> 
>> So via kgdb . . .
>> 
>> (kgdb) up 11
>> #11 0xffff0000002226b4 in nvme_qpair_process_completions (qpair=qpair@entry=0xffffa00008724300) at /usr/main-src/sys/dev/nvme/nvme_qpair.c:617
>> 617                             KASSERT(0, ("received completion for unknown cmd"));
>> 
>> (kgdb) print/x cpl.cid
>> $4 = 0x7e
>> (kgdb) print/x qpair->num_entries
>> $5 = 0x100
>> 
>> Based on also seeing the code:
>> 
>>         qpair->act_tr = malloc_domainset(sizeof(struct nvme_tracker *) *
>>             qpair->num_entries, M_NVME, DOMAINSET_PREF(qpair->domain),
>>             M_ZERO | M_WAITOK);
>> 
>> (kgdb) print qpair->act_tr
>> $6 = (struct nvme_tracker **) 0xffffa00008725800
>> (kgdb) x/256g 0xffffa00008725800
>> 0xffffa00008725800:     0x0000000000000000      0x0000000000000000
>> 0xffffa00008725810:     0x0000000000000000      0x0000000000000000
>> . . .
>> 0xffffa00008725fe0:     0x0000000000000000      0x0000000000000000
>> 0xffffa00008725ff0:     0x0000000000000000      0x0000000000000000
>> 
>> It was all zeros (null pointers). No "live" pointers and, so,
>> no tr's to inspect.
>> 
>> As none of this is familiar context beyond general programming
>> concepts, it may be some time before I find anything else
>> potentially of interest to report. If you have other specific
>> things you would like me to look at, let me know.
>> 
> 
> A fairly obvious thing I should have provided:
> 
> (kgdb) print/x *qpair
> $15 = {ctrlr = 0xffff0000fe154000, id = 0x3, domain = 0x0, cpu = 0x2, vector = 0x3, rid = 0x4, res = 0xffffa000086ded80, tag = 0xffffa0000877b780, num_entries = 0x100, num_trackers = 0x80,
>   sq_tdbl_off = 0x1018, cq_hdbl_off = 0x101c, phase = 0x1, sq_head = 0x1f, sq_tail = 0x20, cq_head = 0x20, num_cmds = 0x420, num_intr_handler_calls = 0xe66c, num_retries = 0x0, num_failures = 0x0,
>   cmd = 0xffff000100ebb000, cpl = 0xffff000100ebf000, dma_tag = 0xffffa0000b093e00, dma_tag_payload = 0xffffa000059ef000, queuemem_map = 0xffffa00005a07700, cmd_bus_addr = 0xacbb000,
>   cpl_bus_addr = 0xacbf000, free_tr = {tqh_first = 0xffffa0000b053a80, tqh_last = 0xffffa0000869da80}, outstanding_tr = {tqh_first = 0xffffa0000b053980, tqh_last = 0xffffa0000b053980}, queued_req = {
>     stqh_first = 0x0, stqh_last = 0xffffa000087243c8}, act_tr = 0xffffa00008725800, is_enabled = 0x1, lock = {lock_object = {lo_name = 0xffff00000090321f, lo_flags = 0x1030000, lo_data = 0x0,
>     lo_witness = 0xffffa0043fd96080}, mtx_lock = 0x0}}
> 
> Looks like I need to boot into the non-debug builds for the
> other problem I'm testing for repeatability after a commit.

I've not figured out anything interesting so far. But I have run
into something that looks odd to me (not that I have any evidence
it is related to the panic; more likely it is my ignorance):

There is a use of atomic_store_rel_int(&qpair->cq_head, 0) for
which I do not find any matching atomic_load_acq_int use (or any
other explicit _acq), so there is no "synchronizes with" status
in the code, that I can find, to establish an ordering across
threads involving that atomic_store_rel_int; the reads of cq_head
are just implicit/default relaxed loads.
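For reference, the pairing that "man atomic" describes as
establishing "synchronizes with" looks roughly like the sketch
below. This is a hypothetical illustration (producer, consumer,
flag, and data are made up for the example), not proposed driver
code:

#include <sys/types.h>
#include <machine/atomic.h>

static volatile u_int flag;     /* made-up example variable */
static int data;                /* made-up example variable */

static void
producer(void)
{
        data = 42;                       /* ordinary store */
        atomic_store_rel_int(&flag, 1);  /* release store: orders the store above */
}

static int
consumer(void)
{
        /*
         * The acquire load pairs with the release store, so a reader
         * that observes flag == 1 is also guaranteed to observe
         * data == 42.  Without a matching _acq load (or an
         * atomic_thread_fence_acq before an ordinary load), the
         * release store by itself promises nothing to the reader.
         */
        if (atomic_load_acq_int(&flag) == 1)
                return (data);
        return (-1);
}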
A grep for "cq_head" under dev/nvme/ shows only:

./dev/nvme/nvme_private.h:      uint32_t                cq_head;
./dev/nvme/nvme_sysctl.c:       SYSCTL_ADD_UINT(ctrlr_ctx, que_list, OID_AUTO, "cq_head",
./dev/nvme/nvme_sysctl.c:           CTLFLAG_RD, &qpair->cq_head, 0,
./dev/nvme/nvme_qpair.c:         * below, but before we can reset cq_head to zero at 2. Also cope with
./dev/nvme/nvme_qpair.c:        if (qpair->cq_head == qpair->num_entries) {
./dev/nvme/nvme_qpair.c:                 * Here we know that we need to zero cq_head and then negate
./dev/nvme/nvme_qpair.c:                 * the phase, which hasn't been assigned if cq_head isn't
./dev/nvme/nvme_qpair.c:                        qpair->cq_head = 0;
./dev/nvme/nvme_qpair.c:                } else if (qpair->cq_head == 0) {
./dev/nvme/nvme_qpair.c:                cpl = qpair->cpl[qpair->cq_head];
./dev/nvme/nvme_qpair.c:         * qpair->cq_head at 1 below. Later, we re-enter this
./dev/nvme/nvme_qpair.c:         * won't have updated cq_head. Rather than panic again,
./dev/nvme/nvme_qpair.c:                        nvme_dump_completion(&qpair->cpl[qpair->cq_head]);
./dev/nvme/nvme_qpair.c:                if (++qpair->cq_head == qpair->num_entries) {          /* 1 */
./dev/nvme/nvme_qpair.c:                        atomic_store_rel_int(&qpair->cq_head, 0);      /* 2 */
./dev/nvme/nvme_qpair.c:            qpair->cq_hdbl_off, qpair->cq_head);
./dev/nvme/nvme_qpair.c:        qpair->sq_head = qpair->sq_tail = qpair->cq_head = 0;

(The line two above the last one has the atomic_store_rel_int use.)

An atomic_thread_fence_rel use would have "synchronizes with"
status based on ordinary loads reading something stored after the
atomic_thread_fence_rel. Such is documented in "man atomic". But
that is not what this code is doing.

"man atomic" does not mention ordinary loads getting such a status
by reading what an atomic_store_rel_int wrote. It only references
the atomic_thread_fence_rel related status for ordinary loads. So
I'm clueless about what is intended to be going on relative to
that "atomic_store_rel_int(&qpair->cq_head, 0)".

Overall, the code does not appear to me to match up with the
aarch64, powerpc64, or powerpc code generation requirements for
there to be any matching "synchronizes with" relationships. (I'll
not list machine instruction sequences here; I'd have to look up
the details. But, as I remember, more than a plain load is
involved in the code sequences for the acquire side on these
types of processors, and nothing in the source indicates to
generate that additional code as far as I can tell.)

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)