From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 283189] Sporadic NVMe DMAR faults since updating to 14.2-STABLE
Date: Tue, 01 Apr 2025 06:34:16 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 14.2-STABLE
X-Bugzilla-Keywords: regression
X-Bugzilla-Severity: Affects Only Me
X-Bugzilla-Who: jah@FreeBSD.org
X-Bugzilla-Status: New
X-Bugzilla-Resolution:
X-Bugzilla-Priority: ---
X-Bugzilla-Assigned-To: bugs@FreeBSD.org
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: cc
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
List-Archive: https://lists.freebsd.org/archives/freebsd-bugs
Sender: owner-freebsd-bugs@FreeBSD.org

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=283189

Jason A. Harmening changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |imp@FreeBSD.org,
                   |                            |jhb@FreeBSD.org,
                   |                            |kib@FreeBSD.org

--- Comment #2 from Jason A. Harmening ---
I'm still seeing these faults every few days or so; still always NVMe
writes, and still always on the prp1 segment of a small transfer (usually
8 or 16 bytes; I don't think I've ever seen > 128 bytes).

I haven't found any hardware or BIOS errata that would explain these
faults. It really seems like a SW issue, especially since it started
immediately after updating from 13 to 14. That said, I also haven't found
any change between 13 and 14 in either the IOMMU or NVMe driver stacks
that would seem to explain this. It might just be that 14 brought some
change in workload (maybe even at the ZFS or block I/O layers) that
exposed an existing bug.

So far this hasn't had visible consequences: no impact to the stability of
the machine, no impact to data *that I care about*. I've never seen this
error while explicitly saving a file, updating packages, or doing builds
(including heavily multithreaded world/kernel builds). It just seems to
happen at random when the machine is mostly idle, with no clue as to what
userspace activity triggered the fault. All errors have happened on
writes, so ZFS scrub shows no data integrity issues.

This is still a pretty annoying and concerning thing though. Based on the
fallout we saw when we briefly tried to enable DMAR by default in -current
last summer, I suspect others haven't reported this only because very few
people have DMAR enabled in the first place. We might therefore be in for
some trouble the next time we try to enable it by default.

Other random thoughts:

--Would instrumenting the NVMe error completion path with dtrace be
useful, maybe in determining what userspace activity triggers this?

--Could this just be an issue with something in the busdma mapping path
not correctly mapping these small transfers? I looked through
nvme_qpair.c and the common IOMMU GAS mapping routines and didn't see any
obvious problem.

--Would it be useful to try disabling QI to see if the problem goes away?
Could this be some QI-related race, e.g. leaving a stale write-only IOTLB
entry in place while the same VA is reused for an NVMe write (memory
read) operation?

CCing the DMAR/NVMe experts, any debugging advice is much appreciated.
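
For readers unfamiliar with the stale-IOTLB hypothesis raised in the last question, here is a minimal toy model of it in Python. It is only a sketch under stated assumptions, not a description of the FreeBSD DMAR driver or of real VT-d hardware: `ToyIommu`, `replay`, and the `invalidate_on_unmap` switch are hypothetical names, and the model assumes the hardware trusts a cached IOTLB entry without re-walking the page tables. It shows how an address first mapped write-only (device writes memory, as for an NVMe read) could fault when reused for a read-only mapping (device reads memory, as for an NVMe write) if the old entry is never invalidated.

```python
class ToyIommu:
    """Toy IOMMU: a page table plus an IOTLB that caches translations.

    Hypothetical model for illustration only; real hardware may differ.
    """

    def __init__(self, invalidate_on_unmap):
        self.page_table = {}   # iova -> permissions, "r" or "w"
        self.iotlb = {}        # cached copies of page_table entries
        self.invalidate_on_unmap = invalidate_on_unmap

    def map(self, iova, perms):
        self.page_table[iova] = perms

    def unmap(self, iova):
        del self.page_table[iova]
        if self.invalidate_on_unmap:
            # Correct behavior: drop the cached translation as well.
            self.iotlb.pop(iova, None)

    def device_access(self, iova, want):
        """Return True if the access is allowed, False if it would fault."""
        if iova not in self.iotlb:
            if iova not in self.page_table:
                return False
            self.iotlb[iova] = self.page_table[iova]
        # Assumption: a cached entry is trusted without a page-table re-walk.
        return want in self.iotlb[iova]


def replay(invalidate_on_unmap):
    """Reuse one IOVA for an NVMe read, then for an NVMe write."""
    iommu = ToyIommu(invalidate_on_unmap)
    iova = 0x1000
    iommu.map(iova, "w")                   # NVMe read: device writes memory
    assert iommu.device_access(iova, "w")  # caches a write-only entry
    iommu.unmap(iova)
    iommu.map(iova, "r")                   # NVMe write: device reads memory
    return iommu.device_access(iova, "r")


if __name__ == "__main__":
    print("with invalidation:   ", replay(True))
    print("without invalidation:", replay(False))
```

In this model the second access succeeds only when the unmap also invalidates the cached entry, which is the kind of ordering a lost or delayed invalidation would break.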