[Bug 283189] Sporadic NVMe DMAR faults since updating to 14.2-STABLE
Date: Tue, 01 Apr 2025 06:34:16 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=283189

Jason A. Harmening <jah@FreeBSD.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |imp@FreeBSD.org,
                   |                            |jhb@FreeBSD.org,
                   |                            |kib@FreeBSD.org

--- Comment #2 from Jason A. Harmening <jah@FreeBSD.org> ---
I'm still seeing these faults every few days; still always NVMe writes, and
still always on the prp1 segment of a small transfer (usually 8 or 16 bytes;
I don't think I've ever seen one larger than 128 bytes).

I haven't found any hardware or BIOS errata that would explain these faults.
It really seems like a software issue, especially since it started
immediately after updating from 13 to 14. That said, I also haven't found any
change between 13 and 14 in either the IOMMU or NVMe driver stacks that would
explain this. It might just be that 14 brought some change in workload (maybe
even at the ZFS or block I/O layers) that exposed an existing bug.

So far this hasn't had visible consequences: no impact to the stability of
the machine, and no impact to data *that I care about*. I've never seen this
error while explicitly saving a file, updating packages, or doing builds
(including heavily multithreaded world/kernel builds). It just seems to
happen at random when the machine is mostly idle, with no clue as to what
userspace activity triggered the fault. All errors have happened on writes,
so ZFS scrub shows no data integrity issues.

This is still a pretty annoying and concerning thing, though. Based on the
fallout we saw when we briefly tried to enable DMAR by default in -current
last summer, I suspect others may not have reported this simply because very
few people have DMAR enabled in the first place. We might therefore be in for
some trouble the next time we try to enable it by default.

Other random thoughts:

--Would instrumenting the NVMe error completion path with dtrace be useful,
maybe in determining what userspace activity triggers this?
--Could this just be an issue with something in the busdma mapping path not
correctly mapping these small transfers? I looked through nvme_qpair.c and
the common IOMMU GAS mapping routines and didn't see any obvious problem.

--Would it be useful to try disabling QI to see whether the problem goes
away? Could this be some QI-related race, e.g. leaving a stale write-only
IOTLB entry in place while the same VA is reused for an NVMe write (memory
read) operation?

CCing the DMAR/NVMe experts; any debugging advice is much appreciated.

-- 
You are receiving this mail because:
You are the assignee for the bug.