[Bug 264141] nvme(4): Heavy load to SSD wedges 13.1 system: Controller in fatal status, resetting ... Resetting controller due to a timeout and possible hot unplug.

From: <bugzilla-noreply_at_freebsd.org>
Date: Tue, 05 Jul 2022 23:04:50 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264141

--- Comment #23 from Warner Losh <imp@FreeBSD.org> ---
(In reply to dgilbert from comment #22)
> theory: FreeBSD is stomping on the host DRAM reserved for the NVME

There's no host RAM reserved for NVMe, per se. The driver will optionally
allocate memory for the drive to use, however. Do you have "nvmeX: Allocated
%lluMB host memory buffer" in your dmesg? Without it, the drive isn't using any
host memory. If you think this is the cause of the problem, you can also set
the tunable hw.nvme.hmb_max=0 to disable host memory use for DRAM-less cards,
at the cost of some additional latency. That would rule it out. There may also
be some cards that lose their minds when this is enabled, though I've not seen
reports of that in the Linux world (I could easily have missed them). Ruling
this in or out would be useful.
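
As a rough aid for the dmesg check described above, here is a minimal Python
sketch that scans dmesg text for the host memory buffer message. The regular
expression is derived only from the message format quoted above; the helper
names find_hmb_allocations and summarize are hypothetical and not part of any
FreeBSD tool.

```python
import re

# Built from the message format quoted in the comment:
#   "nvmeX: Allocated %lluMB host memory buffer"
# Assumption: X is a decimal unit number and the size is a decimal integer.
HMB_LINE = re.compile(r"nvme(\d+): Allocated (\d+)MB host memory buffer")


def find_hmb_allocations(dmesg_text):
    """Return {controller unit number: host memory buffer size in MB}."""
    return {
        int(unit): int(size_mb)
        for unit, size_mb in HMB_LINE.findall(dmesg_text)
    }


def summarize(dmesg_text):
    """Return a short human-readable report of HMB use found in dmesg text."""
    allocations = find_hmb_allocations(dmesg_text)
    if not allocations:
        return "no host memory buffer allocated; HMB is not in use"
    return "\n".join(
        f"nvme{unit}: {size_mb}MB host memory buffer in use"
        for unit, size_mb in sorted(allocations.items())
    )
```

One way to use it is to pass in captured dmesg output, for example
summarize(subprocess.run(["dmesg"], capture_output=True, text=True).stdout).
If a buffer is reported, booting with hw.nvme.hmb_max=0 (tunables like this are
normally set in /boot/loader.conf) and re-running the check should show
whether the buffer is gone before retrying the heavy load.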

But corrupting host memory seems unlikely to be the cause, given that the card
drops off the bus and has its memory BARs reset, so it isn't decoding anything
(which is what the "possible hot unplug" messages indicate). In the cases that
I've diagnosed, this has pointed to some kind of power or connection issue to
the card, a faulty power controller on the card, or wonky firmware. There might
be an additional cause that's still unknown, but absent better evidence I'm at
a loss for where to look.
