[Bug 211713] NVME controller failure: resetting
bugzilla-noreply at freebsd.org
Wed Mar 15 06:05:33 UTC 2017
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211713
--- Comment #19 from Warner Losh <imp at FreeBSD.org> ---
My Samsung 960 PRO works great. We have hundreds of other drives at work that
are doing close to 3.8GB/s steady for hours... So it can work. Let's dig
down a level.
So the 'reset' messages that Terry is seeing in the two screen shots he just
posted are either the result of some prankster doing an nvmecontrol reset
(quite unlikely), or the result of the driver calling reset internally. It does
this only when it gets a timeout for a command. Assuming for the moment that
the timeout code is good, there's a command that's coming back bad and we wind
up here:
nvme_timeout(void *arg)
...
	/* Read csts to get value of cfs - controller fatal status. */
	csts.raw = nvme_mmio_read_4(ctrlr, csts);
	if (ctrlr->enable_aborts && csts.bits.cfs == 0) {
		/*
		 * If aborts are enabled, only use them if the controller is
		 * not reporting fatal status.
		 */
		nvme_ctrlr_cmd_abort(ctrlr, tr->cid, qpair->id,
		    nvme_abort_complete, tr);
	} else
		nvme_ctrlr_reset(ctrlr);
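So we read CSTS (the controller status). If we've enabled aborts (which you
can do by setting the tunable hw.nvme.enable_aborts=1) and the controller isn't
reporting fatal status, we try to abort the timed-out command. Otherwise we do
a reset. The tunable defaults to 0, so the reset is the path we're most likely
taking, unless you've found this already and set it.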
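
To make the branch easier to reason about, here is a minimal standalone sketch of the same decision. It is not driver code: the enum and the function name choose_timeout_action are hypothetical, made up for illustration, and the only thing taken from nvme_timeout() is the condition itself.

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; none of this is in the nvme driver. */
enum timeout_action {
	TIMEOUT_ABORT,	/* driver would call nvme_ctrlr_cmd_abort() */
	TIMEOUT_RESET	/* driver would call nvme_ctrlr_reset() */
};

/*
 * Mirrors the condition in nvme_timeout(): abort only when aborts are
 * enabled AND the controller is not reporting fatal status (cfs == 0).
 * Every other combination falls through to a controller reset.
 */
static enum timeout_action
choose_timeout_action(bool enable_aborts, unsigned int cfs)
{
	if (enable_aborts && cfs == 0)
		return (TIMEOUT_ABORT);
	return (TIMEOUT_RESET);
}
```

With the default hw.nvme.enable_aborts=0, choose_timeout_action(false, cfs) yields TIMEOUT_RESET no matter what cfs is, which would match the reset messages in the screen shots.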
The reset turns out to be unsuccessful, and we drive off the road into the
ditch with the follow-on errors.
So, maybe set the tunable and try again. I'd normally ask about all the
stupid issues here: is power good, are the connections good, are you seeing
PCIe errors (pciconf -lbace nvmeX), etc., but I kinda assume with so many
reports that's unlikely to be fruitful for everybody.
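
For anyone who hasn't set a loader tunable before, a sketch of what that would look like, assuming the usual /boot/loader.conf syntax (it is read at boot, so a reboot is needed before the driver sees it):

```
# /boot/loader.conf
# Let the nvme driver try aborting a timed-out command before resetting.
hw.nvme.enable_aborts="1"
```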
Maybe I'll try to find a Samsung 950 Pro 512GB (which form factor do you have?)
and try as well, but that process will take about a week or two since I have an
offsite soon and I don't think I can get one here before then.
--
You are receiving this mail because:
You are the assignee for the bug.