nvme detached
Graham Perrin
grahamperrin at gmail.com
Wed Aug 4 17:35:24 UTC 2021
On 04/08/2021 18:08, Dan Langille wrote:
> Yesterday I had an NVMe stick detach. This degraded a zpool, but zpool status indicated the device was still online. Yet it was not visible in /dev/.
>
> More details are at https://gist.github.com/dlangille/bc8af0f5a098d3a106fa5fbf40a88d42
>
> I first noticed the issue with multiple ssh sessions freezing up.
>
> Then Nagios started alerting. A reboot cleared this up. scrubs did not find any errors.
>
> The /var/log/messages entries below.
>
> Thank you.
>
> Aug 3 15:06:02 knew kernel: nvme0: Resetting controller due to a timeout.
> Aug 3 15:06:02 knew kernel: nvme0: resetting controller
> Aug 3 15:06:32 knew kernel: nvme0: controller ready did not become 0 within 30500 ms
> Aug 3 15:06:32 knew kernel: nvme0: failing queued i/o
> Aug 3 15:06:32 knew kernel: nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
> Aug 3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
> Aug 3 15:06:32 knew kernel: nvme0: failing outstanding i/o
> Aug 3 15:06:32 knew kernel: nvme0: READ sqid:2 cid:123 nsid:1 lba:250153507 len:5
> Aug 3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:123 cdw0:0
> Aug 3 15:06:32 knew kernel: nvme0: failing outstanding i/o
> Aug 3 15:06:32 knew kernel: nvme0: WRITE sqid:3 cid:118 nsid:1 lba:454009346 len:1
> Aug 3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:118 cdw0:0
> Aug 3 15:06:32 knew kernel: nvme0: failing outstanding i/o
> Aug 3 15:06:32 knew kernel: nvme0: WRITE sqid:4 cid:122 nsid:1 lba:454009345 len:1
> Aug 3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:122 cdw0:0
> Aug 3 15:06:32 knew kernel: nvd0: detached
>
The STATE peculiarity aside: if you have a spare to replace what's
currently at nvd0, I should put it in place.
Then stress test the removed stick, to tell whether it's good for reuse.
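In zpool terms, the swap is something like the sketch below. The pool name "tank" and the spare at nvd1 are hypothetical stand-ins; the gist linked above shows the real pool layout.

```shell
# Hypothetical names: "tank" for the degraded pool, nvd1 for the spare stick.
# First check what ZFS currently believes about the pool:
zpool status tank

# Swap the spare in for the detached device; resilvering starts automatically:
zpool replace tank nvd0 nvd1

# Follow the resilver, then confirm the pool returns to ONLINE:
zpool status -v tank
```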
A normal run of StressDisk might be enough to expose a problem; I
recently had a new drive (fewer than 100 hours' use) that failed
consistently after around seven minutes of the run (before filling the
UFS file system).
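For a quick sanity check without extra tools, a minimal fill-and-verify loop in plain sh does the same kind of thing on a much smaller scale. This is a sketch of the general technique, not the tool named above; the target directory is a hypothetical mount point for the stick under test.

```shell
#!/bin/sh
# Minimal fill-and-verify sketch (an assumption, not the tool named above).
# Writes pseudo-random files to a target directory, then reads them back
# twice and compares checksums; an unstable stick shows up as a diff here.
TARGET=${1:-/tmp/stress}    # hypothetical mount point of the stick under test
mkdir -p "$TARGET"

# Write four 1 MiB files of random data.
i=0
while [ "$i" -lt 4 ]; do
    dd if=/dev/urandom of="$TARGET/f$i" bs=1024 count=1024 2>/dev/null
    i=$((i + 1))
done

# Checksum the files twice; identical output means the reads are stable.
( cd "$TARGET" && cksum f0 f1 f2 f3 > first.txt )
( cd "$TARGET" && cksum f0 f1 f2 f3 > second.txt )
diff "$TARGET/first.txt" "$TARGET/second.txt" && echo "no mismatches"
```

On a healthy device this prints "no mismatches"; a proper stress tool runs the same write-then-verify cycle for hours and until the file system is full, which is what caught my failing drive.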
More information about the freebsd-questions mailing list