nvme detached

Graham Perrin grahamperrin at gmail.com
Wed Aug 4 17:35:24 UTC 2021


On 04/08/2021 18:08, Dan Langille wrote:
> Yesterday I had an NVMe stick detach. This degraded a zpool, but zpool status indicated the device was still online. Yet it was not visible in /dev/.
>
> More details are at https://gist.github.com/dlangille/bc8af0f5a098d3a106fa5fbf40a88d42
>
> I first noticed the issue with multiple ssh sessions freezing up.
>
> Then Nagios started alerting. A reboot cleared this up. Scrubs did not find any errors.
>
> The /var/log/messages entries are below.
>
> Thank you.
>
> Aug  3 15:06:02 knew kernel: nvme0: Resetting controller due to a timeout.
> Aug  3 15:06:02 knew kernel: nvme0: resetting controller
> Aug  3 15:06:32 knew kernel: nvme0: controller ready did not become 0 within 30500 ms
> Aug  3 15:06:32 knew kernel: nvme0: failing queued i/o
> Aug  3 15:06:32 knew kernel: nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
> Aug  3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
> Aug  3 15:06:32 knew kernel: nvme0: failing outstanding i/o
> Aug  3 15:06:32 knew kernel: nvme0: READ sqid:2 cid:123 nsid:1 lba:250153507 len:5
> Aug  3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:123 cdw0:0
> Aug  3 15:06:32 knew kernel: nvme0: failing outstanding i/o
> Aug  3 15:06:32 knew kernel: nvme0: WRITE sqid:3 cid:118 nsid:1 lba:454009346 len:1
> Aug  3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:118 cdw0:0
> Aug  3 15:06:32 knew kernel: nvme0: failing outstanding i/o
> Aug  3 15:06:32 knew kernel: nvme0: WRITE sqid:4 cid:122 nsid:1 lba:454009345 len:1
> Aug  3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:122 cdw0:0
> Aug  3 15:06:32 knew kernel: nvd0: detached
>
The STATE peculiarity aside: if you have a spare to replace what's 
currently at nvd0, I'd put it in place.
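
A rough sketch of the sequence, assuming a pool named "tank" and a spare 
that attaches as nvd1 (both hypothetical — check zpool status for the real 
names). The commands are echoed rather than executed here, so the plan can 
be reviewed before running it for real:

```shell
#!/bin/sh
# Dry-run sketch of swapping in a spare. Pool name "tank" and device
# names nvd0/nvd1 are assumptions; verify against your own zpool status.
run() { echo "# $*"; }      # change the body to "$@" to execute for real

run zpool status tank                 # confirm which vdev is degraded
run zpool offline tank nvd0           # take the suspect device out of service
# ...physically swap in the spare, which (here) attaches as nvd1...
run zpool replace tank nvd0 nvd1      # resilver onto the spare
run zpool status tank                 # watch the resilver progress
```
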

Then stress test the removed stick, to tell whether it's good for reuse.

A normal run of StressDesk might be enough to expose a problem; I 
recently had a new drive (less than 100 hours' use) that failed 
consistently after around seven minutes of the run (before filling the 
UFS file system).
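
If nothing purpose-built is to hand, a crude write/read-back pass can at 
least catch gross failures. A minimal sketch — TARGET defaults to a 
scratch file, and /dev/nvd1 below is a hypothetical device name; this is 
destructive when pointed at a real device:

```shell
#!/bin/sh
# Minimal write/read-back check, not a substitute for a full stress run.
# TARGET defaults to a scratch file; point it at the stick under test
# (e.g. /dev/nvd1 — hypothetical name) only once it holds no needed data.
TARGET=${TARGET:-/tmp/stick-test.img}
PATTERN=/tmp/pattern.bin
COUNT=16                                # 16 MiB per pass

# Write a random pattern, then read it back and compare byte-for-byte.
dd if=/dev/urandom of="$PATTERN" bs=1048576 count="$COUNT" 2>/dev/null
dd if="$PATTERN" of="$TARGET" bs=1048576 2>/dev/null
if dd if="$TARGET" bs=1048576 count="$COUNT" 2>/dev/null | cmp -s - "$PATTERN"; then
    echo "pass: read-back matches"
else
    echo "FAIL: data mismatch"
fi
```

Repeating the pass for a few hours (varying the pattern each time) gives a 
better imitation of a sustained stress run.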



More information about the freebsd-questions mailing list