nvme(4) losing control, and subsequent use of fsck_ffs(8) with UFS: clean-then-dirty

From: Graham Perrin <grahamperrin_at_gmail.com>
Date: Sat, 17 Jul 2021 19:12:33 UTC
On 17/07/2021 16:46, Warner Losh wrote:

> On Sat, Jul 17, 2021 at 6:33 AM Graham Perrin <grahamperrin@gmail.com> wrote:
>     When the file system is stress-tested, it seems that the device (an
>     internal drive) is lost.
> This is most likely a drive problem. Netflix pushes half a dozen different
> lower-end models of NVMe drives to their physical limits w/o seeing issues
> like this. That said, our screening process screens out several low-quality
> drives that just lose their minds from time to time.
>     A recent photograph:
>     <https://photos.app.goo.gl/wB7gZKLF5PQzusrz7>
>     Transcribed manually:
>     nvme0: Resetting controller due to a timeout.
>     nvme0: resetting controller
>     nvme0: controller ready did not become 0 within 5500 ms
> Here the controller failed hard. We were unable to reset it within 5
> seconds. One might be able to tweak the timeouts to cope with the drive
> better. Do you have to power cycle to get it to respond again?

More recently, I have been testing with FreeBSD 14.0-CURRENT installed to a
mobile hard disk drive, with the one partition of the NVMe drive used
entirely for test data:

* the NVMe drive is not found following a restart of FreeBSD

* the NVMe drive is found when (for example) I key F9 for HP's startup 
manager, and then I can boot (from the mobile HDD) and FreeBSD does find 
the drive again.

>     nvme0: failing outstanding i/o
>     nvme0: WRITE sqid:2 cid:115 nsid:1 lba:296178856 len:64
>     nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:115 cdw0:0
>     g_vfs_done():nvd0p2[WRITE(offset=151370924032, length=32768)]error = 6
>     UFS: forcibly unmounting /dev/nvd0p2 from /
>     nvme0: failing outstanding i/o
>     … et cetera.
>     Is this a sure sign of a hardware problem? Or must I do something
>     special to gain reliability under stress?
> It's most likely a hardware problem. That said, I've been working on
> patches to make the recovery better when errors like this happen.

Smart. Thanks.
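
For anyone reading the transcribed log, here is a minimal Python sketch that cross-checks the two lines describing the failed write. It rests on assumptions that are not confirmed anywhere in this thread: that the namespace uses 512-byte logical blocks, that `len:` in the nvme(4) line counts logical blocks, and that both lines describe the same I/O. The `decode` helper and its field names are made up for illustration.

```python
import errno
import re

SECTOR_BYTES = 512  # assumption: 512-byte logical blocks on this namespace

NVME_CMD = re.compile(r"lba:(\d+) len:(\d+)")
GEOM_ERR = re.compile(r"offset=(\d+), length=(\d+)\)\]\s*error = (\d+)")


def decode(nvme_line, geom_line, sector_bytes=SECTOR_BYTES):
    """Cross-check one nvme(4) command line against one g_vfs_done() line."""
    lba, nblocks = map(int, NVME_CMD.search(nvme_line).groups())
    offset, length, err = map(int, GEOM_ERR.search(geom_line).groups())
    return {
        # Size of the NVMe command in bytes, if len counts logical blocks.
        "command_bytes": nblocks * sector_bytes,
        # Size that GEOM reported for the failed write.
        "geom_bytes": length,
        # Device LBA minus partition-relative LBA: where nvd0p2 would start.
        "partition_start_lba": lba - offset // sector_bytes,
        # Symbolic name of the error number, as known to this platform.
        "errno_name": errno.errorcode.get(err, "unknown"),
    }


nvme_line = "nvme0: WRITE sqid:2 cid:115 nsid:1 lba:296178856 len:64"
geom_line = ("g_vfs_done():nvd0p2[WRITE(offset=151370924032, "
             "length=32768)]error = 6")
info = decode(nvme_line, geom_line)
print(info)
```

Under those assumptions the sizes agree (64 blocks of 512 bytes is 32768 bytes), the implied start of nvd0p2 is sector 532520 (something `gpart show` could confirm), and error 6 is ENXIO, which fits a device that has gone away.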

>     I don't know how to interpret parts of the manual page for nvme(4).
>     There's direction to include this line in loader.conf(5):
>     nvme_load="YES"
>     – however when I used kldload(8), it seemed that the module was
>     already loaded, or in kernel.
> Yes. If you are using it at all, you have the driver.
>     Using StressDisk:
>     <https://github.com/ncw/stressdisk>
>     – failures typically occur after around six minutes of testing.
> Do you have a number of these drives, or is it just this one bad apple?
>     The drive is very new, less than 2 TB written:
>     <https://bsd-hardware.info/?probe=7138e2a9e7&log=smartctl>
>     I do suspect a hardware problem, because two prior installations of
>     Windows 10 became non-bootable.
> That's likely a huge red flag.

The computer (not mine) will be in my hands for the next thirty-six 
hours or so. Then it will be seen by the assigned hardware specialist, 
who will decide how to proceed. Whether it will be taken away for a 
bench test diagnosis, I don't know. In due course I'll follow up, to the 
list, with a final outcome.

>     Also: I find peculiarities with use of fsck_ffs(8), which I can
>     describe
>     later. Maybe to be expected, if there's a problem with the drive.
> You can ask Kirk, but if data isn't written to the drive when the firmware
> crashes, then there may be data loss.
> Warner

Blind cc Kirk on this occasion.

Re: the attached typescript file, a first run of fsck performed repairs 
and marked the file system clean. A subsequent run performed repairs and 
marked the file system dirty.

I understand that with a probable hardware problem, all bets are off :-) 
but still:

* clean-then-dirty raises an eyebrow.
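
To make the clean-then-dirty sequence easy to spot in a long typescript, a rough Python sketch follows. It assumes that fsck_ffs(8) prints banners of the form `***** FILE SYSTEM MARKED CLEAN *****` and `***** FILE SYSTEM MARKED DIRTY *****`; check the wording against the real output before relying on it. The helper names are hypothetical and the sample text is invented for illustration; it is not the attached typescript.

```python
import re

# Assumption: banner wording as printed by fsck_ffs(8); verify against real output.
MARKER = re.compile(
    r"\*{5} FILE SYSTEM (MARKED CLEAN|MARKED DIRTY|STILL DIRTY|WAS MODIFIED) \*{5}"
)


def fsck_verdicts(text):
    """Return the clean/dirty banners in the order they appear."""
    return [m.group(1) for m in MARKER.finditer(text)
            if m.group(1) != "WAS MODIFIED"]


def clean_then_dirty(verdicts):
    """True if a dirty verdict appears after an earlier clean verdict."""
    seen_clean = False
    for verdict in verdicts:
        if verdict == "MARKED CLEAN":
            seen_clean = True
        elif seen_clean and verdict.endswith("DIRTY"):
            return True
    return False


# Invented sample standing in for two consecutive fsck runs.
sample = """\
** Phase 5 - Check Cyl groups
***** FILE SYSTEM MARKED CLEAN *****
***** FILE SYSTEM WAS MODIFIED *****
** Phase 5 - Check Cyl groups
***** FILE SYSTEM MARKED DIRTY *****
"""
verdicts = fsck_verdicts(sample)
print(verdicts, clean_then_dirty(verdicts))
```

With a real typescript, the text could be read from the file and passed to `fsck_verdicts` in the same way.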

The version.txt file (Thursday 2021-07-15 16:12:28 BST) relates to a 
disk image that was provided to me, from which I performed the 
installation of FreeBSD that I'm currently using to test. NB the patch 
at the time.

Thanks all