Re: nvme(4) losing control, and subsequent use of fsck_ffs(8) with UFS

In reply to: Warner Losh : "Re: nvme(4) losing control, and subsequent use of fsck_ffs(8) with UFS"
From: Zaphod Beeblebrox <zbeeble_at_gmail.com>
Date: Sat, 17 Jul 2021 18:30:41 UTC
One thing, that I'm sure the developers know, but that might be
underappreciated at the user level:

These things are little computers ... with their own little operating
systems and as such, their own little bugs.  This means that the quality
can swing very wildly between different examples of cheap dodgy hardware.
I mean... it's also true that disk drives have been run by microcontrollers
for 20-odd years (or more), but the sheer number of vendors who can
contract with PCBwaaaay (favorite YouTuber in my head, sorry) and get an NVMe
drive completely manufactured and onto Amazon is somewhat unprecedented.

Dodgy hardware doesn't need a factory anymore.

On Sat, Jul 17, 2021 at 11:48 AM Warner Losh <imp@bsdimp.com> wrote:

> On Sat, Jul 17, 2021 at 6:33 AM Graham Perrin <grahamperrin@gmail.com>
> wrote:
>
> > When the file system is stress-tested, it seems that the device (an
> > internal drive) is lost.
> >
>
> This is most likely a drive problem. Netflix pushes half a dozen different
> lower-end models of NVMe drives to their physical limits w/o seeing issues
> like this.
>
> That said, our screening process screens out several low-quality drives
> that just lose their minds from time to time.
>
>
> > A recent photograph:
> >
> > <https://photos.app.goo.gl/wB7gZKLF5PQzusrz7>
> >
> > Transcribed manually:
> >
> > nvme0: Resetting controller due to a timeout.
> > nvme0: resetting controller
> > nvme0: controller ready did not become 0 within 5500 ms
> >
>
> Here the controller failed hard. We were unable to reset it within 5
> seconds. One might be able to tweak the timeouts to cope with the drive
> better. Do you have to power cycle to get it to respond again?
>
>
> > nvme0: failing outstanding i/o
> > nvme0: WRITE sqid:2 cid:115 nsid:1 lba:296178856 len:64
> > nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:115 cdw0:0
> > g_vfs_done():nvd0p2[WRITE(offset=151370924032, length=32768)]error = 6
> > UFS: forcibly unmounting /dev/nvd0p2 from /
> > nvme0: failing outstanding i/o
> >
> > … et cetera.
> >
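As a rough cross-check of the transcribed log (a sketch, not something from
the thread itself): the NVMe-level WRITE and the GEOM-level g_vfs_done()
error appear to describe the same request. The snippet below assumes
512-byte logical blocks, that "len" counts blocks, and that the GEOM offset
is relative to the start of nvd0p2; the function names are made up for
illustration.

```python
# Assumption: the namespace uses 512-byte logical blocks.
SECTOR_SIZE = 512


def nvme_len_bytes(nblocks, sector_size=SECTOR_SIZE):
    """Convert the block count from the nvme(4) log line into bytes."""
    return nblocks * sector_size


def partition_start_lba(device_lba, partition_offset_bytes,
                        sector_size=SECTOR_SIZE):
    """Infer the partition's first LBA from a device-absolute LBA and the
    partition-relative byte offset reported for the same request."""
    if partition_offset_bytes % sector_size:
        raise ValueError("offset is not sector-aligned")
    return device_lba - partition_offset_bytes // sector_size


# Values copied from the transcribed log.
lba, nblocks = 296178856, 64          # nvme0: WRITE ... lba:... len:64
offset, length = 151370924032, 32768  # g_vfs_done(): offset=..., length=...

# 64 blocks * 512 bytes = 32768 bytes, matching the GEOM length.
assert nvme_len_bytes(nblocks) == length

print(partition_start_lba(lba, offset))  # 532520
```

If those assumptions hold, the two lines agree on the size of the request,
and nvd0p2 would begin at LBA 532520, which can be compared against the
output of "gpart show nvd0".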
> > Is this a sure sign of a hardware problem? Or must I do something
> > special to gain reliability under stress?
> >
>
> It's most likely a hardware problem. That said, I've been working on
> patches to make the recovery better when errors like this happen.
>
>
> > I don't know how to interpret parts of the manual page for nvme(4). There's
> > direction to include this line in loader.conf(5):
> >
> > nvme_load="YES"
> >
> > – however when I used kldload(8), it seemed that the module was already
> > loaded, or in kernel.
> >
>
> Yes. If you are using it at all, you have the driver.
>
>
> > Using StressDisk:
> >
> > <https://github.com/ncw/stressdisk>
> >
> > – failures typically occur after around six minutes of testing.
> >
>
> Do you have a number of these drives, or is it just this one bad apple?
>
>
> > The drive is very new, less than 2 TB written:
> >
> > <https://bsd-hardware.info/?probe=7138e2a9e7&log=smartctl>
> >
> > I do suspect a hardware problem, because two prior installations of
> > Windows 10 became non-bootable.
> >
>
> That's likely a huge red flag.
>
>
> > Also: I find peculiarities with use of fsck_ffs(8), which I can describe
> > later. Maybe to be expected, if there's a problem with the drive.
> >
>
> You can ask Kirk, but if data isn't written to the drive when the firmware
> crashes, then there may be data loss.
>
> Warner
>