From: Zaphod Beeblebrox
Date: Sat, 17 Jul 2021 14:30:41 -0400
Subject: Re: nvme(4) losing control, and subsequent use of fsck_ffs(8) with UFS
To: Warner Losh
Cc: Graham Perrin, Current FreeBSD
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current

One thing that I'm sure the developers know, but that might be
underappreciated at the user level: these things are little computers ...
with their own little operating systems and, as such, their own little bugs.
This means that the quality can swing very wildly between different examples
of cheap dodgy hardware.

I mean... it's also true that disk drives have been run by microcontrollers
for 20-odd years (or more), but the sheer number of vendors who can contract
with PCBwaaaay (favorite YouTuber in my head, sorry) and get an NVMe drive
completely manufactured and onto Amazon is somewhat unprecedented. Dodgy
hardware doesn't need a factory anymore.

On Sat, Jul 17, 2021 at 11:48 AM Warner Losh wrote:

> On Sat, Jul 17, 2021 at 6:33 AM Graham Perrin wrote:
>
> > When the file system is stress-tested, it seems that the device (an
> > internal drive) is lost.
>
> This is most likely a drive problem. Netflix pushes half a dozen different
> lower-end models of NVMe drives to their physical limits w/o seeing issues
> like this.
>
> That said, our screening process screens out several low-quality drives
> that just lose their minds from time to time.
>
> > A recent photograph:
> >
> > Transcribed manually:
> >
> > nvme0: Resetting controller due to a timeout.
> > nvme0: resetting controller
> > nvme0: controller ready did not become 0 within 5500 ms
>
> Here the controller failed hard. We were unable to reset it within 5
> seconds. One might be able to tweak the timeouts to cope with the drive
> better. Do you have to power cycle to get it to respond again?
>
> > nvme0: failing outstanding i/o
> > nvme0: WRITE sqid:2 cid:115 nsid:1 lba:296178856 len:64
> > nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:115 cdw0:0
> > g_vfs_done():nvd0p2[WRITE(offset=151370924032, length=32768)]error = 6
> > UFS: forcibly unmounting /dev/nvd0p2 from /
> > nvme0: failing outstanding i/o
> >
> > … et cetera.
> >
> > Is this a sure sign of a hardware problem? Or must I do something
> > special to gain reliability under stress?
>
> It's most likely a hardware problem. That said, I've been working on
> patches to make the recovery when errors like this happen better.
>
> > I don't know how to interpret parts of the manual page for nvme(4).
> > There's direction to include this line in loader.conf(5):
> >
> > nvme_load="YES"
> >
> > – however when I used kldload(8), it seemed that the module was already
> > loaded, or in kernel.
>
> Yes. If you are using it at all, you have the driver.
>
> > Using StressDisk:
> >
> > – failures typically occur after around six minutes of testing.
>
> Do you have a number of these drives, or is it just this one bad apple?
>
> > The drive is very new, less than 2 TB written:
> >
> > I do suspect a hardware problem, because two prior installations of
> > Windows 10 became non-bootable.
>
> That's likely a huge red flag.
>
> > Also: I find peculiarities with use of fsck_ffs(8), which I can describe
> > later. Maybe to be expected, if there's a problem with the drive.
>
> You can ask Kirk, but if data isn't written to the drive when the firmware
> crashes, then there may be data loss.
>
> Warner
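
The numbers in the quoted console output can be cross-checked against each
other. The following is only a rough sketch, not anything from the driver
itself. It assumes 512-byte logical blocks (the drive's actual LBA format is
not stated in the thread) and assumes that the g_vfs_done() offset is
relative to the start of nvd0p2. The variable names are made up for
illustration.

```python
import errno

SECTOR_BYTES = 512  # assumption: 512-byte logical blocks on this namespace

# Values copied from the quoted console output.
lba = 296178856             # nvme0: WRITE ... lba:296178856
len_blocks = 64             # nvme0: WRITE ... len:64
geom_offset = 151370924032  # g_vfs_done(): offset=151370924032
geom_length = 32768         # g_vfs_done(): length=32768
geom_error = 6              # g_vfs_done(): error = 6

# The NVMe command length in blocks should match the GEOM length in bytes.
assert len_blocks * SECTOR_BYTES == geom_length

# If the GEOM offset is partition-relative, the difference between the
# device byte address and that offset is where nvd0p2 would start.
partition_start_bytes = lba * SECTOR_BYTES - geom_offset
partition_start_sector = partition_start_bytes // SECTOR_BYTES

print(partition_start_sector)       # 532520
print(errno.errorcode[geom_error])  # ENXIO
```

Under those assumptions the two log lines describe the same 32 KiB write, and
error 6 is ENXIO, which fits a device that has gone away rather than a file
system fault. The implied partition start (sector 532520) is an inference
from the arithmetic only; gpart(8) output from the machine would be needed to
confirm it.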