Re: FYI: an example Optane Read failure on a HoneyComb (aarch64), main at e78dc78e517a

From: Warner Losh <imp_at_bsdimp.com>
Date: Fri, 03 Mar 2023 04:50:50 UTC
On Thu, Mar 2, 2023 at 9:39 PM Mark Millard <marklmi@yahoo.com> wrote:

> FYI: I got the following error:
>
> nvme0: RECOVERY_START 411856860627824 vs 411855426224359
>

Translation: We submitted transactions to the card and got a timeout waiting
for them to finish.
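
A rough way to read the two numbers in that message, assuming (this is not
stated above) that both are FreeBSD sbintime_t timestamps, i.e. 64-bit fixed
point with 32 fractional bits, presumably the current uptime and the
deadline of the request that timed out. The helper name below is
illustrative only:

```python
# Assumption: both values are sbintime_t, fixed point in units of
# 1/2**32 second. This interpretation is a guess from the magnitudes.
SBT_1S = 1 << 32


def sbt_to_seconds(sbt: int) -> float:
    """Convert a 32.32 fixed-point time value to floating-point seconds."""
    return sbt / SBT_1S


first = 411856860627824   # first value in the RECOVERY_START line
second = 411855426224359  # second value in the RECOVERY_START line

# How far apart the two timestamps are, and what the first one looks
# like when read as time since boot.
overshoot = sbt_to_seconds(first - second)
uptime_hours = sbt_to_seconds(first) / 3600

print(f"gap between the two values: {overshoot:.3f} s")
print(f"first value as uptime: {uptime_hours:.1f} h")
```

Under that assumption the two values are about a third of a second apart,
and the first corresponds to a bit over a day of uptime.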


> nvme0: Controller in fatal status, resetting
>

The controller status bit indicating a fatal failure was set.


> nvme0: Resetting controller due to a timeout and possible hot unplug.
>

We read 0xffffffff from the card, which often indicates that a power glitch
reset the card, but sometimes it's a bridge getting messed up so that
transactions no longer reach the address we think the card is at. Or
sometimes it's because the firmware in the card crashed. It's hard to
diagnose for sure, but one thing is for sure: the card is AFU and we can't
fix it.
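
The two checks described so far (fatal status bit set, register reading all
ones) can be sketched as below. The bit positions follow the NVMe CSTS
register layout (bit 0 RDY, bit 1 CFS); the constant and function names are
made up for illustration and this is not the driver's code:

```python
# Illustrative names; only the bit layout comes from the NVMe spec.
ALL_ONES = 0xFFFFFFFF  # what a 32-bit read returns when nothing answers
CSTS_RDY = 1 << 0      # controller ready
CSTS_CFS = 1 << 1      # controller fatal status


def classify_csts(csts: int) -> str:
    """Classify a raw 32-bit CSTS value read from the controller."""
    # All ones must be tested first: it also has the CFS bit set, but it
    # means the read never reached a working device.
    if csts == ALL_ONES:
        return "gone"
    if csts & CSTS_CFS:
        return "fatal"
    if csts & CSTS_RDY:
        return "ready"
    return "not ready"


for value in (0x00000001, 0x00000003, 0xFFFFFFFF):
    print(f"{value:#010x}: {classify_csts(value)}")
```

Either of the first two outcomes ends in a reset attempt; the all-ones case
is the one that cannot be told apart from a hot unplug.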


> nvme0: RECOVERY_WAITING
> nvme0: resetting controller
> nvme0: failing outstanding i/o
>

We reset the controller.


> nvme0: READ sqid:2 cid:5 nsid:1 lba:537405568 len:16
> nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:2 cid:5 cdw0:0
> (nda0:nvme0:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0
> cdw=20082880 0 f 0 0 0
> (nda0:nvme0:0:0:1): CAM status: CCB request completed with an error
> (nda0:nvme0:0:0:1): Error 5, Retries exhausted
> g_vfs_done():gpt/CA72optM2ufs[READ(offset=65536, length=8192)]enda0 at
> nvme0 bus 0 scbus4 target 0 lun 1
> nda0: rror = 5
>

We failed enough times that we gave up.
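
The numbers in that failed READ line up with each other. A small sketch of
the arithmetic, assuming 512-byte sectors (not stated in the log, but
consistent with the gpart sizes quoted further down) and taking the nda0p3
start from that same gpart output. The cdw values are read here as the hex
cdw10 and cdw12 of the NVMe read command, which is also an assumption about
the log format:

```python
SECTOR_SIZE = 512           # assumption: 512-byte logical blocks
PART_START_LBA = 537405440  # start of nda0p3 (CA72optM2ufs) per gpart

# From g_vfs_done(): READ(offset=65536, length=8192) on the partition.
byte_offset = 65536
byte_length = 8192

# Translate the partition-relative byte range to a whole-disk LBA range.
disk_lba = PART_START_LBA + byte_offset // SECTOR_SIZE
num_blocks = byte_length // SECTOR_SIZE

# From the NCB line: cdw=20082880 0 f 0 0 0 (hex).
cdw10 = int("20082880", 16)  # low 32 bits of the starting LBA
cdw12 = int("f", 16)         # block count, zero-based in its low 16 bits

print(f"disk lba from offset: {disk_lba}, blocks: {num_blocks}")
print(f"lba from cdw10: {cdw10}, blocks from cdw12: {(cdw12 & 0xFFFF) + 1}")
```

Both routes give lba 537405568 and 16 blocks, matching the
"READ sqid:2 cid:5 nsid:1 lba:537405568 len:16" line, so the read that
failed was 64 KiB into the UFS partition.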


> <INTEL SSDPE21D960GA E2010480 REDACTED>
>  s/n REDACTED detached
> (nda0:nvme0:0:0:1): Periph destroyed
> nvme0: waiting
>

The "waiting" message might indicate there's still a reference here, but
maybe there's not.

The one thing I don't do in recovery is try to power cycle the card. That's
now possible with decent APIs in the kernel, but I haven't had the need to
do it for our deployment.

Warner


> (After rebooting . . .)
>
> # gpart show -pl
> =>        40  1875384928    nda0  GPT  (894G)
>           40      532480  nda0p1  CA72optM2efi  (260M)
>       532520        2008          - free -  (1.0M)
>       534528    20971520  nda0p2  CA72optM2swp10  (10G)
>     21506048    29360128  nda0p4  CA72optM2swp14  (14G)
>     50866176    33554432  nda0p5  CA72optM2swp16  (16G)
>     84420608    67108864  nda0p6  CA72optM2swp32  (32G)
>    151529472   364904448  nda0p7  CA72optM2swp174  (174G)
>    516433920     7340032  nda0p8  RPi3swp3p5  (3.5G)
>    523773952    13631488          - free -  (6.5G)
>    537405440  1337979528  nda0p3  CA72optM2ufs  (638G)
>
> =>        40  1875384928    nda1  GPT  (894G)
>           40      532480  nda1p1  CA72opt0EFI  (260M)
>       532520        2008          - free -  (1.0M)
>       534528   515899392  nda1p2  CA72opt0SWP  (246G)
>    516433920    20971520          - free -  (10G)
>    537405440  1337979528  nda1p3  CA72opt0ZFS  (638G)
>
> nda0 is a U2 Optane used via a M2 adapter. The error is the
> first that I've seen for it. (nda1 is in the PCIe slot.)
>
> # uname -apKU
> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #88
> main-n261230-e78dc78e517a-dirty: Wed Mar  1 16:17:45 PST 2023
>  root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72
> arm64 aarch64 1400081 1400081
>
> ===
> Mark Millard
> marklmi at yahoo.com
>
>
>