Re: FYI: an example Optane Read failure on a HoneyComb (aarch64), main at e78dc78e517a
- In reply to: Mark Millard : "FYI: an example Optane Read failure on a HoneyComb (aarch64), main at e78dc78e517a"
Date: Fri, 03 Mar 2023 04:50:50 UTC
On Thu, Mar 2, 2023 at 9:39 PM Mark Millard <marklmi@yahoo.com> wrote:

> FYI: I got the following error:
>
> nvme0: RECOVERY_START 411856860627824 vs 411855426224359

Translation: We submitted transactions to the card and got a timeout
waiting for them to finish.

> nvme0: Controller in fatal status, resetting

The controller status bit indicating failure was set.

> nvme0: Resetting controller due to a timeout and possible hot unplug.

We read 0xffffffff from the card, which often indicates that a power
glitch reset the card, but sometimes it's a bridge getting messed up so
that the address we think the card is at can no longer receive
transactions. Or sometimes it's because the firmware in the card
crashed. It's hard to diagnose for sure, but one thing is certain: the
card is AFU and we can't fix it.

> nvme0: RECOVERY_WAITING
> nvme0: resetting controller
> nvme0: failing outstanding i/o

We reset the controller.

> nvme0: READ sqid:2 cid:5 nsid:1 lba:537405568 len:16
> nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:2 cid:5 cdw0:0
> (nda0:nvme0:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0
> cdw=20082880 0 f 0 0 0
> (nda0:nvme0:0:0:1): CAM status: CCB request completed with an error
> (nda0:nvme0:0:0:1): Error 5, Retries exhausted
> g_vfs_done():gpt/CA72optM2ufs[READ(offset=65536, length=8192)]enda0 at
> nvme0 bus 0 scbus4 target 0 lun 1
> nda0: rror = 5

We failed enough times that we gave up.

> <INTEL SSDPE21D960GA E2010480 REDACTED>
> s/n REDACTED detached
> (nda0:nvme0:0:0:1): Periph destroyed
> nvme0: waiting

This might indicate there's still a reference here, but maybe there
isn't.

The one thing I don't do in recovery is try to power cycle the card.
That's now possible with decent APIs in the kernel, but I haven't had
the need to do it for our deployment.

Warner

> (After rebooting . . .)
>
> # gpart show -pl
> =>        40  1875384928    nda0  GPT  (894G)
>           40      532480  nda0p1  CA72optM2efi  (260M)
>       532520        2008          - free -  (1.0M)
>       534528    20971520  nda0p2  CA72optM2swp10  (10G)
>     21506048    29360128  nda0p4  CA72optM2swp14  (14G)
>     50866176    33554432  nda0p5  CA72optM2swp16  (16G)
>     84420608    67108864  nda0p6  CA72optM2swp32  (32G)
>    151529472   364904448  nda0p7  CA72optM2swp174  (174G)
>    516433920     7340032  nda0p8  RPi3swp3p5  (3.5G)
>    523773952    13631488          - free -  (6.5G)
>    537405440  1337979528  nda0p3  CA72optM2ufs  (638G)
>
> =>        40  1875384928    nda1  GPT  (894G)
>           40      532480  nda1p1  CA72opt0EFI  (260M)
>       532520        2008          - free -  (1.0M)
>       534528   515899392  nda1p2  CA72opt0SWP  (246G)
>    516433920    20971520          - free -  (10G)
>    537405440  1337979528  nda1p3  CA72opt0ZFS  (638G)
>
> nda0 is a U2 Optane used via a M2 adapter. The error is the
> first that I've seen for it. (nda1 is in the PCIe slot.)
>
> # uname -apKU
> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #88
> main-n261230-e78dc78e517a-dirty: Wed Mar 1 16:17:45 PST 2023
> root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72
> arm64 aarch64 1400081 1400081
>
> ===
> Mark Millard
> marklmi at yahoo.com
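
For anyone reading along, the numbers in the quoted log can be cross-checked
against the gpart output. The sketch below is only an illustration, not
anything from the driver. It assumes 512-byte logical sectors (not stated in
the thread) and that the six values after "cdw=" are CDW10 through CDW15 of
the NVMe READ command, with CDW10 holding the low 32 bits of the starting LBA
and the low 16 bits of CDW12 holding the zero-based block count. The helper
name locate() is made up for this example.

```python
# Cross-check the failed READ from the log against the nda0 partition table.
# Assumption: 512-byte logical sectors (not stated in the thread).
SECTOR_SIZE = 512

# Values copied from the quoted log and gpart output.
failing_lba = 537405568         # "lba:537405568 len:16"
failing_len_blocks = 16
cdw10 = int("20082880", 16)     # first value after "cdw=" (assumed CDW10)
cdw12 = int("f", 16)            # third value after "cdw=" (assumed CDW12)
partition_start = 537405440     # nda0p3, label CA72optM2ufs
partition_sectors = 1337979528
vfs_offset_bytes = 65536        # g_vfs_done ... READ(offset=65536, ...)
vfs_length_bytes = 8192         # g_vfs_done ... length=8192


def locate(lba, start, sectors, sector_size=SECTOR_SIZE):
    """Return the byte offset of lba inside the partition, or None if outside."""
    if not (start <= lba < start + sectors):
        return None
    return (lba - start) * sector_size


offset_in_partition = locate(failing_lba, partition_start, partition_sectors)
blocks_from_cdw12 = (cdw12 & 0xFFFF) + 1   # NVMe block count is zero-based

print("LBA from CDW10 matches log:", cdw10 == failing_lba)
print("byte offset inside nda0p3:", offset_in_partition)
print("matches g_vfs_done offset:", offset_in_partition == vfs_offset_bytes)
print("request length in bytes:", blocks_from_cdw12 * SECTOR_SIZE)
```

Under those assumptions the failed command is the same 8192-byte read that
g_vfs_done reports at offset 65536 of gpt/CA72optM2ufs, which is consistent
with the UFS partition on nda0 being the one that saw the error.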