Re: Changes in cam/nvme causes issues?

From: Alexander Leidinger <Alexander_at_Leidinger.net>
Date: Tue, 23 Dec 2025 09:31:43 UTC
On 2025-12-22 17:58, Warner Losh wrote:

> On Sun, Dec 21, 2025 at 8:37 AM Alexander Leidinger 
> <Alexander@leidinger.net> wrote:
> 
> On 2025-12-14 14:05, Warner Losh wrote:
> 
> Let's do one issue at a time. There's too much missing info. Top
> posting since there's not a lot of context to this request.
> 
> The disk has now died completely, so the CRC errors can no longer be
> investigated.
> 
> First, let's start with pciconf -l of the nvme drive. I have a strong 
> idea, but need some data.
> 
> I already provided this privately along with some other data; here it
> is for the public, so that people are aware that there is currently an
> issue with such drives:
> nvme0@pci0:5:0:0: class=0x010802 rev=0x00 hdr=0x00 vendor=0x144d 
> device=0xa809 subvendor=0x144d subdevice=0xa801
> Samsung SSD 980 1TB 2B4QFXO7 S649NL0T819360V

Yeah, so far this is the only report I've received, and there's not 
enough data in it to reproduce it with any of the dozen NVMe drives that 
I have, or to spot a difference with what I know I check in the code. So 
if it's compiled into the kernel with cam also compiled into the kernel, 
I know it works.

CAM is in the kernel, nvme is loaded as a module (from 15-current):
---snip---
# kldstat | egrep '(nvm|cam)'
  2    1 0xffffffff811e3000    20db8 nvme.ko
---snip---
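
For anyone who wants to repeat this check, here is a minimal sketch. The
helper name ko_loaded is made up for illustration and is not part of
FreeBSD. It only looks at the last column of kldstat-style output, so a
driver compiled into the kernel (like CAM here) will not show up as a
separate .ko line. The sample line is copied from the output above; on a
live system you would pipe in kldstat itself.

```shell
#!/bin/sh
# ko_loaded: hypothetical helper, not a FreeBSD tool.
# Reads kldstat-style output on stdin and succeeds (exit 0) if the
# named driver appears as a loaded .ko file in the last column.
ko_loaded() {
    awk -v want="$1.ko" '$NF == want { found = 1 } END { exit !found }'
}

# Sample line taken from the kldstat output quoted above.
sample='  2    1 0xffffffff811e3000    20db8 nvme.ko'

for drv in nvme cam; do
    if printf '%s\n' "$sample" | ko_loaded "$drv"; then
        echo "$drv: loaded as module"
    else
        echo "$drv: no .ko listed"
    fi
done
```

A missing .ko line alone does not say whether a driver is compiled in or
simply absent, so this should be read together with the kernel config.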

I will do a clean rebuild with the most recent 16-current and provide a 
full dmesg if this still doesn't work.

Bye,
Alexander.

> Warner
> 
> Bye,
> Alexander.
> 
> Also, the disk report needs full logs with and without the settings 
> that have uncorrectable in them. I'd expect that a shorter timeout 
> would lead to different behavior, but maybe that error syndrome isn't 
> one I've seen. It would also be helpful to know which of the times 
> changes the behavior...
> 
> Warner
> 
> On Sun, Dec 14, 2025, 5:06 AM Alexander Leidinger 
> <Alexander@leidinger.net> wrote: Hi Warner,
> 
> I am trying to update a 15-current (as of 2025-11-27-110715) to a
> recent 16 (as of 2025-12-13-132815). It fails to import a pool due to
> a missing nvme. I also have a broken HD in this system; I mention it
> to be on the safe side.
> 
> This is from 15-current:
> ---snip---
> NAME                               STATE     READ WRITE CKSUM
> rpool                              DEGRADED     0     0     0
>   mirror-0                         DEGRADED     0     0     0
>     diskid/DISK-WD-WCC4N4KLEZT7p3  ONLINE       0     0     0
>     diskid/DISK-WD-WCC4N1DF9DA2p3  ONLINE       0     0     0
>     diskid/DISK-WD-WX52D625R0NTp3  ONLINE       0     0     0
>     diskid/DISK-WD-WCC4N1PYJ3F8p3  OFFLINE      0     0     0
> logs
>   diskid/DISK-493504058890547p1    ONLINE       0     0     0
> cache
>   diskid/DISK-493504058890547p2    ONLINE       0     0     0
> 
> NAME                               STATE     READ WRITE CKSUM
> space                              DEGRADED     0     0     0
>   raidz2-0                         DEGRADED     0     0     0
>     diskid/DISK-WD-WCC4N4KLEZT7p4  ONLINE       0     0     0
>     diskid/DISK-WD-WCC4N1DF9DA2p4  ONLINE       0     0     0
>     diskid/DISK-WD-WX52D625R0NTp4  ONLINE       0     0     0
>     diskid/DISK-WD-WX52D625R2TPp4  ONLINE       0     0     0
>     diskid/DISK-WD-WCC4N1PYJ3F8p4  OFFLINE      0     0     0
> logs
>   diskid/DISK-S649NL0T819360Vp2    ONLINE       0     0     0
> cache
>   diskid/DISK-S649NL0T819360Vp3    ONLINE       0     0     0
> ---snip---
> 
> The partitions marked offline are on the same HD (the broken one). The
> DISK-S649NL0T819360V device, used as log and cache in the second pool,
> causes the issue on 16-current.
> 
> On 16-current I get "uncorrectable parity/CRC error" messages on boot
> from the broken disk. I used this to get rid of those errors:
> ---snip---
> # grep kern.cam /tmp/be_mount.MhLw/boot/loader.conf
> kern.cam.tur_timeout="60"
> kern.cam.inquiry_timeout="60"
> kern.cam.modesense_timeout="60"
> ---snip---
> 
> But the second pool ("space") fails to get imported. When I import it
> via "zpool import -m space" it shows me that the log and cache devices
> (different partitions on the same hardware) are not available.
> This is the device in question as seen from 15-current:
> ---snip---
> nda0: <Samsung SSD 980 1TB 2B4QFXO7 S649NL0T819360V>
> nda0: Serial Number S649NL0T819360V
> [1] nda0: nvme version 1.4
> nda0: 953869MB (1953525168 512 byte sectors)
> [1] GEOM: new disk nda0
> ...
> [1] pass6 at nvme0 bus 0 scbus6 target 0 lun 1
> pass6: <Samsung SSD 980 1TB 2B4QFXO7 S649NL0T819360V>
> pass6: Serial Number S649NL0T819360V
> [1] pass6: nvme version 1.4
> ---snip---
> 
> In case you need some info from the 15- or 16-current BE, which info do
> you need?
> 
> Bye,
> Alexander.
> 
> --
> http://www.Leidinger.net Alexander@Leidinger.net: PGP 
> 0x8F31830F9F2772BF
> http://www.FreeBSD.org    netchild@FreeBSD.org  : PGP 
> 0x8F31830F9F2772BF

-- 
http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org    netchild@FreeBSD.org  : PGP 0x8F31830F9F2772BF
