Re: nvd->nda switch and blocksize changes for ZFS

From: Frank Behrens <frank_at_harz2023.behrens.de>
Date: Mon, 25 Sep 2023 13:21:12 UTC
On 25.09.2023 at 13:58, Dimitry Andric wrote:
> # nvmecontrol identify nda0 and # nvmecontrol identify nvd0 (after 
> hw.nvme.use_nvd="1" and reboot) give the same result:
>> Number of LBA Formats:       1
>> Current LBA Format:          LBA Format #00
>> LBA Format #00: Data Size:   512  Metadata Size:     0  Performance: Best
>> ...
>> Optimal I/O Boundary:        0 blocks
>> NVM Capacity:                1000204886016 bytes
>> Preferred Write Granularity: 32 blocks
>> Preferred Write Alignment:   8 blocks
>> Preferred Deallocate Granul: 9600 blocks
>> Preferred Deallocate Align:  9600 blocks
>> Optimal Write Size:          256 blocks
> My guess is that the "Preferred Write Granularity" is the optimal size, in this case 32 'blocks' of 512 bytes, so 16 KiB. This also matches the stripe size reported by geom, as you showed.
>
> The "Preferred Write Alignment" is 8 * 512 = 4 KiB, so you should align partitions etc. to at least this. However, it cannot hurt to align everything to 16 KiB either, which is an integer multiple of 4 KiB.

Eugene gave me a tip, so I looked into the drivers.

dev/nvme/nvme_ns.c:
uint32_t
nvme_ns_get_stripesize(struct nvme_namespace *ns)
{
         uint32_t ss;

         if (((ns->data.nsfeat >> NVME_NS_DATA_NSFEAT_NPVALID_SHIFT) &
             NVME_NS_DATA_NSFEAT_NPVALID_MASK) != 0) {
                 ss = nvme_ns_get_sector_size(ns);
                 if (ns->data.npwa != 0)
                         return ((ns->data.npwa + 1) * ss);
                 else if (ns->data.npwg != 0)
                         return ((ns->data.npwg + 1) * ss);
         }
         return (ns->boundary);
}

cam/nvme/nvme_da.c:
         if (((nsd->nsfeat >> NVME_NS_DATA_NSFEAT_NPVALID_SHIFT) &
             NVME_NS_DATA_NSFEAT_NPVALID_MASK) != 0 && nsd->npwg != 0)
                 disk->d_stripesize = ((nsd->npwg + 1) *
                     disk->d_sectorsize);
         else
                 disk->d_stripesize = nsd->noiob * disk->d_sectorsize;

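So it seems that nvd uses "sector size * Preferred Write Alignment" as
the stripe size (falling back to the granularity only when npwa is
zero), while nda uses "sector size * Preferred Write Granularity". For
this drive that means 4 KiB with nvd and 16 KiB with nda.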

My current interpretation is that the value reported by the nvd driver
is not the best one for performance and reliability, so I should make a
backup and re-create the pool.
Maybe we should note in the 14.0 release notes that the switch to nda
is not a no-op.

-- 
Frank Behrens
Osterwieck, Germany