svn commit: r292074 - in head/sys/dev: nvd nvme

Warner Losh imp at bsdimp.com
Fri Mar 11 16:24:43 UTC 2016


On Fri, Mar 11, 2016 at 9:15 AM, Alan Somers <asomers at freebsd.org> wrote:

> Interesting.  I didn't know about the alternate meaning of stripesize.  I
> agree then that there's currently no way to tune ZFS to respect NVMe's
> 128KB boundaries.  One could set vfs.zfs.vdev.aggregation_limit to 128KB,
> but that would only halfway solve the problem, because allocations could
> be unaligned.  Frankly, I'm surprised that NVMe drives should have such a
> small limit when SATA and SAS devices commonly handle single commands that
> span multiple MB.  I don't think there's any way to adapt ZFS to this
> limit without hurting it in other ways, for example by restricting its
> ability to use large _or_ small record sizes.
>
> Hopefully the NVMe slow path isn't _too_ slow.
>
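
For concreteness, here is a minimal sketch (illustrative only, not the
driver's code) of the boundary condition being discussed: an I/O hits the
slow path when its first and last bytes land in different 128KiB windows,
which is why an aligned aggregation limit alone doesn't help if the
allocations themselves are unaligned.

    #include <stdbool.h>
    #include <stdint.h>

    #define NVME_BOUNDARY   (128 * 1024)    /* 128KiB, per the reports above */

    /* Does [offset, offset + length) cross a 128KiB-aligned boundary? */
    static bool
    crosses_128k(uint64_t offset, uint64_t length)
    {
            if (length == 0)
                    return (false);
            return ((offset / NVME_BOUNDARY) !=
                (offset + length - 1) / NVME_BOUNDARY);
    }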

Let's be clear here: this is purely an Intel controller issue, not an NVMe
issue. Most other NVMe drives don't have any issue with this at all, at
least not the ones I've been testing from well-known NAND players (I'm
unsure if they are released yet, so I can't name names, other than to say
that they aren't OCZ). All these NVMe drives handle 1MB I/Os with
approximately the same performance as 128k or 64k I/Os. The enterprise-grade
drives are quite fast and quite nice. It's the lower-end, consumer drives
that have more issues; since those have been eliminated from our detailed
consideration, I'm unsure whether they are affected.

And the Intel issue is a more subtle one, having to do more with PCIe burst
sizes than with crossing the 128k boundary as such. I've asked my contacts
inside Intel (who I don't think read these lists) for the exact details.

Warner


> On Fri, Mar 11, 2016 at 2:07 AM, Alexander Motin <mav at freebsd.org> wrote:
>
>> On 11.03.16 06:58, Alan Somers wrote:
>> > Do they behave badly for writes that cross a 128KB boundary, but are
>> > nonetheless aligned to 128KB boundaries?  Then I don't understand how
>> > this change (or mav's replacement) is supposed to help.  The stripesize
>> > is supposed to be the minimum write that the device can accept without
>> > requiring a read-modify-write.  ZFS guarantees that it will never issue
>> > a write smaller than the stripesize, nor will it ever issue a write that
>> > is not aligned to a stripesize-boundary.  But even if ZFS worked with
>> > 128KB stripesizes, it would still happily issue writes that are a
>> > multiple of 128KB in size, and these would cross those boundaries.  Am I
>> > not understanding something here?
>>
>> stripesize is not necessarily related to read-modify-write.  It reports
>> "some" native boundary of the device.  For example, a RAID0 array has
>> stripes; crossing them does not cause read-modify-write cycles, but it
>> does cause I/O splits and head seeks on additional disks.  This, as I
>> understand it, is the case for some of Intel's NVMe device models here,
>> and is the reason why a 128KB stripesize was originally reported.
>>
>> We cannot demand that all file systems never issue I/Os smaller than the
>> stripesize, since it can be 128KB, 1MB or even more (if we could, it
>> would be called sectorsize).  If ZFS (in this case) doesn't support
>> allocation block sizes above 8K (and even that is very
>> space-inefficient), and it has no other mechanism to optimize I/O
>> alignment, then that is not a problem of the NVMe device or driver, but
>> only of ZFS itself.  So what I have done here is move the workaround from
>> the improper place (NVMe) to the proper one (ZFS): NVMe now correctly
>> reports its native 128K boundaries, which will be respected, for example,
>> by gpart, which in turn helps UFS align its 32K blocks, while ZFS will
>> correctly ignore values it can't optimize for, falling back to efficient
>> 512-byte allocations.
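
(To make the consumer side concrete, here is a rough sketch, not ZFS's
actual code, of how a consumer with a maximum supported alignment shift
might treat a reported stripesize it cannot honor. It simply ignores it
and keeps the sector-derived alignment, which is the 512-byte fallback
described above.)

    #include <stdint.h>

    /*
     * Rough sketch, not ZFS code: pick an alignment shift from the
     * reported sectorsize and stripesize, ignoring any stripesize
     * larger than the consumer's limit (e.g. 2^13 = 8K for ZFS ashift).
     */
    static int
    choose_ashift(uint64_t sectorsize, uint64_t stripesize, int max_shift)
    {
            int shift = 0, s = 0;

            while ((1ULL << shift) < sectorsize)
                    shift++;                        /* e.g. 512 -> 9 */
            if (stripesize > sectorsize &&
                (stripesize & (stripesize - 1)) == 0) {
                    while ((1ULL << s) < stripesize)
                            s++;
                    if (s <= max_shift)             /* 4K fits, 128K does not */
                            shift = s;
            }
            return (shift);
    }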
>>
>> PS, about the meaning of stripesize not being limited to read-modify-write:
>> for example, a RAID5 array of five 512e disks actually has three stripe
>> sizes: 4K, 64K and 256K.  Aligned 4K writes avoid read-modify-write inside
>> the drive, I/Os that don't cross 64K boundaries without reason improve
>> parallel performance, and aligned 256K writes avoid read-modify-write at
>> the RAID5 level.  Obviously not all of those optimizations are achievable
>> in all environments, and the bigger the stripe size, the harder it is to
>> optimize for it, but that does not mean such optimization is impossible.
>> It would be good to be able to report all of them, allowing each consumer
>> to use as many of them as it can.
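
(As an aside, a small sketch with hypothetical helper names, distinguishing
the two ways a write can "respect" one of those advertised sizes: full
alignment is what avoids read-modify-write, while merely staying inside one
stripe is enough to avoid splitting the I/O across disks.)

    #include <stdbool.h>
    #include <stdint.h>

    /* Aligned start and whole-multiple length: avoids read-modify-write. */
    static bool
    fully_aligned(uint64_t offset, uint64_t length, uint64_t boundary)
    {
            return ((offset % boundary) == 0 && (length % boundary) == 0);
    }

    /* Stays inside one stripe: avoids splitting the I/O across disks. */
    static bool
    stays_within(uint64_t offset, uint64_t length, uint64_t boundary)
    {
            return (length == 0 ||
                (offset / boundary) == (offset + length - 1) / boundary);
    }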
>>
>> > On Thu, Mar 10, 2016 at 9:34 PM, Warner Losh <imp at bsdimp.com
>> > <mailto:imp at bsdimp.com>> wrote:
>> >
>> >     Some Intel NVMe drives behave badly when the LBA range crosses a
>> >     128k boundary. Their
>> >     performance is worse for those transactions than for ones that don't
>> >     cross the 128k boundary.
>> >
>> >     Warner
>> >
>> >     On Thu, Mar 10, 2016 at 11:01 AM, Alan Somers <asomers at freebsd.org
>> >     <mailto:asomers at freebsd.org>> wrote:
>> >
>> >         Are you saying that Intel NVMe controllers perform poorly for
>> >         all I/Os that are less than 128KB, or just for I/Os of any size
>> >         that cross a 128KB boundary?
>> >
>> >         On Thu, Dec 10, 2015 at 7:06 PM, Steven Hartland
>> >         <smh at freebsd.org <mailto:smh at freebsd.org>> wrote:
>> >
>> >             Author: smh
>> >             Date: Fri Dec 11 02:06:03 2015
>> >             New Revision: 292074
>> >             URL: https://svnweb.freebsd.org/changeset/base/292074
>> >
>> >             Log:
>> >               Limit stripesize reported from nvd(4) to 4K
>> >
>> >               Intel NVMe controllers have a slow path for I/Os that span
>> >             a 128KB stripe boundary, but ZFS limits ashift, which is
>> >             derived from d_stripesize, to 13 (8KB), so we limit the
>> >             stripesize reported to geom(8) to 4KB.
>> >
>> >               This may result in a small number of additional I/Os
>> >             requiring splitting in nvme(4); however, the NVMe I/O path
>> >             is very efficient, so these additional I/Os will cause
>> >             minimal (if any) difference in performance or CPU
>> >             utilisation.
>> >
>> >               This can be controlled by the new sysctl
>> >             kern.nvme.max_optimal_sectorsize.
>> >
>> >               MFC after:    1 week
>> >               Sponsored by: Multiplay
>> >               Differential Revision:
>> >             https://reviews.freebsd.org/D4446
>> >
>> >             Modified:
>> >               head/sys/dev/nvd/nvd.c
>> >               head/sys/dev/nvme/nvme.h
>> >               head/sys/dev/nvme/nvme_ns.c
>> >               head/sys/dev/nvme/nvme_sysctl.c
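
(For reference, an illustrative sketch of the shape of the change described
in the log above; this is not the actual diff and the helper name is made
up. The boundary-derived stripesize is clamped by the tunable before being
reported.)

    #include <stdint.h>

    /* Sketch only: clamp the reported stripesize to the tunable's value. */
    static uint64_t
    reported_stripesize(uint64_t boundary, uint64_t max_optimal_sectorsize)
    {
            if (max_optimal_sectorsize != 0 &&
                boundary > max_optimal_sectorsize)
                    return (max_optimal_sectorsize);    /* e.g. 128K -> 4K */
            return (boundary);
    }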
>> >
>> >
>> >
>>
>>
>> --
>> Alexander Motin
>>
>
>

