svn commit: r292074 - in head/sys/dev: nvd nvme

Warner Losh imp at bsdimp.com
Fri Mar 11 16:31:50 UTC 2016


On Fri, Mar 11, 2016 at 9:24 AM, Warner Losh <imp at bsdimp.com> wrote:

>
>
> On Fri, Mar 11, 2016 at 9:15 AM, Alan Somers <asomers at freebsd.org> wrote:
>
>> Interesting.  I didn't know about the alternate meaning of stripesize.  I
>> agree then that there's currently no way to tune ZFS to respect NVMe's
>> 128KB boundaries.  One could set vfs.zfs.vdev.aggregation_limit to 128KB,
>> but that would only halfway solve the problem, because allocations could be
>> unaligned.  Frankly, I'm surprised that NVMe drives should have such a
>> small limit when SATA and SAS devices commonly handle single commands that
>> span multiple MB.  I don't think there's any way to adapt ZFS to this limit
>> without hurting it in other ways; for example by restricting its ability to
>> use large _or_ small record sizes.
>>
>> Hopefully the NVME slow path isn't _too_ slow.
>>
>
> Let's be clear here: this is purely an Intel controller issue, not an NVMe
> issue. Most other NVMe drives don't have any issues with this at all. At
> least for the drives I've been testing from well known NAND players (I'm
> unsure if they are released yet, so I can't name names, other than to say
> that they aren't OCZ). All these NVMe drives handle 1MB I/Os with
> approximately the same performance as 128k or 64k I/Os. The enterprise
> grade drives are quite fast and quite nice. It's the lower end, consumer
> drives that have more issues. Since those have been eliminated from our
> detailed consideration, I'm unsure if they have issues.
>
> And the Intel issue is a more subtle one, having more to do with PCIe burst
> sizes than with crossing the 128k boundary as such. I've asked my contacts
> inside Intel (who I don't think read these lists) for the exact details.
>

And keep in mind the original description was this:

Quote:

Intel NVMe controllers have a slow path for I/Os that span
a 128KB stripe boundary but ZFS limits ashift, which is derived
from d_stripesize, to 13 (8KB) so we limit the stripesize
reported to geom(8) to 4KB.

This may result in a small number of additional I/Os
to require splitting in nvme(4), however the NVMe I/O
path is very efficient so these additional I/Os will cause
very minimal (if any) difference in performance or
CPU utilisation.

unquote

so the issue seems to be getting blown up a bit. It's better if you
don't generate these I/Os, but the driver copes by splitting them
on the affected drives, at the cost of a small inefficiency: you're
issuing more commands to complete the same I/O, which cuts into the IOPS budget.
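
To make that cost concrete, here is a minimal C sketch (not the actual
nvme(4) splitting code; the constant and helper names are made up) of how
a transfer that crosses 128KB stripe boundaries turns into several
commands, which is where the extra IOPS cost comes from:

    #include <stdint.h>
    #include <stdio.h>

    #define STRIPE_SIZE (128 * 1024)  /* hypothetical 128KB controller boundary */

    /*
     * Count how many pieces a transfer is cut into if the driver splits it
     * at every 128KB boundary it crosses.  Offsets and lengths are in bytes
     * here for simplicity; the real driver works in LBAs.
     */
    static unsigned
    split_count(uint64_t offset, uint64_t length)
    {
            uint64_t first = offset / STRIPE_SIZE;
            uint64_t last = (offset + length - 1) / STRIPE_SIZE;

            return ((unsigned)(last - first + 1));
    }

    int
    main(void)
    {
            /* A 1MB write starting 4KB past a boundary becomes 9 commands. */
            printf("%u\n", split_count(4096, 1024 * 1024));
            /* A 64KB write aligned to a 128KB boundary stays in one piece. */
            printf("%u\n", split_count(0, 64 * 1024));
            return (0);
    }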

Warner


> Warner
>
>
>> On Fri, Mar 11, 2016 at 2:07 AM, Alexander Motin <mav at freebsd.org> wrote:
>>
>>> On 11.03.16 06:58, Alan Somers wrote:
>>> > Do they behave badly for writes that cross a 128KB boundary, but are
>>> > nonetheless aligned to 128KB boundaries?  Then I don't understand how
>>> > this change (or mav's replacement) is supposed to help.  The stripesize
>>> > is supposed to be the minimum write that the device can accept without
>>> > requiring a read-modify-write.  ZFS guarantees that it will never issue
>>> > a write smaller than the stripesize, nor will it ever issue a write that
>>> > is not aligned to a stripesize-boundary.  But even if ZFS worked with
>>> > 128KB stripesizes, it would still happily issue writes a multiple of
>>> > 128KB in size, and these would cross those boundaries.  Am I not
>>> > understanding something here?
>>>
>>> stripesize is not necessarily related to read-modify-write.  It reports
>>> "some" native boundary of the device.  For example, a RAID0 array has
>>> stripes, and crossing them does not cause read-modify-write cycles, but it
>>> does cause the I/O to be split and extra disks to seek.  This, as I
>>> understand it, is the case for some of Intel's NVMe device models here,
>>> and is the reason why a 128KB stripesize was originally reported.
>>>
>>> We cannot demand that all file systems never issue I/Os smaller than the
>>> stripesize, since it can be 128KB, 1MB or even more (if we could, it would
>>> simply be called sectorsize).  If ZFS (in this case) doesn't support
>>> allocation block sizes above 8K (and even that is very space-inefficient),
>>> and it has no other mechanism to optimize I/O alignment, then that is not
>>> a problem of the NVMe device or driver, but only of ZFS itself.  So what I
>>> have done here is move the workaround from the improper place (NVMe) to
>>> the proper one (ZFS): NVMe now correctly reports its native 128K
>>> boundaries, which will be respected, for example, by gpart, which in turn
>>> helps UFS align its 32K blocks, while ZFS will correctly ignore values it
>>> cannot optimize for, falling back to efficient 512-byte allocations.
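
For illustration only (this is not actual ZFS or nvme(4) code, and the
names are made up): a minimal C sketch of the clamping described above,
where a consumer derives its allocation shift from the reported
stripesize, ignores values it cannot optimize for (ZFS caps ashift at 13,
i.e. 8KB), and falls back to 512-byte allocations.

    #include <stdint.h>
    #include <stdio.h>

    #define MIN_ASHIFT 9    /* 512-byte allocations */
    #define MAX_ASHIFT 13   /* 8KB, the ZFS limit mentioned in the commit log */

    static int
    ashift_from_stripesize(uint64_t stripesize)
    {
            int shift = 0;

            if (stripesize == 0)
                    return (MIN_ASHIFT);
            /* floor(log2(stripesize)) */
            while (shift < 63 && ((uint64_t)1 << (shift + 1)) <= stripesize)
                    shift++;
            /* Ignore values outside the usable range, e.g. 128KB -> 512B. */
            if (shift < MIN_ASHIFT || shift > MAX_ASHIFT)
                    return (MIN_ASHIFT);
            return (shift);
    }

    int
    main(void)
    {
            printf("%d\n", ashift_from_stripesize(4 * 1024));    /* 12 */
            printf("%d\n", ashift_from_stripesize(128 * 1024));  /* 9  */
            return (0);
    }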
>>>
>>> PS about the meaning of stripesize not being limited to read-modify-write:
>>> for example, a RAID5 of 5 512e disks actually has three stripe sizes: 4K,
>>> 64K and 256K.  Aligned 4K writes avoid read-modify-write inside the drive,
>>> I/Os that don't cross 64K boundaries without reason improve parallel
>>> performance, and aligned 256K writes avoid read-modify-write at the RAID5
>>> level.  Obviously not all of those optimizations are achievable in every
>>> environment, and the bigger the stripe size the harder it is to optimize
>>> for, but that does not mean such optimization is impossible.  It would be
>>> good to be able to report all of them, allowing each consumer to use as
>>> many of them as it can.
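
As a rough illustration of that multi-boundary idea (a sketch only; the
boundary list and helpers below are made up, not code from any RAID
driver): a write can "respect" a reported boundary in two different
senses, either by being an aligned multiple of it (avoiding
read-modify-write at that level) or by staying inside a single stripe
(avoiding a split).

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static bool
    aligned_multiple(uint64_t off, uint64_t len, uint64_t b)
    {
            /* Aligned start and whole multiple: avoids RMW at this level. */
            return (off % b == 0 && len % b == 0);
    }

    static bool
    stays_inside(uint64_t off, uint64_t len, uint64_t b)
    {
            /* Entire transfer falls within one stripe: avoids a split. */
            return (off / b == (off + len - 1) / b);
    }

    int
    main(void)
    {
            static const uint64_t b[] = { 4096, 64 * 1024, 256 * 1024 };
            uint64_t off = 64 * 1024, len = 16 * 1024;  /* 16K write at 64K */

            for (size_t i = 0; i < sizeof(b) / sizeof(b[0]); i++)
                    printf("%7ju: aligned multiple=%d, single stripe=%d\n",
                        (uintmax_t)b[i], aligned_multiple(off, len, b[i]),
                        stays_inside(off, len, b[i]));
            return (0);
    }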
>>>
>>> > On Thu, Mar 10, 2016 at 9:34 PM, Warner Losh <imp at bsdimp.com> wrote:
>>> >
>>> >     Some Intel NVMe drives behave badly when the LBA range crosses a
>>> >     128k boundary.  Their performance is worse for those transactions
>>> >     than for ones that don't cross the 128k boundary.
>>> >
>>> >     Warner
>>> >
>>> >     On Thu, Mar 10, 2016 at 11:01 AM, Alan Somers <asomers at freebsd.org> wrote:
>>> >
>>> >         Are you saying that Intel NVMe controllers perform poorly for
>>> >         all I/Os that are less than 128KB, or just for I/Os of any size
>>> >         that cross a 128KB boundary?
>>> >
>>> >         On Thu, Dec 10, 2015 at 7:06 PM, Steven Hartland <smh at freebsd.org> wrote:
>>> >
>>> >             Author: smh
>>> >             Date: Fri Dec 11 02:06:03 2015
>>> >             New Revision: 292074
>>> >             URL: https://svnweb.freebsd.org/changeset/base/292074
>>> >
>>> >             Log:
>>> >               Limit stripesize reported from nvd(4) to 4K
>>> >
>>> >               Intel NVMe controllers have a slow path for I/Os that span
>>> >             a 128KB stripe boundary but ZFS limits ashift, which is
>>> >             derived from d_stripesize, to 13 (8KB) so we limit the
>>> >             stripesize reported to geom(8) to 4KB.
>>> >
>>> >               This may result in a small number of additional I/Os to
>>> >             require splitting in nvme(4), however the NVMe I/O path is
>>> >             very efficient so these additional I/Os will cause very
>>> >             minimal (if any) difference in performance or CPU utilisation.
>>> >
>>> >               This can be controlled by the new sysctl
>>> >             kern.nvme.max_optimal_sectorsize.
>>> >
>>> >               MFC after:    1 week
>>> >               Sponsored by: Multiplay
>>> >               Differential Revision:
>>> >             https://reviews.freebsd.org/D4446
>>> >
>>> >             Modified:
>>> >               head/sys/dev/nvd/nvd.c
>>> >               head/sys/dev/nvme/nvme.h
>>> >               head/sys/dev/nvme/nvme_ns.c
>>> >               head/sys/dev/nvme/nvme_sysctl.c
>>> >
>>> >
>>> >
>>>
>>>
>>> --
>>> Alexander Motin
>>>
>>
>>
>

