svn commit: r292074 - in head/sys/dev: nvd nvme
Warner Losh
imp at bsdimp.com
Fri Mar 11 16:31:50 UTC 2016
On Fri, Mar 11, 2016 at 9:24 AM, Warner Losh <imp at bsdimp.com> wrote:
>
>
> On Fri, Mar 11, 2016 at 9:15 AM, Alan Somers <asomers at freebsd.org> wrote:
>
>> Interesting. I didn't know about the alternate meaning of stripesize. I
>> agree then that there's currently no way to tune ZFS to respect NVMe's
>> 128KB boundaries. One could set vfs.zfs.vdev.aggregation_limit to 128KB,
>> but that would only halfway solve the problem, because allocations could be
>> unaligned. Frankly, I'm surprised that NVMe drives should have such a
>> small limit when SATA and SAS devices commonly handle single commands that
>> span multiple MB. I don't think there's any way to adapt ZFS to this limit
>> without hurting it in other ways; for example, by restricting its ability to
>> use large _or_ small record sizes.
>>
>> Hopefully the NVMe slow path isn't _too_ slow.
>>
>
> Let's be clear here: this is purely an Intel controller issue, not an NVMe
> issue. Most other NVMe drives don't have any issues with this at all. At
> least not the drives I've been testing from well known NAND players (I'm
> unsure if they are released yet, so I can't name names, other than to say
> that they aren't OCZ). All these NVMe drives handle 1MB I/Os with
> approximately the same performance as 128k or 64k I/Os. The enterprise
> grade drives are quite fast and quite nice. It's the lower end, consumer
> drives that have more issues. Since those have been eliminated from our
> detailed consideration, I'm unsure if they have issues.
>
> And the Intel issue is a more subtle one, having more to do with PCIe burst
> sizes than with crossing the 128k boundary as such. I've asked my contacts
> inside Intel (who I don't think read these lists) for the exact details.
>
And keep in mind the original description was this:
Quote:
Intel NVMe controllers have a slow path for I/Os that span
a 128KB stripe boundary but ZFS limits ashift, which is derived
from d_stripesize, to 13 (8KB) so we limit the stripesize
reported to geom(8) to 4KB.
This may result in a small number of additional I/Os
to require splitting in nvme(4), however the NVMe I/O
path is very efficient so these additional I/Os will cause
very minimal (if any) difference in performance or
CPU utilisation.
unquote
so the issue seems to be getting blown up a bit. It's better if you
don't generate these I/Os, but on the affected drives the driver copes
by splitting them, causing a small inefficiency: you're increasing the
number of commands needed to do the I/O, cutting into the IOPS budget.
Warner
> Warner
>
>
>> On Fri, Mar 11, 2016 at 2:07 AM, Alexander Motin <mav at freebsd.org> wrote:
>>
>>> On 11.03.16 06:58, Alan Somers wrote:
>>> > Do they behave badly for writes that cross a 128KB boundary, but are
>>> > nonetheless aligned to 128KB boundaries? Then I don't understand how
>>> > this change (or mav's replacement) is supposed to help. The stripesize
>>> > is supposed to be the minimum write that the device can accept without
>>> > requiring a read-modify-write. ZFS guarantees that it will never issue
>>> > a write smaller than the stripesize, nor will it ever issue a write
>>> that
>>> > is not aligned to a stripesize-boundary. But even if ZFS worked with
>>> > 128KB stripesizes, it would still happily issue writes a multiple of
>>> > 128KB in size, and these would cross those boundaries. Am I not
>>> > understanding something here?
>>>
>>> stripesize is not necessarily related to read-modify-write. It reports
>>> "some" native boundaries of the device. For example, a RAID0 array has
>>> stripes, and crossing them does not cause read-modify-write cycles, but
>>> it does cause I/O splits and head seeks on extra disks. This, as I
>>> understand it, is the case for some of Intel's NVMe device models here,
>>> and is the reason why a 128KB stripesize was originally reported.
>>>
>>> We can not demand that all file systems never issue I/Os smaller than
>>> the stripesize, since it can be 128KB, 1MB or even more (if they could,
>>> it would be called sectorsize). If ZFS (in this case) doesn't support
>>> allocation block sizes above 8K (and even that is very
>>> space-inefficient), and it has no other mechanism to optimize I/O
>>> alignment, then it is not a problem of the NVMe device or driver, but
>>> only of ZFS itself. So what I have done here is move the workaround from
>>> the improper place (NVMe) to the proper one (ZFS): NVMe now correctly
>>> reports its native 128K boundaries, which will be respected, for
>>> example, by gpart, which in turn helps UFS align its 32K blocks, while
>>> ZFS will correctly ignore values it can't optimize for, falling back to
>>> efficient 512-byte allocations.
>>>
>>> PS about the meaning of stripesize not being limited to
>>> read-modify-write: for example, a RAID5 of five 512e disks actually has
>>> three stripe sizes: 4K, 64K and 256K. Aligned 4K writes avoid
>>> read-modify-write inside the drive, I/Os that don't cross 64K boundaries
>>> without reason improve parallel performance, and aligned 256K writes
>>> avoid read-modify-write at the RAID5 level. Obviously not all of those
>>> optimizations are achievable in all environments, and the bigger the
>>> stripe size the harder it is to optimize for, but that does not mean
>>> such optimization is impossible. It would be good to be able to report
>>> all of them, allowing each consumer to use as many of them as it can.
>>>
>>> > On Thu, Mar 10, 2016 at 9:34 PM, Warner Losh <imp at bsdimp.com
>>> > <mailto:imp at bsdimp.com>> wrote:
>>> >
>>> > Some Intel NVMe drives behave badly when the LBA range crosses a
>>> > 128k boundary. Their
>>> > performance is worse for those transactions than for ones that
>>> don't
>>> > cross the 128k boundary.
>>> >
>>> > Warner
>>> >
>>> > On Thu, Mar 10, 2016 at 11:01 AM, Alan Somers <asomers at freebsd.org
>>> > <mailto:asomers at freebsd.org>> wrote:
>>> >
>>> > Are you saying that Intel NVMe controllers perform poorly for
>>> > all I/Os that are less than 128KB, or just for I/Os of any size
>>> > that cross a 128KB boundary?
>>> >
>>> > On Thu, Dec 10, 2015 at 7:06 PM, Steven Hartland
>>> > <smh at freebsd.org <mailto:smh at freebsd.org>> wrote:
>>> >
>>> > Author: smh
>>> > Date: Fri Dec 11 02:06:03 2015
>>> > New Revision: 292074
>>> > URL: https://svnweb.freebsd.org/changeset/base/292074
>>> >
>>> > Log:
>>> > Limit stripesize reported from nvd(4) to 4K
>>> >
>>> > Intel NVMe controllers have a slow path for I/Os that
>>> span
>>> > a 128KB stripe boundary but ZFS limits ashift, which is
>>> > derived from d_stripesize, to 13 (8KB) so we limit the
>>> > stripesize reported to geom(8) to 4KB.
>>> >
>>> > This may result in a small number of additional I/Os to
>>> > require splitting in nvme(4), however the NVMe I/O path is
>>> > very efficient so these additional I/Os will cause very
>>> > minimal (if any) difference in performance or CPU
>>> utilisation.
>>> >
>>> > This can be controlled by the new sysctl
>>> > kern.nvme.max_optimal_sectorsize.
>>> >
>>> > MFC after: 1 week
>>> > Sponsored by: Multiplay
>>> > Differential Revision:
>>> > https://reviews.freebsd.org/D4446
>>> >
>>> > Modified:
>>> > head/sys/dev/nvd/nvd.c
>>> > head/sys/dev/nvme/nvme.h
>>> > head/sys/dev/nvme/nvme_ns.c
>>> > head/sys/dev/nvme/nvme_sysctl.c
>>> >
>>> >
>>> >
>>>
>>>
>>> --
>>> Alexander Motin
>>>
>>
>>
>