svn commit: r292074 - in head/sys/dev: nvd nvme

Alan Somers asomers at freebsd.org
Fri Mar 11 16:15:53 UTC 2016


Interesting.  I didn't know about the alternate meaning of stripesize.  I
agree then that there's currently no way to tune ZFS to respect NVMe's
128KB boundaries.  One could set vfs.zfs.vdev.aggregation_limit to 128KB,
but that would only halfway solve the problem, because allocations could
still be unaligned.  Frankly, I'm surprised that NVMe drives have such a
small limit when SATA and SAS devices commonly handle single commands
that span multiple megabytes.  I don't think there's any way to adapt ZFS
to this limit without hurting it in other ways, for example by
restricting its ability to use large _or_ small record sizes.
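
To make the alignment point concrete, here is a minimal sketch (purely
illustrative C, not code from nvd(4) or nvme(4)) of the condition being
discussed: an I/O stays on the fast path only if its whole range fits in
one 128KB window, so an unaligned 128KB write crosses a boundary just as
an aligned 256KB write does.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BOUNDARY (128 * 1024)   /* native boundary reported by the device */

/* Does the byte range [offset, offset + length) cross a 128KB boundary? */
static bool
crosses_boundary(uint64_t offset, uint64_t length)
{
        return ((offset / BOUNDARY) != ((offset + length - 1) / BOUNDARY));
}

int
main(void)
{
        /* 128KB write starting on a 128KB boundary: stays in one window. */
        printf("%d\n", crosses_boundary(0, 128 * 1024));         /* 0 */
        /* Same size, shifted by 64KB: crosses a boundary. */
        printf("%d\n", crosses_boundary(64 * 1024, 128 * 1024)); /* 1 */
        /* Aligned to 128KB but 256KB long: also crosses a boundary. */
        printf("%d\n", crosses_boundary(0, 256 * 1024));         /* 1 */
        return (0);
}

So capping the aggregation limit would bound the size but not the
starting offset, which is why it only halfway solves the problem.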

Hopefully the NVMe slow path isn't _too_ slow.

On Fri, Mar 11, 2016 at 2:07 AM, Alexander Motin <mav at freebsd.org> wrote:

> On 11.03.16 06:58, Alan Somers wrote:
> > Do they behave badly for writes that cross a 128KB boundary, but are
> > nonetheless aligned to 128KB boundaries?  Then I don't understand how
> > this change (or mav's replacement) is supposed to help.  The stripesize
> > is supposed to be the minimum write that the device can accept without
> > requiring a read-modify-write.  ZFS guarantees that it will never issue
> > a write smaller than the stripesize, nor will it ever issue a write that
> > is not aligned to a stripesize-boundary.  But even if ZFS worked with
> > 128KB stripesizes, it would still happily issue writes that are a
> > multiple of 128KB in size, and these would cross those boundaries.  Am
> > I not understanding something here?
>
> stripesize is not necessarily related to read-modify-write.  It reports
> "some" native boundaries of the device.  For example, a RAID0 array has
> stripes; crossing them does not cause read-modify-write cycles, but it
> does cause I/O splitting and head seeks on extra disks.  This, as I
> understand it, is the case for some of Intel's NVMe device models here,
> and is the reason the 128KB stripesize was originally reported.
>
> We cannot demand that all file systems never issue I/Os smaller than the
> stripesize, since it can be 128KB, 1MB or even more (and if they did, it
> would be called sectorsize).  If ZFS (in this case) doesn't support
> allocation block sizes above 8K (and even that is very
> space-inefficient), and it has no other mechanism to optimize I/O
> alignment, then that is not a problem of the NVMe device or driver, but
> only of ZFS itself.  So what I have done here is move the workaround from
> the improper place (NVMe) to the proper one (ZFS): NVMe now correctly
> reports its native 128K boundaries, which will be respected, for example,
> by gpart, which in turn helps UFS align its 32K blocks, while ZFS will
> correctly ignore values it cannot optimize for, falling back to efficient
> 512-byte allocations.
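
A rough sketch of the fallback being described (not the actual ZFS vdev
code; the constants are assumptions taken from the 512-byte sector and
8K ashift limit mentioned in this thread):

#include <stdint.h>

#define SECTOR_SHIFT 9   /* 512-byte logical sectors (assumed) */
#define MAX_ASHIFT   13  /* 8K: largest allocation shift accepted here */

/*
 * Derive an allocation shift from the reported stripesize.  If honouring
 * the stripesize would need a larger shift than is supported, ignore it
 * and fall back to the logical sector size, as described above.
 */
int
choose_ashift(uint64_t stripesize)
{
        int shift = SECTOR_SHIFT;

        while ((1ULL << shift) < stripesize)
                shift++;
        if (shift > MAX_ASHIFT)
                return (SECTOR_SHIFT);  /* e.g. 128K -> fall back to 512B */
        return (shift);                 /* e.g. 4K -> shift of 12 */
}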
>
> PS about the meaning of stripesize not being limited to read-modify-write:
> for example, a RAID5 of five 512e disks actually has three stripe sizes,
> 4K, 64K and 256K.  Aligned 4K writes avoid read-modify-write inside the
> drive, I/Os that do not cross 64K boundaries without reason improve
> parallel performance, and aligned 256K writes avoid read-modify-write at
> the RAID5 level.  Obviously not all of those optimizations are achievable
> in all environments, and the bigger the stripe size the harder it is to
> optimize for, but that does not mean such optimization is impossible.  It
> would be good to be able to report all of them, allowing each consumer to
> use as many of them as it can.
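
As an illustration of reporting several nested levels, here is a small
hypothetical consumer-side check using the values from the five-disk
512e RAID5 example above (nothing here comes from an existing API):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical boundary levels for the 5-disk RAID5 of 512e drives. */
static const uint64_t boundaries[] = {
        4 * 1024,    /* 4K: physical sector of each 512e drive */
        64 * 1024,   /* 64K: per-disk stripe; crossing it costs extra seeks */
        256 * 1024,  /* 256K: full data stripe (4 x 64K); avoids RAID5 RMW */
};

/*
 * Count how many of the reported levels an I/O is aligned to (offset and
 * length both multiples of the level) -- a simplification of the per-level
 * rules above, just to show how a consumer could use as many as it can.
 */
int
levels_respected(uint64_t offset, uint64_t length)
{
        int n = 0;

        for (size_t i = 0; i < sizeof(boundaries) / sizeof(boundaries[0]); i++) {
                if (offset % boundaries[i] == 0 && length % boundaries[i] == 0)
                        n++;
        }
        return (n);
}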
>
> > On Thu, Mar 10, 2016 at 9:34 PM, Warner Losh <imp at bsdimp.com> wrote:
> >
> >     Some Intel NVMe drives behave badly when the LBA range crosses a
> >     128k boundary. Their
> >     performance is worse for those transactions than for ones that don't
> >     cross the 128k boundary.
> >
> >     Warner
> >
> >     On Thu, Mar 10, 2016 at 11:01 AM, Alan Somers <asomers at freebsd.org> wrote:
> >
> >         Are you saying that Intel NVMe controllers perform poorly for
> >         all I/Os that are less than 128KB, or just for I/Os of any size
> >         that cross a 128KB boundary?
> >
> >         On Thu, Dec 10, 2015 at 7:06 PM, Steven Hartland
> >         <smh at freebsd.org> wrote:
> >
> >             Author: smh
> >             Date: Fri Dec 11 02:06:03 2015
> >             New Revision: 292074
> >             URL: https://svnweb.freebsd.org/changeset/base/292074
> >
> >             Log:
> >               Limit stripesize reported from nvd(4) to 4K
> >
> >               Intel NVMe controllers have a slow path for I/Os that span
> >             a 128KB stripe boundary, but ZFS limits ashift, which is
> >             derived from d_stripesize, to 13 (8KB), so we limit the
> >             stripesize reported to geom(8) to 4KB.
> >
> >               This may result in a small number of additional I/Os
> >             requiring splitting in nvme(4); however, the NVMe I/O path
> >             is very efficient, so these additional I/Os will cause very
> >             minimal (if any) difference in performance or CPU
> >             utilisation.
> >
> >               This can be controlled by the new sysctl
> >             kern.nvme.max_optimal_sectorsize.
> >
> >               MFC after:    1 week
> >               Sponsored by: Multiplay
> >               Differential Revision:
> >             https://reviews.freebsd.org/D4446
> >
> >             Modified:
> >               head/sys/dev/nvd/nvd.c
> >               head/sys/dev/nvme/nvme.h
> >               head/sys/dev/nvme/nvme_ns.c
> >               head/sys/dev/nvme/nvme_sysctl.c
> >
> >
> >
>
>
> --
> Alexander Motin
>

