svn commit: r292074 - in head/sys/dev: nvd nvme
Alan Somers
asomers at freebsd.org
Fri Mar 11 16:15:53 UTC 2016
Interesting. I didn't know about the alternate meaning of stripesize. I
agree then that there's currently no way to tune ZFS to respect NVME's
128KB boundaries. One could set vfs.zfs.vdev.aggregation_limit to 128KB,
but that would only halfway solve the problem, because allocations could be
unaligned. Frankly, I'm surprised that NVME drives should have such a
small limit when SATA and SAS devices commonly handle single commands that
span multiple MB. I don't think there's any way to adapt ZFS to this limit
without hurting it in other ways; for example by restricting its ability to
use large _or_ small record sizes.
Hopefully the NVME slow path isn't _too_ slow.
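
To make the "halfway" point concrete, here is a minimal sketch in C. It is
not code from nvd(4), nvme(4) or ZFS; the function names are made up for
illustration, and the only fact taken from this thread is the 128KB
boundary. It shows that capping the I/O size at 128KB does not help when
the starting offset is unaligned: such an I/O still spans a boundary and
would be split in two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 128KB boundary discussed in this thread. */
#define NVME_BOUNDARY	(128ULL * 1024ULL)

/*
 * Hypothetical helper: true if the byte range [offset, offset + length)
 * touches more than one boundary-sized region.
 */
bool
crosses_boundary(uint64_t offset, uint64_t length, uint64_t boundary)
{
	assert(boundary != 0);
	if (length == 0)
		return (false);
	return (offset / boundary != (offset + length - 1) / boundary);
}

/*
 * Hypothetical helper: number of pieces the I/O would have to be cut
 * into so that no piece crosses a boundary.
 */
uint64_t
split_count(uint64_t offset, uint64_t length, uint64_t boundary)
{
	assert(boundary != 0);
	if (length == 0)
		return (0);
	return ((offset + length - 1) / boundary - offset / boundary + 1);
}
```

With these definitions, a 128KB I/O at offset 0 does not cross a boundary,
while the same 128KB I/O at offset 4KB does and would need two pieces. A
256KB aligned I/O also needs two pieces, which is the case raised in the
quoted question below.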
On Fri, Mar 11, 2016 at 2:07 AM, Alexander Motin <mav at freebsd.org> wrote:
> On 11.03.16 06:58, Alan Somers wrote:
> > Do they behave badly for writes that cross a 128KB boundary, but are
> > nonetheless aligned to 128KB boundaries? Then I don't understand how
> > this change (or mav's replacement) is supposed to help. The stripesize
> > is supposed to be the minimum write that the device can accept without
> > requiring a read-modify-write. ZFS guarantees that it will never issue
> > a write smaller than the stripesize, nor will it ever issue a write that
> > is not aligned to a stripesize-boundary. But even if ZFS worked with
> > 128KB stripesizes, it would still happily issue writes a multiple of
> > 128KB in size, and these would cross those boundaries. Am I not
> > understanding something here?
>
> stripesize is not necessarily related to read-modify-write. It reports
> "some" native boundaries of the device. For example, a RAID0 array has
> stripes; crossing them does not cause read-modify-write cycles, but
> causes an I/O split and head seeks on extra disks. This, as I understand,
> is the case for some of Intel's NVMe device models here, and is the reason
> why a 128KB stripesize was originally reported.
>
> We cannot demand that all file systems never issue I/Os smaller than
> stripesize, since it can be 128KB, 1MB or even more (and since then it
> would be called sectorsize). If ZFS (in this case) doesn't support
> allocation block sizes above 8K (and even that is very
> space-inefficient), and it has no other mechanisms to optimize I/O
> alignment, then it is not a problem of the NVMe device or driver, but
> only of ZFS itself. So what I have done here is move the workaround from
> the improper place (NVMe) to the proper one (ZFS): NVMe now correctly
> reports its native 128K boundaries, which will be respected, for example,
> by gpart, which helps, for example, UFS align its 32K blocks, while ZFS
> will correctly ignore values for which it can't optimize, falling back to
> efficient 512-byte allocations.
>
> PS about the meaning of stripesize not being limited to read-modify-write:
> for example, a RAID5 of 5 512e disks actually has three stripe sizes: 4K,
> 64K and 256K. Aligned writes of 4K avoid read-modify-write inside the
> drive, I/Os that do not needlessly cross 64K boundaries improve parallel
> performance, and aligned writes of 256K avoid read-modify-write at the
> RAID5 level. Obviously not all of those optimizations are achievable in
> all environments, and the bigger the stripe size, the harder it is to
> optimize for it, but that does not mean such optimization is impossible.
> It would be good to be able to report all of them, allowing each consumer
> to use as many of them as it can.
>
> > On Thu, Mar 10, 2016 at 9:34 PM, Warner Losh <imp at bsdimp.com>
> > wrote:
> >
> > Some Intel NVMe drives behave badly when the LBA range crosses a
> > 128k boundary. Their performance is worse for those transactions
> > than for ones that don't cross the 128k boundary.
> >
> > Warner
> >
> > On Thu, Mar 10, 2016 at 11:01 AM, Alan Somers <asomers at freebsd.org>
> > wrote:
> >
> > Are you saying that Intel NVMe controllers perform poorly for
> > all I/Os that are less than 128KB, or just for I/Os of any size
> > that cross a 128KB boundary?
> >
> > On Thu, Dec 10, 2015 at 7:06 PM, Steven Hartland
> > <smh at freebsd.org> wrote:
> >
> > Author: smh
> > Date: Fri Dec 11 02:06:03 2015
> > New Revision: 292074
> > URL: https://svnweb.freebsd.org/changeset/base/292074
> >
> > Log:
> > Limit stripesize reported from nvd(4) to 4K
> >
> > Intel NVMe controllers have a slow path for I/Os that span
> > a 128KB stripe boundary but ZFS limits ashift, which is
> > derived from d_stripesize, to 13 (8KB) so we limit the
> > stripesize reported to geom(8) to 4KB.
> >
> > This may result in a small number of additional I/Os to
> > require splitting in nvme(4), however the NVMe I/O path is
> > very efficient so these additional I/Os will cause very
> > minimal (if any) difference in performance or CPU utilisation.
> >
> > This can be controlled by the new sysctl
> > kern.nvme.max_optimal_sectorsize.
> >
> > MFC after: 1 week
> > Sponsored by: Multiplay
> > Differential Revision:
> > https://reviews.freebsd.org/D4446
> >
> > Modified:
> > head/sys/dev/nvd/nvd.c
> > head/sys/dev/nvme/nvme.h
> > head/sys/dev/nvme/nvme_ns.c
> > head/sys/dev/nvme/nvme_sysctl.c
> >
> >
> >
>
>
> --
> Alexander Motin
>
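
For reference, the ashift behaviour described in the quoted messages can be
sketched roughly as follows. This is a hypothetical illustration in C, not
the actual ZFS code: the names choose_ashift and log2_pow2 are made up, and
the only facts taken from the thread are that ashift is derived from the
reported stripesize, that it is capped at 13 (8KB), and that ZFS should
ignore values it cannot optimize for and fall back to the sector size. The
power-of-two check is an extra assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Cap on ashift mentioned in the commit log: 13, i.e. 8KB. */
#define ASHIFT_MAX	13

/* log2 of a non-zero power of two. */
unsigned
log2_pow2(uint64_t v)
{
	unsigned shift = 0;

	assert(v != 0 && (v & (v - 1)) == 0);
	while (v > 1) {
		v >>= 1;
		shift++;
	}
	return (shift);
}

/*
 * Hypothetical helper: pick an allocation shift from the sector size and
 * the stripesize a provider reports. A stripesize is used only if it is a
 * power of two, larger than the sector size, and no larger than
 * 1 << ASHIFT_MAX; anything else is ignored in favour of the sector size.
 */
unsigned
choose_ashift(uint64_t sectorsize, uint64_t stripesize)
{
	unsigned logical = log2_pow2(sectorsize);

	if (stripesize <= sectorsize ||
	    (stripesize & (stripesize - 1)) != 0)
		return (logical);
	if (log2_pow2(stripesize) > ASHIFT_MAX)
		return (logical);
	return (log2_pow2(stripesize));
}
```

Under these assumptions, a 512-byte-sector device reporting a 4KB
stripesize gets ashift 12, while the same device reporting 128KB gets
ashift 9, which matches the "falling back to 512-byte allocations"
behaviour described above.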