svn commit: r292074 - in head/sys/dev: nvd nvme

Alexander Motin mav at FreeBSD.org
Fri Mar 11 09:07:14 UTC 2016


On 11.03.16 06:58, Alan Somers wrote:
> Do they behave badly for writes that cross a 128KB boundary, but are
> nonetheless aligned to 128KB boundaries?  Then I don't understand how
> this change (or mav's replacement) is supposed to help.  The stripesize
> is supposed to be the minimum write that the device can accept without
> requiring a read-modify-write.  ZFS guarantees that it will never issue
> a write smaller than the stripesize, nor will it ever issue a write that
> is not aligned to a stripesize-boundary.  But even if ZFS worked with
> 128KB stripesizes, it would still happily issue writes a multiple of
> 128KB in size, and these would cross those boundaries.  Am I not
> understanding something here?

stripesize is not necessarily related to read-modify-write.  It reports
"some" native boundary of the device.  For example, a RAID0 array has
stripes; crossing them does not cause read-modify-write cycles, but it
does cause the I/O to be split and heads to seek on extra disks.  That,
as I understand it, is the case for some of Intel's NVMe device models
here, and is the reason why a 128KB stripesize was originally reported.
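
To make the "native boundary" point concrete, here is a minimal C
sketch (the helper name and the hard-coded 128K constant are mine for
illustration, not taken from the driver) of the check a device or
driver effectively performs to decide whether a request straddles such
a boundary and must be split:

#include <stdbool.h>
#include <stdint.h>

#define NATIVE_BOUNDARY	(128 * 1024)	/* illustrative 128K boundary */

/*
 * Return true if the byte range [offset, offset + length) crosses a
 * NATIVE_BOUNDARY boundary, i.e. the device or driver would have to
 * split it into two (or more) child I/Os.
 */
static bool
io_crosses_boundary(uint64_t offset, uint64_t length)
{

	if (length == 0)
		return (false);
	return (offset / NATIVE_BOUNDARY !=
	    (offset + length - 1) / NATIVE_BOUNDARY);
}

A 64K write at offset 96K crosses the boundary and gets split; the same
64K write at offset 128K does not.  Reporting the boundary as
stripesize is what lets consumers aim for the second case.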

We can not demand that all file systems never issue I/Os smaller than
the stripesize, since it can be 128KB, 1MB or even more (otherwise it
would simply be called sectorsize).  If ZFS (in this case) doesn't
support allocation block sizes above 8K (and even that is very
space-inefficient), and it has no other mechanism to optimize I/O
alignment, then that is not a problem of the NVMe device or driver, but
only of ZFS itself.  So what I have done here is move the workaround
from the improper place (NVMe) to the proper one (ZFS): NVMe now
correctly reports its native 128K boundaries, which will be respected,
for example, by gpart, which in turn helps UFS align its 32K blocks,
while ZFS will correctly ignore values it can't optimize for, falling
back to efficient 512-byte allocations.
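
As a rough illustration of "ignore values it can't optimize for", the
consumer-side logic amounts to something like the sketch below.  The
constants and the helper name are my assumptions for the example, not
the actual ZFS code:

#include <stdint.h>

#define SHIFT_MIN	9	/* 512-byte logical sectors */
#define SHIFT_MAX	13	/* 8K, largest allocation shift honored */

/*
 * Derive an allocation shift from the reported stripesize.  If the
 * stripesize is larger than the biggest allocation block the file
 * system is willing to use, ignore it and fall back to the minimum.
 */
static int
shift_from_stripesize(uint64_t stripesize)
{
	int shift = SHIFT_MIN;

	while (((uint64_t)1 << shift) < stripesize && shift < SHIFT_MAX)
		shift++;
	if (((uint64_t)1 << shift) < stripesize)
		return (SHIFT_MIN);	/* e.g. 128K: fall back to 512 */
	return (shift);
}

With a 4K stripesize this yields 4K allocations; with a 128K stripesize
it falls back to 512 bytes, which is exactly the behavior described
above.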

PS about the meaning of stripesize not being limited to
read-modify-write: for example, a RAID5 of five 512e disks actually has
three stripe sizes: 4K, 64K and 256K.  Aligned 4K writes avoid
read-modify-write inside the drives, I/Os that don't cross 64K
boundaries without reason improve parallel performance, and aligned
256K writes avoid read-modify-write at the RAID5 level.  Obviously not
all of those optimizations are achievable in every environment, and the
bigger the stripe size the harder it is to optimize for, but that does
not mean such optimization is impossible.  It would be good to be able
to report all of them, allowing each consumer to use as many of them as
it can.
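
For completeness, the three numbers in that example follow directly
from the layout.  A small sketch with assumed parameters (five 512e
disks and a 64K per-disk stripe are my illustrative choices):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	const uint64_t ndisks = 5;		/* RAID5: 4 data + 1 parity */
	const uint64_t phys_sector = 4096;	/* 512e drive: 4K physical */
	const uint64_t disk_stripe = 64 * 1024;	/* per-disk stripe (assumed) */
	const uint64_t full_stripe = (ndisks - 1) * disk_stripe;

	/* 4K: avoids read-modify-write inside each 512e drive. */
	printf("drive-internal alignment: %juK\n",
	    (uintmax_t)(phys_sector / 1024));
	/* 64K: keeps one I/O on one member disk where possible. */
	printf("per-disk stripe:          %juK\n",
	    (uintmax_t)(disk_stripe / 1024));
	/* 256K: a full data stripe, so parity needs no prior read. */
	printf("full RAID5 stripe:        %juK\n",
	    (uintmax_t)(full_stripe / 1024));
	return (0);
}

Each consumer that can align to one of these values gets the
corresponding optimization; the ones it cannot honor it simply ignores.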

> On Thu, Mar 10, 2016 at 9:34 PM, Warner Losh <imp at bsdimp.com
> <mailto:imp at bsdimp.com>> wrote:
> 
>     Some Intel NVMe drives behave badly when the LBA range crosses a
>     128k boundary. Their
>     performance is worse for those transactions than for ones that don't
>     cross the 128k boundary.
> 
>     Warner
> 
>     On Thu, Mar 10, 2016 at 11:01 AM, Alan Somers <asomers at freebsd.org
>     <mailto:asomers at freebsd.org>> wrote:
> 
>         Are you saying that Intel NVMe controllers perform poorly for
>         all I/Os that are less than 128KB, or just for I/Os of any size
>         that cross a 128KB boundary?
> 
>         On Thu, Dec 10, 2015 at 7:06 PM, Steven Hartland
>         <smh at freebsd.org <mailto:smh at freebsd.org>> wrote:
> 
>             Author: smh
>             Date: Fri Dec 11 02:06:03 2015
>             New Revision: 292074
>             URL: https://svnweb.freebsd.org/changeset/base/292074
> 
>             Log:
>               Limit stripesize reported from nvd(4) to 4K
> 
>               Intel NVMe controllers have a slow path for I/Os that span
>             a 128KB stripe boundary but ZFS limits ashift, which is
>             derived from d_stripesize, to 13 (8KB) so we limit the
>             stripesize reported to geom(8) to 4KB.
> 
>               This may result in a small number of additional I/Os to
>             require splitting in nvme(4), however the NVMe I/O path is
>             very efficient so these additional I/Os will cause very
>             minimal (if any) difference in performance or CPU utilisation.
> 
>               This can be controlled by the new sysctl
>             kern.nvme.max_optimal_sectorsize.
> 
>               MFC after:    1 week
>               Sponsored by: Multiplay
>               Differential Revision:       
>             https://reviews.freebsd.org/D4446
> 
>             Modified:
>               head/sys/dev/nvd/nvd.c
>               head/sys/dev/nvme/nvme.h
>               head/sys/dev/nvme/nvme_ns.c
>               head/sys/dev/nvme/nvme_sysctl.c
> 
> 
> 


-- 
Alexander Motin

