svn commit: r319749 - head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs

Fri Jun 9 15:27:23 UTC 2017

Author: avg
Date: Fri Jun  9 15:27:22 2017
New Revision: 319749
URL: https://svnweb.freebsd.org/changeset/base/319749

Log:
  MFV r319739: 8005 poor performance of 1MB writes on certain RAID-Z configurations
  
  illumos/illumos-gate at 5b062782532a1d5961c4a4b655906e1238c7c908
  https://github.com/illumos/illumos-gate/commit/5b062782532a1d5961c4a4b655906e1238c7c908
  
  https://www.illumos.org/issues/8005
    RAID-Z requires that space be allocated in multiples of P+1 sectors,
    because this is the minimum size block that can have the required amount
    of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2
    sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector
    is a unit of 2^ashift bytes, typically 512B or 4KB.
    To satisfy this constraint, the allocation size is rounded up to the
    proper multiple, resulting in up to 3 "pad sectors" at the end of some
    blocks. The contents of these pad sectors are not used, so we do not
    need to read or write these sectors. However, some storage hardware
    performs much worse (around 1/2 as fast) on mostly-contiguous writes
    when there are small gaps of non-overwritten data between the writes.
    Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that
    include pad sectors. If writing a pad sector will fill the gap between
    two (required) writes, we will issue the optional zio, thus doubling
    performance. The gap-filling performance improvement was introduced in
    July 2009.
    Writing the optional zio is done by the io aggregation code in
    vdev_queue.c. The problem is that it is also subject to the limit on
    the size of aggregate writes, zfs_vdev_aggregation_limit, which is by
    default 128KB. For a given block, if the amount of data plus padding
    written to a leaf device exceeds zfs_vdev_aggregation_limit, the
    optional zio will not be written, resulting in a ~2x performance
    degradation.
    The problem occurs only for certain values of ashift, compressed block
    size, and RAID-Z configuration (number of parity and data disks). It
    cannot occur with the default recordsize=128KB. If compression is
    enabled, all configurations with recordsize=1MB or larger will be
    impacted to some degree.
    The problem notably occurs with recordsize=1MB, compression=off, with 10
    disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore
  
  Reviewed by: Saso Kiselkov <saso.kiselkov at nexenta.com>
  Reviewed by: George Wilson <george.wilson at delphix.com>
  Reviewed by: Pavel Zakharov <pavel.zakharov at delphix.com>
  Approved by: Robert Mustacchi <rm at joyent.com>
  Author: Matthew Ahrens <mahrens at delphix.com>
  MFC after:	10 days

Modified:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c
Directory Properties:
  head/sys/cddl/contrib/opensolaris/   (props changed)

Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c
==============================================================================

--- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c	Fri Jun  9 15:26:03 2017	(r319748)
+++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c	Fri Jun  9 15:27:22 2017	(r319749)
@@ -24,7 +24,7 @@
  */
 
 /*
- * Copyright (c) 2012, 2014 by Delphix. All rights reserved.
+ * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
  * Copyright (c) 2014 Integros [integros.com]
  */
 
@@ -681,7 +681,7 @@ vdev_queue_aggregate(vdev_queue_t *vq, zio_t *zio)
 
 	/*
 	 * Walk backwards through sufficiently contiguous I/Os
-	 * recording the last non-option I/O.
+	 * recording the last non-optional I/O.
 	 */
 	flags = zio->io_flags & ZIO_FLAG_AGG_INHERIT;
 	t = vdev_queue_type_tree(vq, zio->io_type);
@@ -704,10 +704,14 @@ vdev_queue_aggregate(vdev_queue_t *vq, zio_t *zio)
 
 	/*
 	 * Walk forward through sufficiently contiguous I/Os.
+	 * The aggregation limit does not apply to optional i/os, so that
+	 * we can issue contiguous writes even if they are larger than the
+	 * aggregation limit.
 	 */
 	while ((dio = AVL_NEXT(t, last)) != NULL &&
 	    (dio->io_flags & ZIO_FLAG_AGG_INHERIT) == flags &&
-	    IO_SPAN(first, dio) <= zfs_vdev_aggregation_limit &&
+	    (IO_SPAN(first, dio) <= zfs_vdev_aggregation_limit ||
+	    (dio->io_flags & ZIO_FLAG_OPTIONAL)) &&
 	    IO_GAP(last, dio) <= maxgap) {
 		last = dio;
 		if (!(last->io_flags & ZIO_FLAG_OPTIONAL))
@@ -739,10 +743,16 @@ vdev_queue_aggregate(vdev_queue_t *vq, zio_t *zio)
 	}
 
 	if (stretch) {
-		/* This may be a no-op. */
+		/*
+		 * We are going to include an optional io in our aggregated
+		 * span, thus closing the write gap.  Only mandatory i/os can
+		 * start aggregated spans, so make sure that the next i/o
+		 * after our span is mandatory.
+		 */
 		dio = AVL_NEXT(t, last);
 		dio->io_flags &= ~ZIO_FLAG_OPTIONAL;
 	} else {
+		/* do not include the optional i/o */
 		while (last != mandatory && last != first) {
 			ASSERT(last->io_flags & ZIO_FLAG_OPTIONAL);
 			last = AVL_PREV(t, last);
@@ -754,7 +764,7 @@ vdev_queue_aggregate(vdev_queue_t *vq, zio_t *zio)
 		return (NULL);
 
 	size = IO_SPAN(first, last);
-	ASSERT3U(size, <=, zfs_vdev_aggregation_limit);
+	ASSERT3U(size, <=, SPA_MAXBLOCKSIZE);
 
 	abuf = zio_buf_alloc_nowait(size);
 	if (abuf == NULL)