Fwd: review: 4045 zfs write throttle & i/o scheduler performance work

Martin Matuska mm at FreeBSD.org
Thu Aug 15 21:36:04 UTC 2013


-------- Original Message --------
Subject: 	review: 4045 zfs write throttle & i/o scheduler performance work
Date: 	Thu, 15 Aug 2013 11:35:00 -0700
From: 	Matthew Ahrens <mahrens at delphix.com>
To: 	illumos-zfs <zfs at lists.illumos.org>, Brian Behlendorf
<behlendorf1 at llnl.gov>, Martin Matuska <martin at matuska.org>, Will
Andrews <will at firepipe.net>, "Justin T. Gibbs" <gibbs at freebsd.org>,
Richard <ryao at cs.stonybrook.edu>, Jorgen Lundman <lundman at lundman.net>


http://cr.illumos.org/~webrev/csiden/illumos-throttle/

This review mainly consists of two related performance improvements:

1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes:
sync read, sync write, async read, async write, and scrub/resilver.  The
scheduler issues a number of concurrent i/os from each class to the
device.  Once a class has been selected, an i/o is selected from this
class using either an elevator algorithm (async, scrub classes) or FIFO
(sync classes).  The number of concurrent async write i/os is tuned
dynamically based on i/o load, to achieve good sync i/o latency when
there is not a high load of writes, and good write throughput when there
is.  See the block comment in vdev_queue.c (reproduced below) for more
details.

2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent
delays when under constant load.  The new write throttle is based on the
amount of dirty data, rather than guesses about future performance of
the system.  When there is a lot of dirty data, each transaction (e.g.
write() syscall) will be delayed by the same small amount.  This
eliminates the "brick wall of wait" that the old write throttle could
hit, causing all transactions to wait several seconds until the next txg
opens.  One of the keys to the new write throttle is decrementing the
amount of dirty data as i/o completes, rather than at the end of
spa_sync().  Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. 
See the block comments in dsl_pool.c and above dmu_tx_delay()
(reproduced below) for more details.
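
To make the shape of the throttle concrete, here is a minimal sketch of
the dirty-data-based delay calculation.  It follows the min_time formula
from the dmu_tx_delay() comment reproduced below; the function name and
parameters are illustrative assumptions, not the actual code:

    #include <stdint.h>

    #define MIN(a, b)   ((a) < (b) ? (a) : (b))

    /*
     * Sketch only: the delay grows hyperbolically as the amount of
     * dirty data approaches the limit.  dirty_min is the point at which
     * we begin delaying; dirty_max corresponds to zfs_dirty_data_max.
     */
    static uint64_t
    tx_delay_ns(uint64_t dirty, uint64_t dirty_min, uint64_t dirty_max,
        uint64_t scale, uint64_t delay_max_ns)
    {
        uint64_t ns;

        if (dirty <= dirty_min)
            return (0);             /* plenty of headroom: no delay */
        if (dirty >= dirty_max)
            return (delay_max_ns);  /* at the limit: maximum delay */

        /* min_time = scale * (dirty - min) / (max - dirty) */
        ns = scale * (dirty - dirty_min) / (dirty_max - dirty);
        return (MIN(ns, delay_max_ns));
    }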

This diff has several other effects, including:

 * the commonly-tuned global variable zfs_vdev_max_pending has been
removed; use per-class zfs_vdev_*_max_active values or
zfs_vdev_max_active instead.

 * the size of each txg (meaning the amount of dirty data written, and
thus the time it takes to write out) is now controlled differently.
There is no longer an explicit time goal; the primary determinant is the
amount of dirty data.  Systems under light or medium load will now often
have a txg syncing at all times, but the impact on performance (e.g.
read latency) is minimal.  Tune zfs_dirty_data_max and
zfs_dirty_data_sync to control this.

 * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc.  This improves latency by not allowing these
CPU-intensive tasks to consume all CPUs (on machines with at least 4
CPUs; the percentage is rounded up).
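
A quick illustration of the rounding in that last point (just the
arithmetic, not the actual taskq-creation code):

    #include <stdio.h>

    int
    main(void)
    {
        int ncpus;

        for (ncpus = 1; ncpus <= 8; ncpus++) {
            /* 75% of the CPUs, rounded up. */
            int nthreads = (ncpus * 75 + 99) / 100;
            printf("%d CPUs -> %d taskq threads\n", ncpus, nthreads);
        }
        return (0);
    }

With fewer than 4 CPUs the rounded-up value equals the CPU count, which
is why the cap only takes effect on machines with at least 4 CPUs.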

--matt


APPENDIX: problems with the current i/o scheduler

The current ZFS i/o scheduler (vdev_queue.c) is deadline-based.  The
problem with this approach is that if there are always i/os pending,
certain classes of i/os can see very long delays.

For example, if there are always synchronous reads outstanding, then no
async writes will be serviced until they become "past due".  One symptom
of this situation is that each pass of the txg sync takes at least
several seconds (typically 3 seconds).

If many i/os become "past due" (their deadline is in the past), then we
must service all of these overdue i/os before any new i/os.  This
happens when we enqueue a batch of async writes for the txg sync, with
deadlines 2.5 seconds in the future.  If we can't complete all the i/os
in 2.5 seconds (e.g. because there were always reads pending), then
these i/os will become past due.  Now we must service all the "async"
writes (which could be hundreds of megabytes) before we service any
reads, introducing considerable latency to synchronous i/os (reads or
ZIL writes).
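
A simplified sketch of the failure mode (this is not the actual old
vdev_queue.c code; the struct and comparator are assumptions that just
illustrate the ordering): i/os sort by deadline first, so once a large
batch of async writes goes past due, every one of them orders ahead of
any newly arriving read.

    #include <stdint.h>

    typedef struct old_io {
        uint64_t io_deadline;   /* absolute time by which to issue */
        uint64_t io_offset;     /* LBA, used to break ties */
    } old_io_t;

    static int
    old_deadline_compare(const void *x, const void *y)
    {
        const old_io_t *a = x, *b = y;

        /* Earlier deadline always wins -- past-due i/os sort first. */
        if (a->io_deadline != b->io_deadline)
            return (a->io_deadline < b->io_deadline ? -1 : 1);
        /* Tie-break by offset (elevator order within a deadline). */
        if (a->io_offset != b->io_offset)
            return (a->io_offset < b->io_offset ? -1 : 1);
        return (0);
    }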

REFERENCE: block comments mentioned above:


        /*
         * ZFS Write Throttle
         * ------------------
         *
         * ZFS must limit the rate of incoming writes to the rate at which it is able
         * to sync data modifications to the backend storage. Throttling by too much
         * creates an artificial limit; throttling by too little can only be sustained
         * for short periods and would lead to highly lumpy performance. On a per-pool
         * basis, ZFS tracks the amount of modified (dirty) data. As operations change
         * data, the amount of dirty data increases; as ZFS syncs out data, the amount
         * of dirty data decreases. When the amount of dirty data exceeds a
         * predetermined threshold further modifications are blocked until the amount
         * of dirty data decreases (as data is synced out).
         *
         * The limit on dirty data is tunable, and should be adjusted according to
         * both the IO capacity and available memory of the system. The larger the
         * window, the more ZFS is able to aggregate and amortize metadata (and data)
         * changes. However, memory is a limited resource, and allowing for more dirty
         * data comes at the cost of keeping other useful data in memory (for example
         * ZFS data cached by the ARC).
         *
         * Implementation
         *
         * As buffers are modified dsl_pool_willuse_space() increments both the per-
         * txg (dp_dirty_pertxg[]) and poolwide (dp_dirty_total) accounting of
         * dirty space used; dsl_pool_dirty_space() decrements those values as data
         * is synced out from dsl_pool_sync(). While only the poolwide value is
         * relevant, the per-txg value is useful for debugging. The tunable
         * zfs_dirty_data_max determines the dirty space limit. Once that value is
         * exceeded, new writes are halted until space frees up.
         *
         * The zfs_dirty_data_sync tunable dictates the threshold at which we
         * ensure that there is a txg syncing (see the comment in txg.c for a full
         * description of transaction group stages).
         *
         * The IO scheduler uses both the dirty space limit and current amount of
         * dirty data as inputs. Those values affect the number of concurrent IOs ZFS
         * issues. See the comment in vdev_queue.c for details of the IO scheduler.
         *
         * The delay is also calculated based on the amount of dirty data.  See the
         * comment above dmu_tx_delay() for details.
         */
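
A minimal sketch of the accounting described above (field and function
names follow the comment; the real dsl_pool_t has many more fields, and
locking is omitted here):

    #include <stdint.h>

    #define TXG_SIZE 4      /* number of in-flight txgs, as in txg.h */

    typedef struct dsl_pool {
        uint64_t dp_dirty_pertxg[TXG_SIZE]; /* per-txg dirty bytes */
        uint64_t dp_dirty_total;            /* poolwide dirty bytes */
    } dsl_pool_t;

    /* Called as buffers are dirtied (cf. dsl_pool_willuse_space()). */
    static void
    pool_dirty_add(dsl_pool_t *dp, uint64_t txg, uint64_t space)
    {
        dp->dp_dirty_pertxg[txg & (TXG_SIZE - 1)] += space;
        dp->dp_dirty_total += space;
    }

    /* Called as write i/o completes, not at the end of spa_sync(). */
    static void
    pool_dirty_sub(dsl_pool_t *dp, uint64_t txg, uint64_t space)
    {
        dp->dp_dirty_pertxg[txg & (TXG_SIZE - 1)] -= space;
        dp->dp_dirty_total -= space;
    }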


        /*
         * We delay transactions when we've determined that the backend storage
         * isn't able to accommodate the rate of incoming writes.
         *
         * If there is already a transaction waiting, we delay relative to when
         * that transaction finishes waiting.  This way the calculated min_time
         * is independent of the number of threads concurrently executing
         * transactions.
         *
         * If we are the only waiter, wait relative to when the transaction
         * started, rather than the current time.  This credits the transaction for
         * "time already served", e.g. reading indirect blocks.
         *
         * The minimum time for a transaction to take is calculated as:
         *     min_time = scale * (dirty - min) / (max - dirty)
         *     min_time is then capped at zfs_delay_max_ns.
         *
         * The delay has two degrees of freedom that can be adjusted via tunables.
         * The percentage of dirty data at which we start to delay is defined by
         * zfs_delay_min_dirty_percent. This should typically be at or above
         * zfs_vdev_async_write_active_max_dirty_percent so that we only start to
         * delay after writing at full speed has failed to keep up with the incoming
         * write rate. The scale of the curve is defined by zfs_delay_scale. Roughly
         * speaking, this variable determines the amount of delay at the midpoint of
         * the curve.
         *
         * delay
         *  10ms +-------------------------------------------------------------*+
         *       |                                                             *|
         *   9ms +                                                             *+
         *       |                                                             *|
         *   8ms +                                                             *+
         *       |                                                            * |
         *   7ms +                                                            * +
         *       |                                                            * |
         *   6ms +                                                            * +
         *       |                                                            * |
         *   5ms +                                                           *  +
         *       |                                                           *  |
         *   4ms +                                                           *  +
         *       |                                                           *  |
         *   3ms +                                                          *   +
         *       |                                                          *   |
         *   2ms +                                              (midpoint) *    +
         *       |                                                  |    **     |
         *   1ms +                                                  v ***       +
         *       |             zfs_delay_scale ---------->     ********         |
         *     0 +-------------------------------------*********----------------+
         *       0%                    <- zfs_dirty_data_max ->               100%
         *
         * Note that since the delay is added to the outstanding time remaining on the
         * most recent transaction, the delay is effectively the inverse of IOPS.
         * Here the midpoint of 500us translates to 2000 IOPS. The shape of the curve
         * was chosen such that small changes in the amount of accumulated dirty data
         * in the first 3/4 of the curve yield relatively small differences in the
         * amount of delay.
         *
         * The effects can be easier to understand when the amount of delay is
         * represented on a log scale:
         *
         * delay
         * 100ms +-------------------------------------------------------------++
         *       +                                                              +
         *       |                                                              |
         *       +                                                             *+
         *  10ms +                                                             *+
         *       +                                                           ** +
         *       |                                              (midpoint)  **  |
         *       +                                                  |     **    +
         *   1ms +                                                  v ****      +
         *       +             zfs_delay_scale ---------->        *****         +
         *       |                                             ****             |
         *       +                                          ****                +
         * 100us +                                        **                    +
         *       +                                       *                      +
         *       |                                      *                       |
         *       +                                     *                        +
         *  10us +                                     *                        +
         *       +                                                              +
         *       |                                                              |
         *       +                                                              +
         *       +--------------------------------------------------------------+
         *       0%                    <- zfs_dirty_data_max ->               100%
         *
         * Note here that only as the amount of dirty data approaches its limit does
         * the delay start to increase rapidly. The goal of a properly tuned system
         * should be to keep the amount of dirty data out of that range by first
         * ensuring that the appropriate limits are set for the I/O scheduler to reach
         * optimal throughput on the backend storage, and then by changing the value
         * of zfs_delay_scale to increase the steepness of the curve.
         */
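
A quick worked example of the curve above, assuming the delay threshold
(min) is 60% of zfs_dirty_data_max and zfs_delay_scale is 500,000 (i.e.
500us at the midpoint); the program is illustrative only:

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        const uint64_t scale = 500000;      /* zfs_delay_scale, in ns */
        const uint64_t min = 60, max = 100; /* % of zfs_dirty_data_max */
        uint64_t dirty;

        for (dirty = 65; dirty < max; dirty += 5) {
            /* min_time = scale * (dirty - min) / (max - dirty) */
            uint64_t ns = scale * (dirty - min) / (max - dirty);
            printf("dirty %3llu%% -> delay %8llu ns\n",
                (unsigned long long)dirty, (unsigned long long)ns);
        }
        return (0);
    }

At 80% dirty (the midpoint of the delayed range) this prints 500,000ns:
a 500us per-transaction delay, or the 2000 IOPS mentioned above.  At 95%
the delay has already climbed to 3.5ms.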


        /*
         * ZFS I/O Scheduler
         * ---------------
         *
         * ZFS issues I/O operations to leaf vdevs to satisfy and complete zios.  The
         * I/O scheduler determines when and in what order those operations are
         * issued.  The I/O scheduler divides operations into five I/O classes
         * prioritized in the following order: sync read, sync write, async read,
         * async write, and scrub/resilver.  Each queue defines the minimum and
         * maximum number of concurrent operations that may be issued to the device.
         * In addition, the device has an aggregate maximum. Note that the sum of the
         * per-queue minimums must not exceed the aggregate maximum, and if the
         * aggregate maximum is equal to or greater than the sum of the per-queue
         * maximums, the per-queue minimum has no effect.
         *
         * For many physical devices, throughput increases with the number of
         * concurrent operations, but latency typically suffers. Further, physical
         * devices typically have a limit at which more concurrent operations have no
         * effect on throughput or can actually cause it to decrease.
         *
         * The scheduler selects the next operation to issue by first looking for an
         * I/O class whose minimum has not been satisfied. Once all are satisfied and
         * the aggregate maximum has not been hit, the scheduler looks for classes
         * whose maximum has not been satisfied. Iteration through the I/O classes is
         * done in the order specified above. No further operations are issued if the
         * aggregate maximum number of concurrent operations has been hit or if there
         * are no operations queued for an I/O class that has not hit its maximum.
         * Every time an i/o is queued or an operation completes, the I/O scheduler
         * looks for new operations to issue.
         *
         * All I/O classes have a fixed maximum number of outstanding operations
         * except for the async write class. Asynchronous writes represent the data
         * that is committed to stable storage during the syncing stage for
         * transaction groups (see txg.c). Transaction groups enter the syncing state
         * periodically so the number of queued async writes will quickly burst up and
         * then bleed down to zero. Rather than servicing them as quickly as possible,
         * the I/O scheduler changes the maximum number of active async write i/os
         * according to the amount of dirty data in the pool (see dsl_pool.c). Since
         * both throughput and latency typically increase with the number of
         * concurrent operations issued to physical devices, reducing the burstiness
         * in the number of concurrent operations also stabilizes the response time of
         * operations from other -- and in particular synchronous -- queues. In broad
         * strokes, the I/O scheduler will issue more concurrent operations from the
         * async write queue as there's more dirty data in the pool.
         *
         * Async Writes
         *
         * The number of concurrent operations issued for the async write I/O class
         * follows a piece-wise linear function defined by a few adjustable points.
         *
         *        |                   o---------| <-- zfs_vdev_async_write_max_active
         *   ^    |                  /^         |
         *   |    |                 / |         |
         * active |                /  |         |
         *  I/O   |               /   |         |
         * count  |              /    |         |
         *        |             /     |         |
         *        |------------o      |         | <-- zfs_vdev_async_write_min_active
         *       0|____________^______|_________|
         *        0%           |      |       100% of zfs_dirty_data_max
         *                     |      |
         *                     |      `-- zfs_vdev_async_write_active_max_dirty_percent
         *                     `--------- zfs_vdev_async_write_active_min_dirty_percent
         *
         * Until the amount of dirty data exceeds a minimum percentage of the dirty
         * data allowed in the pool, the I/O scheduler will limit the number of
         * concurrent operations to the minimum. As that threshold is crossed, the
         * number of concurrent operations issued increases linearly to the maximum at
         * the specified maximum percentage of the dirty data allowed in the pool.
         *
         * Ideally, the amount of dirty data on a busy pool will stay in the sloped
         * part of the function between zfs_vdev_async_write_active_min_dirty_percent
         * and zfs_vdev_async_write_active_max_dirty_percent. If it exceeds the
         * maximum percentage, this indicates that the rate of incoming data is
         * greater than the rate that the backend storage can handle. In this case, we
         * must further throttle incoming writes (see dmu_tx_delay() for details).
         */
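
The selection logic and the piece-wise linear function described in the
comment above can be summarized in two short sketches.  Both are
illustrative only -- the names, signatures, and bodies are assumptions,
not the actual vdev_queue.c code.  First, class selection: honor
per-class minimums in priority order, then per-class maximums (the
caller is assumed to have already checked the aggregate maximum):

    #include <stdint.h>

    typedef enum zio_class {
        ZIO_CLASS_SYNC_READ,    /* highest priority */
        ZIO_CLASS_SYNC_WRITE,
        ZIO_CLASS_ASYNC_READ,
        ZIO_CLASS_ASYNC_WRITE,
        ZIO_CLASS_SCRUB,        /* lowest priority */
        ZIO_CLASS_COUNT
    } zio_class_t;

    static int
    pick_class(const int active[], const int min_active[],
        const int max_active[], const int pending[])
    {
        int c;

        /* First pass: bring every class up to its minimum. */
        for (c = 0; c < ZIO_CLASS_COUNT; c++)
            if (pending[c] > 0 && active[c] < min_active[c])
                return (c);

        /* Second pass: fill toward the per-class maximums. */
        for (c = 0; c < ZIO_CLASS_COUNT; c++)
            if (pending[c] > 0 && active[c] < max_active[c])
                return (c);

        return (-1);    /* nothing eligible to issue */
    }

Second, the sloped function from the "Async Writes" diagram, as a
linear interpolation between the two active-count tunables:

    static int
    async_write_max_active(uint64_t dirty, uint64_t dirty_max,
        int min_pct, int max_pct, int min_active, int max_active)
    {
        uint64_t min_bytes = dirty_max * min_pct / 100;
        uint64_t max_bytes = dirty_max * max_pct / 100;

        if (dirty < min_bytes)
            return (min_active);    /* flat section at the minimum */
        if (dirty > max_bytes)
            return (max_active);    /* flat section at the maximum */

        /* Sloped section: interpolate between min and max. */
        return (min_active + (int)((dirty - min_bytes) *
            (uint64_t)(max_active - min_active) /
            (max_bytes - min_bytes)));
    }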




