svn commit: r324011 - in head: cddl/contrib/opensolaris/cmd/ztest sys/cddl/contrib/opensolaris/uts/common/fs/zfs sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys sys/cddl/contrib/opensolaris/uts/...
O. Hartmann
ohartmann at walstatt.org
Tue Sep 26 13:48:53 UTC 2017
On Tue, 26 Sep 2017 11:04:08 +0000 (UTC)
Andriy Gapon <avg at FreeBSD.org> wrote:
> Author: avg
> Date: Tue Sep 26 11:04:08 2017
> New Revision: 324011
> URL: https://svnweb.freebsd.org/changeset/base/324011
>
> Log:
> MFV r323535: 8585 improve batching done in zil_commit()
>
> FreeBSD notes:
> - this MFV reverts FreeBSD commit r314549 to make the merge easier
> - at present our emulation of cv_timedwait_hires is rather poor,
> so I elected to use cv_timedwait_sbt directly
> Please see the differential revision for details.
> Unfortunately, I did not get any positive reviews, so there could be
> bugs in the FreeBSD-specific piece of the merge.
> Hence, the long MFC timeout.
>
> illumos/illumos-gate at 1271e4b10dfaaed576c08a812f466f6e81370e5e
> https://github.com/illumos/illumos-gate/commit/1271e4b10dfaaed576c08a812f466f6e81370e5e
>
> https://www.illumos.org/issues/8585
> The current implementation of zil_commit() can introduce significant
> latency, beyond what is inherent due to the latency of the underlying
> storage. The additional latency comes from two main problems:
> 1. When there are outstanding ZIL blocks being written (i.e. there's
> already a "writer thread" in progress), then any new calls to
> zil_commit() will block waiting for the currently outstanding ZIL
> blocks to complete. The set of blocks written by each "writer thread"
> is called a "batch", and there can only ever be a single "batch" being
> written at a time. When a batch is being written, any new ZIL
> transactions will have to wait for the next batch to be written,
> which won't occur until the current batch finishes.
> As a result, the underlying storage may not be used as efficiently
> as possible. While "new" threads enter zil_commit() and are blocked
> waiting for the next batch, it's possible that the underlying
> storage isn't fully utilized by the current batch of ZIL blocks. In
> that case, it'd be better to allow these new threads to generate
> (and issue) a new ZIL block, such that it could be serviced by the
> underlying storage concurrently with the other ZIL blocks that are
> being serviced.
> 2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
> to complete, prior to zil_commit() returning. The size of any given
> batch is proportional to the number of ZIL transactions in the queue
> at the time that the batch starts processing the queue; which
> doesn't occur until the previous batch completes. Thus, if there's a
> lot of transactions in the queue, the batch could be composed of
> many ZIL blocks, and each call to zil_commit() will have to wait for
> all of these writes to complete (even if the thread calling
> zil_commit() only cared about one of the transactions in the batch).
>
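To make problem 2 concrete: the zil_commit_waiter_t added further down in this diff gives each zil_commit() caller its own lock, condition variable, and "done" flag, so a caller only has to wait for the lwb its own itx landed in. Below is a rough userland sketch of that pattern using pthreads. The names waiter_t, waiter_mark_done(), waiter_wait() and waiter_demo() are made up for illustration and are not part of the commit; the kernel code uses kmutex_t/kcondvar_t instead.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for zil_commit_waiter_t. */
typedef struct waiter {
	pthread_mutex_t	w_lock;	/* protects w_done */
	pthread_cond_t	w_cv;	/* signalled when w_done becomes true */
	bool		w_done;
} waiter_t;

static void
waiter_init(waiter_t *w)
{
	pthread_mutex_init(&w->w_lock, NULL);
	pthread_cond_init(&w->w_cv, NULL);
	w->w_done = false;
}

/* Mark the waiter done and wake every thread blocked on it. */
static void
waiter_mark_done(waiter_t *w)
{
	pthread_mutex_lock(&w->w_lock);
	w->w_done = true;
	pthread_cond_broadcast(&w->w_cv);
	pthread_mutex_unlock(&w->w_lock);
}

/* Block until this particular waiter is done; loop guards spurious wakeups. */
static void
waiter_wait(waiter_t *w)
{
	pthread_mutex_lock(&w->w_lock);
	while (!w->w_done)
		pthread_cond_wait(&w->w_cv, &w->w_lock);
	pthread_mutex_unlock(&w->w_lock);
}

/* Thread body standing in for the "write and flush completed" callback. */
static void *
completer(void *arg)
{
	waiter_mark_done(arg);
	return (NULL);
}

/* Returns 1 once the waiter has been completed by the other thread. */
static int
waiter_demo(void)
{
	waiter_t w;
	pthread_t t;

	waiter_init(&w);
	if (pthread_create(&t, NULL, completer, &w) != 0)
		return (-1);
	waiter_wait(&w);
	pthread_join(t, NULL);
	pthread_mutex_destroy(&w.w_lock);
	pthread_cond_destroy(&w.w_cv);
	return (w.w_done ? 1 : 0);
}
```

The point of the sketch is only the granularity: the wait is on one waiter object, not on a per-zilog batch counter.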
> Reviewed by: Brad Lewis <brad.lewis at delphix.com>
> Reviewed by: Matt Ahrens <mahrens at delphix.com>
> Reviewed by: George Wilson <george.wilson at delphix.com>
> Approved by: Dan McDonald <danmcd at joyent.com>
> Author: Prakash Surya <prakash.surya at delphix.com>
>
> MFC after: 1 month
> Differential Revision: https://reviews.freebsd.org/D12355
>
> Modified:
> head/cddl/contrib/opensolaris/cmd/ztest/ztest.c
> head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c
> head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h
> head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil.h
> head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h
> head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h
> head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c
> head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
> head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c
> head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
> head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
> head/sys/cddl/contrib/opensolaris/uts/common/sys/debug.h
> Directory Properties:
> head/cddl/contrib/opensolaris/ (props changed)
> head/sys/cddl/contrib/opensolaris/ (props changed)
>
> Modified: head/cddl/contrib/opensolaris/cmd/ztest/ztest.c
> ==============================================================================
> --- head/cddl/contrib/opensolaris/cmd/ztest/ztest.c Tue Sep 26 09:34:18 2017 (r324010)
> +++ head/cddl/contrib/opensolaris/cmd/ztest/ztest.c Tue Sep 26 11:04:08 2017 (r324011)
> @@ -1825,13 +1825,14 @@ ztest_get_done(zgd_t *zgd, int error)
> ztest_object_unlock(zd, object);
>
> if (error == 0 && zgd->zgd_bp)
> - zil_add_block(zgd->zgd_zilog, zgd->zgd_bp);
> + zil_lwb_add_block(zgd->zgd_lwb, zgd->zgd_bp);
>
> umem_free(zgd, sizeof (*zgd));
> }
>
> static int
> -ztest_get_data(void *arg, lr_write_t *lr, char *buf, zio_t *zio)
> +ztest_get_data(void *arg, lr_write_t *lr, char *buf, struct lwb *lwb,
> + zio_t *zio)
> {
> ztest_ds_t *zd = arg;
> objset_t *os = zd->zd_os;
> @@ -1845,6 +1846,10 @@ ztest_get_data(void *arg, lr_write_t *lr, char *buf, z
> zgd_t *zgd;
> int error;
>
> + ASSERT3P(lwb, !=, NULL);
> + ASSERT3P(zio, !=, NULL);
> + ASSERT3U(size, !=, 0);
> +
> ztest_object_lock(zd, object, RL_READER);
> error = dmu_bonus_hold(os, object, FTAG, &db);
> if (error) {
> @@ -1865,7 +1870,7 @@ ztest_get_data(void *arg, lr_write_t *lr, char *buf, z
> db = NULL;
>
> zgd = umem_zalloc(sizeof (*zgd), UMEM_NOFAIL);
> - zgd->zgd_zilog = zd->zd_zilog;
> + zgd->zgd_lwb = lwb;
> zgd->zgd_private = zd;
>
> if (buf != NULL) { /* immediate write */
>
> Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c
> ==============================================================================
> --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c Tue Sep 26 09:34:18 2017 (r324010)
> +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c Tue Sep 26 11:04:08 2017 (r324011)
> @@ -1728,6 +1728,13 @@ dmu_sync_late_arrival(zio_t *pio, objset_t *os, dmu_sy
> return (SET_ERROR(EIO));
> }
>
> + /*
> + * In order to prevent the zgd's lwb from being free'd prior to
> + * dmu_sync_late_arrival_done() being called, we have to ensure
> + * the lwb's "max txg" takes this tx's txg into account.
> + */
> + zil_lwb_add_txg(zgd->zgd_lwb, dmu_tx_get_txg(tx));
> +
> dsa = kmem_alloc(sizeof (dmu_sync_arg_t), KM_SLEEP);
> dsa->dsa_dr = NULL;
> dsa->dsa_done = done;
>
> Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h
> ==============================================================================
> --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h Tue Sep 26 09:34:18 2017 (r324010)
> +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h Tue Sep 26 11:04:08 2017 (r324011)
> @@ -920,7 +920,7 @@ uint64_t dmu_tx_get_txg(dmu_tx_t *tx);
> * {zfs,zvol,ztest}_get_done() args
> */
> typedef struct zgd {
> - struct zilog *zgd_zilog;
> + struct lwb *zgd_lwb;
> struct blkptr *zgd_bp;
> dmu_buf_t *zgd_db;
> struct rl *zgd_rl;
>
> Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil.h
> ==============================================================================
> --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil.h Tue Sep 26 09:34:18 2017 (r324010)
> +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil.h Tue Sep 26 11:04:08 2017 (r324011)
> @@ -40,6 +40,7 @@ extern "C" {
>
> struct dsl_pool;
> struct dsl_dataset;
> +struct lwb;
>
> /*
> * Intent log format:
> @@ -140,6 +141,7 @@ typedef enum zil_create {
> /*
> * Intent log transaction types and record structures
> */
> +#define TX_COMMIT 0 /* Commit marker (no on-disk state) */
> #define TX_CREATE 1 /* Create file */
> #define TX_MKDIR 2 /* Make directory */
> #define TX_MKXATTR 3 /* Make XATTR directory */
> @@ -388,7 +390,8 @@ typedef int zil_parse_blk_func_t(zilog_t *zilog, blkpt
> typedef int zil_parse_lr_func_t(zilog_t *zilog, lr_t *lr, void *arg,
> uint64_t txg);
> typedef int zil_replay_func_t();
> -typedef int zil_get_data_t(void *arg, lr_write_t *lr, char *dbuf, zio_t *zio);
> +typedef int zil_get_data_t(void *arg, lr_write_t *lr, char *dbuf,
> + struct lwb *lwb, zio_t *zio);
>
> extern int zil_parse(zilog_t *zilog, zil_parse_blk_func_t *parse_blk_func,
> zil_parse_lr_func_t *parse_lr_func, void *arg, uint64_t txg);
> @@ -427,7 +430,8 @@ extern void zil_clean(zilog_t *zilog, uint64_t synced_
> extern int zil_suspend(const char *osname, void **cookiep);
> extern void zil_resume(void *cookie);
>
> -extern void zil_add_block(zilog_t *zilog, const blkptr_t *bp);
> +extern void zil_lwb_add_block(struct lwb *lwb, const blkptr_t *bp);
> +extern void zil_lwb_add_txg(struct lwb *lwb, uint64_t txg);
> extern int zil_bp_tree_add(zilog_t *zilog, const blkptr_t *bp);
>
> extern void zil_set_sync(zilog_t *zilog, uint64_t syncval);
>
> Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h
> ==============================================================================
> --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h Tue Sep 26 09:34:18 2017 (r324010)
> +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h Tue Sep 26 11:04:08 2017 (r324011)
> @@ -20,7 +20,7 @@
> */
> /*
> * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
> - * Copyright (c) 2012 by Delphix. All rights reserved.
> + * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
> * Copyright (c) 2014 Integros [integros.com]
> */
>
> @@ -37,22 +37,76 @@ extern "C" {
> #endif
>
> /*
> - * Log write buffer.
> + * Possbile states for a given lwb structure. An lwb will start out in
> + * the "closed" state, and then transition to the "opened" state via a
> + * call to zil_lwb_write_open(). After the lwb is "open", it can
> + * transition into the "issued" state via zil_lwb_write_issue(). After
> + * the lwb's zio completes, and the vdev's are flushed, the lwb will
> + * transition into the "done" state via zil_lwb_write_done(), and the
> + * structure eventually freed.
> */
> +typedef enum {
> + LWB_STATE_CLOSED,
> + LWB_STATE_OPENED,
> + LWB_STATE_ISSUED,
> + LWB_STATE_DONE,
> + LWB_NUM_STATES
> +} lwb_state_t;
> +
> +/*
> + * Log write block (lwb)
> + *
> + * Prior to an lwb being issued to disk via zil_lwb_write_issue(), it
> + * will be protected by the zilog's "zl_writer_lock". Basically, prior
> + * to it being issued, it will only be accessed by the thread that's
> + * holding the "zl_writer_lock". After the lwb is issued, the zilog's
> + * "zl_lock" is used to protect the lwb against concurrent access.
> + */
> typedef struct lwb {
> zilog_t *lwb_zilog; /* back pointer to log struct */
> blkptr_t lwb_blk; /* on disk address of this log blk */
> boolean_t lwb_slog; /* lwb_blk is on SLOG device */
> int lwb_nused; /* # used bytes in buffer */
> int lwb_sz; /* size of block and buffer */
> + lwb_state_t lwb_state; /* the state of this lwb */
> char *lwb_buf; /* log write buffer */
> - zio_t *lwb_zio; /* zio for this buffer */
> + zio_t *lwb_write_zio; /* zio for the lwb buffer */
> + zio_t *lwb_root_zio; /* root zio for lwb write and flushes */
> dmu_tx_t *lwb_tx; /* tx for log block allocation */
> uint64_t lwb_max_txg; /* highest txg in this lwb */
> list_node_t lwb_node; /* zilog->zl_lwb_list linkage */
> + list_t lwb_waiters; /* list of zil_commit_waiter's */
> + avl_tree_t lwb_vdev_tree; /* vdevs to flush after lwb write */
> + kmutex_t lwb_vdev_lock; /* protects lwb_vdev_tree */
> + hrtime_t lwb_issued_timestamp; /* when was the lwb issued? */
> } lwb_t;
>
> /*
> + * ZIL commit waiter.
> + *
> + * This structure is allocated each time zil_commit() is called, and is
> + * used by zil_commit() to communicate with other parts of the ZIL, such
> + * that zil_commit() can know when it safe for it return. For more
> + * details, see the comment above zil_commit().
> + *
> + * The "zcw_lock" field is used to protect the commit waiter against
> + * concurrent access. This lock is often acquired while already holding
> + * the zilog's "zl_writer_lock" or "zl_lock"; see the functions
> + * zil_process_commit_list() and zil_lwb_flush_vdevs_done() as examples
> + * of this. Thus, one must be careful not to acquire the
> + * "zl_writer_lock" or "zl_lock" when already holding the "zcw_lock";
> + * e.g. see the zil_commit_waiter_timeout() function.
> + */
> +typedef struct zil_commit_waiter {
> + kcondvar_t zcw_cv; /* signalled when "done" */
> + kmutex_t zcw_lock; /* protects fields of this struct */
> + list_node_t zcw_node; /* linkage in lwb_t:lwb_waiter list */
> + lwb_t *zcw_lwb; /* back pointer to lwb when linked */
> + boolean_t zcw_done; /* B_TRUE when "done", else B_FALSE */
> + int zcw_zio_error; /* contains the zio io_error value */
> +} zil_commit_waiter_t;
> +
> +/*
> * Intent log transaction lists
> */
> typedef struct itxs {
> @@ -94,20 +148,20 @@ struct zilog {
> const zil_header_t *zl_header; /* log header buffer */
> objset_t *zl_os; /* object set we're logging */
> zil_get_data_t *zl_get_data; /* callback to get object content */
> - zio_t *zl_root_zio; /* log writer root zio */
> + lwb_t *zl_last_lwb_opened; /* most recent lwb opened */
> + hrtime_t zl_last_lwb_latency; /* zio latency of last lwb done */
> uint64_t zl_lr_seq; /* on-disk log record sequence number */
> uint64_t zl_commit_lr_seq; /* last committed on-disk lr seq */
> uint64_t zl_destroy_txg; /* txg of last zil_destroy() */
> uint64_t zl_replayed_seq[TXG_SIZE]; /* last replayed rec seq */
> uint64_t zl_replaying_seq; /* current replay seq number */
> uint32_t zl_suspend; /* log suspend count */
> - kcondvar_t zl_cv_writer; /* log writer thread completion */
> kcondvar_t zl_cv_suspend; /* log suspend completion */
> uint8_t zl_suspending; /* log is currently suspending */
> uint8_t zl_keep_first; /* keep first log block in destroy */
> uint8_t zl_replay; /* replaying records while set */
> uint8_t zl_stop_sync; /* for debugging */
> - uint8_t zl_writer; /* boolean: write setup in progress */
> + kmutex_t zl_writer_lock; /* single writer, per ZIL, at a time */
> uint8_t zl_logbias; /* latency or throughput */
> uint8_t zl_sync; /* synchronous or asynchronous */
> int zl_parse_error; /* last zil_parse() error */
> @@ -115,15 +169,10 @@ struct zilog {
> uint64_t zl_parse_lr_seq; /* highest lr seq on last parse */
> uint64_t zl_parse_blk_count; /* number of blocks parsed */
> uint64_t zl_parse_lr_count; /* number of log records parsed */
> - uint64_t zl_next_batch; /* next batch number */
> - uint64_t zl_com_batch; /* committed batch number */
> - kcondvar_t zl_cv_batch[2]; /* batch condition variables */
> itxg_t zl_itxg[TXG_SIZE]; /* intent log txg chains */
> list_t zl_itx_commit_list; /* itx list to be committed */
> uint64_t zl_cur_used; /* current commit log size used */
> list_t zl_lwb_list; /* in-flight log write list */
> - kmutex_t zl_vdev_lock; /* protects zl_vdev_tree */
> - avl_tree_t zl_vdev_tree; /* vdevs to flush in zil_commit() */
> avl_tree_t zl_bp_tree; /* track bps during log parse */
> clock_t zl_replay_time; /* lbolt of when replay started */
> uint64_t zl_replay_blks; /* number of log blocks replayed */
> @@ -131,6 +180,7 @@ struct zilog {
> uint_t zl_prev_blks[ZIL_PREV_BLKS]; /* size - sector rounded */
> uint_t zl_prev_rotor; /* rotor for zl_prev[] */
> txg_node_t zl_dirty_link; /* protected by dp_dirty_zilogs list */
> + uint64_t zl_dirty_max_txg; /* highest txg used to dirty zilog */
> };
>
> typedef struct zil_bp_node {
>
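The lwb_state_t comment in the zil_impl.h hunk above describes a strictly linear lifecycle: closed, then opened, then issued, then done. A small sketch of that ordering as a transition check follows. The enum values are copied from the diff; lwb_transition_ok() and lwb_walk_lifecycle() are hypothetical helpers written for this illustration and do not exist in the commit. Freeing is not modelled as a state here; note that zil_free_lwb() further down does accept an lwb that is still "opened" and cancels its zios.

```c
#include <stdbool.h>

/* Values as introduced by the diff. */
typedef enum {
	LWB_STATE_CLOSED,
	LWB_STATE_OPENED,
	LWB_STATE_ISSUED,
	LWB_STATE_DONE,
	LWB_NUM_STATES
} lwb_state_t;

/*
 * Hypothetical helper: a transition is valid only if it moves exactly
 * one step forward and does not start from the final state.
 */
static bool
lwb_transition_ok(lwb_state_t from, lwb_state_t to)
{
	return (from < LWB_STATE_DONE && (int)to == (int)from + 1);
}

/*
 * Hypothetical helper: walk closed -> opened -> issued -> done and
 * return the number of valid transitions taken.
 */
static int
lwb_walk_lifecycle(void)
{
	lwb_state_t s = LWB_STATE_CLOSED;
	int steps = 0;

	while (s != LWB_STATE_DONE) {
		lwb_state_t next = (lwb_state_t)(s + 1);

		if (!lwb_transition_ok(s, next))
			return (-1);
		s = next;
		steps++;
	}
	return (steps);
}
```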
> Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h
> ==============================================================================
> --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h Tue Sep 26 09:34:18 2017 (r324010)
> +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h Tue Sep 26 11:04:08 2017 (r324011)
> @@ -593,6 +593,7 @@ extern enum zio_checksum zio_checksum_dedup_select(spa
> extern enum zio_compress zio_compress_select(spa_t *spa,
> enum zio_compress child, enum zio_compress parent);
>
> +extern void zio_cancel(zio_t *zio);
> extern void zio_suspend(spa_t *spa, zio_t *zio);
> extern int zio_resume(spa_t *spa);
> extern void zio_resume_wait(spa_t *spa);
>
> Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c
> ==============================================================================
> --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c Tue Sep 26 09:34:18 2017 (r324010)
> +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c Tue Sep 26 11:04:08 2017 (r324011)
> @@ -163,7 +163,7 @@ txg_fini(dsl_pool_t *dp)
> tx_state_t *tx = &dp->dp_tx;
> int c;
>
> - ASSERT(tx->tx_threads == 0);
> + ASSERT0(tx->tx_threads);
>
> mutex_destroy(&tx->tx_sync_lock);
>
> @@ -204,7 +204,7 @@ txg_sync_start(dsl_pool_t *dp)
>
> dprintf("pool %p\n", dp);
>
> - ASSERT(tx->tx_threads == 0);
> + ASSERT0(tx->tx_threads);
>
> tx->tx_threads = 2;
>
> @@ -265,7 +265,7 @@ txg_sync_stop(dsl_pool_t *dp)
> /*
> * Finish off any work in progress.
> */
> - ASSERT(tx->tx_threads == 2);
> + ASSERT3U(tx->tx_threads, ==, 2);
>
> /*
> * We need to ensure that we've vacated the deferred space_maps.
> @@ -277,7 +277,7 @@ txg_sync_stop(dsl_pool_t *dp)
> */
> mutex_enter(&tx->tx_sync_lock);
>
> - ASSERT(tx->tx_threads == 2);
> + ASSERT3U(tx->tx_threads, ==, 2);
>
> tx->tx_exiting = 1;
>
> @@ -616,7 +616,7 @@ txg_wait_synced(dsl_pool_t *dp, uint64_t txg)
> ASSERT(!dsl_pool_config_held(dp));
>
> mutex_enter(&tx->tx_sync_lock);
> - ASSERT(tx->tx_threads == 2);
> + ASSERT3U(tx->tx_threads, ==, 2);
> if (txg == 0)
> txg = tx->tx_open_txg + TXG_DEFER_SIZE;
> if (tx->tx_sync_txg_waiting < txg)
> @@ -641,7 +641,7 @@ txg_wait_open(dsl_pool_t *dp, uint64_t txg)
> ASSERT(!dsl_pool_config_held(dp));
>
> mutex_enter(&tx->tx_sync_lock);
> - ASSERT(tx->tx_threads == 2);
> + ASSERT3U(tx->tx_threads, ==, 2);
> if (txg == 0)
> txg = tx->tx_open_txg + 1;
> if (tx->tx_quiesce_txg_waiting < txg)
>
> Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
> ==============================================================================
> --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c Tue Sep 26 09:34:18 2017 (r324010)
> +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c Tue Sep 26 11:04:08 2017 (r324011)
> @@ -76,6 +76,7 @@
> #include <sys/acl.h>
> #include <sys/vmmeter.h>
> #include <vm/vm_param.h>
> +#include <sys/zil.h>
>
> /*
> * Programming rules.
> @@ -1276,7 +1277,7 @@ zfs_get_done(zgd_t *zgd, int error)
> VN_RELE_ASYNC(ZTOV(zp), dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
>
> if (error == 0 && zgd->zgd_bp)
> - zil_add_block(zgd->zgd_zilog, zgd->zgd_bp);
> + zil_lwb_add_block(zgd->zgd_lwb, zgd->zgd_bp);
>
> kmem_free(zgd, sizeof (zgd_t));
> }
> @@ -1289,7 +1290,7 @@ static int zil_fault_io = 0;
> * Get data to generate a TX_WRITE intent log record.
> */
> int
> -zfs_get_data(void *arg, lr_write_t *lr, char *buf, zio_t *zio)
> +zfs_get_data(void *arg, lr_write_t *lr, char *buf, struct lwb *lwb, zio_t *zio)
> {
> zfsvfs_t *zfsvfs = arg;
> objset_t *os = zfsvfs->z_os;
> @@ -1301,8 +1302,9 @@ zfs_get_data(void *arg, lr_write_t *lr, char *buf, zio
> zgd_t *zgd;
> int error = 0;
>
> - ASSERT(zio != NULL);
> - ASSERT(size != 0);
> + ASSERT3P(lwb, !=, NULL);
> + ASSERT3P(zio, !=, NULL);
> + ASSERT3U(size, !=, 0);
>
> /*
> * Nothing to do if the file has been removed
> @@ -1320,7 +1322,7 @@ zfs_get_data(void *arg, lr_write_t *lr, char *buf, zio
> }
>
> zgd = (zgd_t *)kmem_zalloc(sizeof (zgd_t), KM_SLEEP);
> - zgd->zgd_zilog = zfsvfs->z_log;
> + zgd->zgd_lwb = lwb;
> zgd->zgd_private = zp;
>
> /*
>
> Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c
> ==============================================================================
> --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c Tue Sep 26 09:34:18 2017 (r324010)
> +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c Tue Sep 26 11:04:08 2017 (r324011)
> @@ -42,32 +42,53 @@
> #include <sys/abd.h>
>
> /*
> - * The zfs intent log (ZIL) saves transaction records of system calls
> - * that change the file system in memory with enough information
> - * to be able to replay them. These are stored in memory until
> - * either the DMU transaction group (txg) commits them to the stable pool
> - * and they can be discarded, or they are flushed to the stable log
> - * (also in the pool) due to a fsync, O_DSYNC or other synchronous
> - * requirement. In the event of a panic or power fail then those log
> - * records (transactions) are replayed.
> + * The ZFS Intent Log (ZIL) saves "transaction records" (itxs) of system
> + * calls that change the file system. Each itx has enough information to
> + * be able to replay them after a system crash, power loss, or
> + * equivalent failure mode. These are stored in memory until either:
> *
> - * There is one ZIL per file system. Its on-disk (pool) format consists
> - * of 3 parts:
> + * 1. they are committed to the pool by the DMU transaction group
> + * (txg), at which point they can be discarded; or
> + * 2. they are committed to the on-disk ZIL for the dataset being
> + * modified (e.g. due to an fsync, O_DSYNC, or other synchronous
> + * requirement).
> *
> - * - ZIL header
> - * - ZIL blocks
> - * - ZIL records
> + * In the event of a crash or power loss, the itxs contained by each
> + * dataset's on-disk ZIL will be replayed when that dataset is first
> + * instantianted (e.g. if the dataset is a normal fileystem, when it is
> + * first mounted).
> *
> - * A log record holds a system call transaction. Log blocks can
> - * hold many log records and the blocks are chained together.
> - * Each ZIL block contains a block pointer (blkptr_t) to the next
> - * ZIL block in the chain. The ZIL header points to the first
> - * block in the chain. Note there is not a fixed place in the pool
> - * to hold blocks. They are dynamically allocated and freed as
> - * needed from the blocks available. Figure X shows the ZIL structure:
> + * As hinted at above, there is one ZIL per dataset (both the in-memory
> + * representation, and the on-disk representation). The on-disk format
> + * consists of 3 parts:
> + *
> + * - a single, per-dataset, ZIL header; which points to a chain of
> + * - zero or more ZIL blocks; each of which contains
> + * - zero or more ZIL records
> + *
> + * A ZIL record holds the information necessary to replay a single
> + * system call transaction. A ZIL block can hold many ZIL records, and
> + * the blocks are chained together, similarly to a singly linked list.
> + *
> + * Each ZIL block contains a block pointer (blkptr_t) to the next ZIL
> + * block in the chain, and the ZIL header points to the first block in
> + * the chain.
> + *
> + * Note, there is not a fixed place in the pool to hold these ZIL
> + * blocks; they are dynamically allocated and freed as needed from the
> + * blocks available on the pool, though they can be preferentially
> + * allocated from a dedicated "log" vdev.
> */
>
> /*
> + * This controls the amount of time that a ZIL block (lwb) will remain
> + * "open" when it isn't "full", and it has a thread waiting for it to be
> + * committed to stable storage. Please refer to the zil_commit_waiter()
> + * function (and the comments within it) for more details.
> + */
> +int zfs_commit_timeout_pct = 5;
> +
> +/*
> * Disable intent logging replay. This global ZIL switch affects all pools.
> */
> int zil_replay_disable = 0;
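The comment on zfs_commit_timeout_pct says it controls how long a non-full lwb stays open, but the quoted hunk does not show what the percentage is taken of. A minimal sketch follows, assuming it is a percentage of the previous lwb's measured latency (the new zl_last_lwb_latency field in zil_impl.h above). That interpretation, the helper name commit_timeout_ns(), the hrtime_ns_t typedef, and the clamp to a minimum of 1 percent are all assumptions made for illustration, not something confirmed by the part of the diff quoted here.

```c
#include <stdint.h>

/* Hypothetical stand-in for hrtime_t: a signed nanosecond count. */
typedef int64_t hrtime_ns_t;

/*
 * Assumed computation: wait for timeout_pct percent of the latency of
 * the last completed lwb before issuing a partially filled one.
 * A percentage below 1 is clamped to 1 (an assumption) so that the
 * wait never collapses to zero for a non-zero latency.
 */
static hrtime_ns_t
commit_timeout_ns(hrtime_ns_t last_lwb_latency_ns, int timeout_pct)
{
	int pct = timeout_pct < 1 ? 1 : timeout_pct;

	return ((last_lwb_latency_ns * pct) / 100);
}
```

With the default of 5 and a previous lwb that took 1 ms, this would keep the next lwb open for at most 50 us before issuing it.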
> @@ -98,6 +119,7 @@ SYSCTL_QUAD(_vfs_zfs, OID_AUTO, zil_slog_bulk, CTLFLAG
> &zil_slog_bulk, 0, "Maximal SLOG commit size with sync priority");
>
> static kmem_cache_t *zil_lwb_cache;
> +static kmem_cache_t *zil_zcw_cache;
>
> #define LWB_EMPTY(lwb) ((BP_GET_LSIZE(&lwb->lwb_blk) - \
> sizeof (zil_chain_t)) == (lwb->lwb_sz - lwb->lwb_nused))
> @@ -445,6 +467,20 @@ zil_free_log_record(zilog_t *zilog, lr_t *lrc, void *t
> return (0);
> }
>
> +static int
> +zil_lwb_vdev_compare(const void *x1, const void *x2)
> +{
> + const uint64_t v1 = ((zil_vdev_node_t *)x1)->zv_vdev;
> + const uint64_t v2 = ((zil_vdev_node_t *)x2)->zv_vdev;
> +
> + if (v1 < v2)
> + return (-1);
> + if (v1 > v2)
> + return (1);
> +
> + return (0);
> +}
> +
> static lwb_t *
> zil_alloc_lwb(zilog_t *zilog, blkptr_t *bp, boolean_t slog, uint64_t txg)
> {
> @@ -454,10 +490,13 @@ zil_alloc_lwb(zilog_t *zilog, blkptr_t *bp, boolean_t
> lwb->lwb_zilog = zilog;
> lwb->lwb_blk = *bp;
> lwb->lwb_slog = slog;
> + lwb->lwb_state = LWB_STATE_CLOSED;
> lwb->lwb_buf = zio_buf_alloc(BP_GET_LSIZE(bp));
> lwb->lwb_max_txg = txg;
> - lwb->lwb_zio = NULL;
> + lwb->lwb_write_zio = NULL;
> + lwb->lwb_root_zio = NULL;
> lwb->lwb_tx = NULL;
> + lwb->lwb_issued_timestamp = 0;
> if (BP_GET_CHECKSUM(bp) == ZIO_CHECKSUM_ZILOG2) {
> lwb->lwb_nused = sizeof (zil_chain_t);
> lwb->lwb_sz = BP_GET_LSIZE(bp);
> @@ -470,9 +509,54 @@ zil_alloc_lwb(zilog_t *zilog, blkptr_t *bp, boolean_t
> list_insert_tail(&zilog->zl_lwb_list, lwb);
> mutex_exit(&zilog->zl_lock);
>
> + ASSERT(!MUTEX_HELD(&lwb->lwb_vdev_lock));
> + ASSERT(avl_is_empty(&lwb->lwb_vdev_tree));
> + ASSERT(list_is_empty(&lwb->lwb_waiters));
> +
> return (lwb);
> }
>
> +static void
> +zil_free_lwb(zilog_t *zilog, lwb_t *lwb)
> +{
> + ASSERT(MUTEX_HELD(&zilog->zl_lock));
> + ASSERT(!MUTEX_HELD(&lwb->lwb_vdev_lock));
> + ASSERT(list_is_empty(&lwb->lwb_waiters));
> +
> + if (lwb->lwb_state == LWB_STATE_OPENED) {
> + avl_tree_t *t = &lwb->lwb_vdev_tree;
> + void *cookie = NULL;
> + zil_vdev_node_t *zv;
> +
> + while ((zv = avl_destroy_nodes(t, &cookie)) != NULL)
> + kmem_free(zv, sizeof (*zv));
> +
> + ASSERT3P(lwb->lwb_root_zio, !=, NULL);
> + ASSERT3P(lwb->lwb_write_zio, !=, NULL);
> +
> + zio_cancel(lwb->lwb_root_zio);
> + zio_cancel(lwb->lwb_write_zio);
> +
> + lwb->lwb_root_zio = NULL;
> + lwb->lwb_write_zio = NULL;
> + } else {
> + ASSERT3S(lwb->lwb_state, !=, LWB_STATE_ISSUED);
> + }
> +
> + ASSERT(avl_is_empty(&lwb->lwb_vdev_tree));
> + ASSERT3P(lwb->lwb_write_zio, ==, NULL);
> + ASSERT3P(lwb->lwb_root_zio, ==, NULL);
> +
> + /*
> + * Clear the zilog's field to indicate this lwb is no longer
> + * valid, and prevent use-after-free errors.
> + */
> + if (zilog->zl_last_lwb_opened == lwb)
> + zilog->zl_last_lwb_opened = NULL;
> +
> + kmem_cache_free(zil_lwb_cache, lwb);
> +}
> +
> /*
> * Called when we create in-memory log transactions so that we know
> * to cleanup the itxs at the end of spa_sync().
> @@ -483,12 +567,16 @@ zilog_dirty(zilog_t *zilog, uint64_t txg)
> dsl_pool_t *dp = zilog->zl_dmu_pool;
> dsl_dataset_t *ds = dmu_objset_ds(zilog->zl_os);
>
> + ASSERT(spa_writeable(zilog->zl_spa));
> +
> if (ds->ds_is_snapshot)
> panic("dirtying snapshot!");
>
> if (txg_list_add(&dp->dp_dirty_zilogs, zilog, txg)) {
> /* up the hold count until we can be written out */
> dmu_buf_add_ref(ds->ds_dbuf, zilog);
> +
> + zilog->zl_dirty_max_txg = MAX(txg, zilog->zl_dirty_max_txg);
> }
> }
>
> @@ -556,7 +644,7 @@ zil_create(zilog_t *zilog)
> */
> if (BP_IS_HOLE(&blk) || BP_SHOULD_BYTESWAP(&blk)) {
> tx = dmu_tx_create(zilog->zl_os);
> - VERIFY(dmu_tx_assign(tx, TXG_WAIT) == 0);
> + VERIFY0(dmu_tx_assign(tx, TXG_WAIT));
> dsl_dataset_dirty(dmu_objset_ds(zilog->zl_os), tx);
> txg = dmu_tx_get_txg(tx);
>
> @@ -573,7 +661,7 @@ zil_create(zilog_t *zilog)
> }
>
> /*
> - * Allocate a log write buffer (lwb) for the first log block.
> + * Allocate a log write block (lwb) for the first log block.
> */
> if (error == 0)
> lwb = zil_alloc_lwb(zilog, &blk, slog, txg);
> @@ -594,13 +682,13 @@ zil_create(zilog_t *zilog)
> }
>
> /*
> - * In one tx, free all log blocks and clear the log header.
> - * If keep_first is set, then we're replaying a log with no content.
> - * We want to keep the first block, however, so that the first
> - * synchronous transaction doesn't require a txg_wait_synced()
> - * in zil_create(). We don't need to txg_wait_synced() here either
> - * when keep_first is set, because both zil_create() and zil_destroy()
> - * will wait for any in-progress destroys to complete.
> + * In one tx, free all log blocks and clear the log header. If keep_first
> + * is set, then we're replaying a log with no content. We want to keep the
> + * first block, however, so that the first synchronous transaction doesn't
> + * require a txg_wait_synced() in zil_create(). We don't need to
> + * txg_wait_synced() here either when keep_first is set, because both
> + * zil_create() and zil_destroy() will wait for any in-progress destroys
> + * to complete.
> */
> void
> zil_destroy(zilog_t *zilog, boolean_t keep_first)
> @@ -621,7 +709,7 @@ zil_destroy(zilog_t *zilog, boolean_t keep_first)
> return;
>
> tx = dmu_tx_create(zilog->zl_os);
> - VERIFY(dmu_tx_assign(tx, TXG_WAIT) == 0);
> + VERIFY0(dmu_tx_assign(tx, TXG_WAIT));
> dsl_dataset_dirty(dmu_objset_ds(zilog->zl_os), tx);
> txg = dmu_tx_get_txg(tx);
>
> @@ -638,8 +726,8 @@ zil_destroy(zilog_t *zilog, boolean_t keep_first)
> list_remove(&zilog->zl_lwb_list, lwb);
> if (lwb->lwb_buf != NULL)
> zio_buf_free(lwb->lwb_buf, lwb->lwb_sz);
> - zio_free_zil(zilog->zl_spa, txg, &lwb->lwb_blk);
> - kmem_cache_free(zil_lwb_cache, lwb);
> + zio_free(zilog->zl_spa, txg, &lwb->lwb_blk);
> + zil_free_lwb(zilog, lwb);
> }
> } else if (!keep_first) {
> zil_destroy_sync(zilog, tx);
> @@ -777,24 +865,64 @@ zil_check_log_chain(dsl_pool_t *dp, dsl_dataset_t *ds,
> return ((error == ECKSUM || error == ENOENT) ? 0 : error);
> }
>
> -static int
> -zil_vdev_compare(const void *x1, const void *x2)
> +/*
> + * When an itx is "skipped", this function is used to properly mark the
> + * waiter as "done, and signal any thread(s) waiting on it. An itx can
> + * be skipped (and not committed to an lwb) for a variety of reasons,
> + * one of them being that the itx was committed via spa_sync(), prior to
> + * it being committed to an lwb; this can happen if a thread calling
> + * zil_commit() is racing with spa_sync().
> + */
> +static void
> +zil_commit_waiter_skip(zil_commit_waiter_t *zcw)
> {
> - const uint64_t v1 = ((zil_vdev_node_t *)x1)->zv_vdev;
> - const uint64_t v2 = ((zil_vdev_node_t *)x2)->zv_vdev;
> + mutex_enter(&zcw->zcw_lock);
> + ASSERT3B(zcw->zcw_done, ==, B_FALSE);
> + zcw->zcw_done = B_TRUE;
> + cv_broadcast(&zcw->zcw_cv);
> + mutex_exit(&zcw->zcw_lock);
> +}
>
> - if (v1 < v2)
> - return (-1);
> - if (v1 > v2)
> - return (1);
> +/*
> + * This function is used when the given waiter is to be linked into an
> + * lwb's "lwb_waiter" list; i.e. when the itx is committed to the lwb.
> + * At this point, the waiter will no longer be referenced by the itx,
> + * and instead, will be referenced by the lwb.
> + */
> +static void
> +zil_commit_waiter_link_lwb(zil_commit_waiter_t *zcw, lwb_t *lwb)
> +{
> + mutex_enter(&zcw->zcw_lock);
> + ASSERT(!list_link_active(&zcw->zcw_node));
> + ASSERT3P(zcw->zcw_lwb, ==, NULL);
> + ASSERT3P(lwb, !=, NULL);
> + ASSERT(lwb->lwb_state == LWB_STATE_OPENED ||
> + lwb->lwb_state == LWB_STATE_ISSUED);
>
> - return (0);
> + list_insert_tail(&lwb->lwb_waiters, zcw);
> + zcw->zcw_lwb = lwb;
> + mutex_exit(&zcw->zcw_lock);
> }
>
> +/*
> + * This function is used when zio_alloc_zil() fails to allocate a ZIL
> + * block, and the given waiter must be linked to the "nolwb waiters"
> + * list inside of zil_process_commit_list().
> + */
> +static void
> +zil_commit_waiter_link_nolwb(zil_commit_waiter_t *zcw, list_t *nolwb)
> +{
> + mutex_enter(&zcw->zcw_lock);
> + ASSERT(!list_link_active(&zcw->zcw_node));
> + ASSERT3P(zcw->zcw_lwb, ==, NULL);
> + list_insert_tail(nolwb, zcw);
> + mutex_exit(&zcw->zcw_lock);
> +}
> +
> void
> -zil_add_block(zilog_t *zilog, const blkptr_t *bp)
> +zil_lwb_add_block(lwb_t *lwb, const blkptr_t *bp)
> {
> - avl_tree_t *t = &zilog->zl_vdev_tree;
> + avl_tree_t *t = &lwb->lwb_vdev_tree;
> avl_index_t where;
> zil_vdev_node_t *zv, zvsearch;
> int ndvas = BP_GET_NDVAS(bp);
> @@ -803,14 +931,7 @@ zil_add_block(zilog_t *zilog, const blkptr_t *bp)
> if (zfs_nocacheflush)
> return;
>
> - ASSERT(zilog->zl_writer);
> -
> - /*
> - * Even though we're zl_writer, we still need a lock because the
> - * zl_get_data() callbacks may have dmu_sync() done callbacks
> - * that will run concurrently.
> - */
> - mutex_enter(&zilog->zl_vdev_lock);
> + mutex_enter(&lwb->lwb_vdev_lock);
> for (i = 0; i < ndvas; i++) {
> zvsearch.zv_vdev = DVA_GET_VDEV(&bp->blk_dva[i]);
> if (avl_find(t, &zvsearch, &where) == NULL) {
> @@ -819,59 +940,117 @@ zil_add_block(zilog_t *zilog, const blkptr_t *bp)
> avl_insert(t, zv, where);
> }
> }
> - mutex_exit(&zilog->zl_vdev_lock);
> + mutex_exit(&lwb->lwb_vdev_lock);
> }
>
> +void
> +zil_lwb_add_txg(lwb_t *lwb, uint64_t txg)
> +{
> + lwb->lwb_max_txg = MAX(lwb->lwb_max_txg, txg);
> +}
> +
> +/*
> + * This function is a called after all VDEVs associated with a given lwb
> + * write have completed their DKIOCFLUSHWRITECACHE command; or as soon
> + * as the lwb write completes, if "zfs_nocacheflush" is set.
> + *
> + * The intention is for this function to be called as soon as the
> + * contents of an lwb are considered "stable" on disk, and will survive
> + * any sudden loss of power. At this point, any threads waiting for the
> + * lwb to reach this state are signalled, and the "waiter" structures
> + * are marked "done".
> + */
> static void
> -zil_flush_vdevs(zilog_t *zilog)
> +zil_lwb_flush_vdevs_done(zio_t *zio)
> {
> - spa_t *spa = zilog->zl_spa;
> - avl_tree_t *t = &zilog->zl_vdev_tree;
> - void *cookie = NULL;
> - zil_vdev_node_t *zv;
> - zio_t *zio = NULL;
> + lwb_t *lwb = zio->io_private;
> + zilog_t *zilog = lwb->lwb_zilog;
> + dmu_tx_t *tx = lwb->lwb_tx;
> + zil_commit_waiter_t *zcw;
>
> - ASSERT(zilog->zl_writer);
> + spa_config_exit(zilog->zl_spa, SCL_STATE, lwb);
>
> + zio_buf_free(lwb->lwb_buf, lwb->lwb_sz);
> +
> + mutex_enter(&zilog->zl_lock);
> +
> /*
> - * We don't need zl_vdev_lock here because we're the zl_writer,
> - * and all zl_get_data() callbacks are done.
> + * Ensure the lwb buffer pointer is cleared before releasing the
> + * txg. If we have had an allocation failure and the txg is
> + * waiting to sync then we want zil_sync() to remove the lwb so
> + * that it's not picked up as the next new one in
> + * zil_process_commit_list(). zil_sync() will only remove the
> + * lwb if lwb_buf is null.
> */
> - if (avl_numnodes(t) == 0)
> - return;
> + lwb->lwb_buf = NULL;
> + lwb->lwb_tx = NULL;
>
> - spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
> + ASSERT3U(lwb->lwb_issued_timestamp, >, 0);
> + zilog->zl_last_lwb_latency = gethrtime() - lwb->lwb_issued_timestamp;
>
> - while ((zv = avl_destroy_nodes(t, &cookie)) != NULL) {
> - vdev_t *vd = vdev_lookup_top(spa, zv->zv_vdev);
> - if (vd != NULL && !vd->vdev_nowritecache) {
> - if (zio == NULL)
> - zio = zio_root(spa, NULL, NULL, ZIO_FLAG_CANFAIL);
> - zio_flush(zio, vd);
> - }
> - kmem_free(zv, sizeof (*zv));
> + lwb->lwb_root_zio = NULL;
> + lwb->lwb_state = LWB_STATE_DONE;
> +
> + if (zilog->zl_last_lwb_opened == lwb) {
> + /*
> + * Remember the highest committed log sequence number
> + * for ztest. We only update this value when all the log
> + * writes succeeded, because ztest wants to ASSERT that
> + * it got the whole log chain.
> + */
> + zilog->zl_commit_lr_seq = zilog->zl_lr_seq;
> }
>
> + while ((zcw = list_head(&lwb->lwb_waiters)) != NULL) {
> + mutex_enter(&zcw->zcw_lock);
> +
> + ASSERT(list_link_active(&zcw->zcw_node));
> + list_remove(&lwb->lwb_waiters, zcw);
> +
> + ASSERT3P(zcw->zcw_lwb, ==, lwb);
> + zcw->zcw_lwb = NULL;
> +
> + zcw->zcw_zio_error = zio->io_error;
> +
> + ASSERT3B(zcw->zcw_done, ==, B_FALSE);
> + zcw->zcw_done = B_TRUE;
> + cv_broadcast(&zcw->zcw_cv);
> +
> + mutex_exit(&zcw->zcw_lock);
> + }
> +
> + mutex_exit(&zilog->zl_lock);
> +
> /*
> - * Wait for all the flushes to complete. Not all devices actually
> - * support the DKIOCFLUSHWRITECACHE ioctl, so it's OK if it fails.
> + * Now that we've written this log block, we have a stable pointer
> + * to the next block in the chain, so it's OK to let the txg in
> + * which we allocated the next block sync.
> */
> - if (zio)
> - (void) zio_wait(zio);
> -
> - spa_config_exit(spa, SCL_STATE, FTAG);
> + dmu_tx_commit(tx);
> }
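
For readers following the waiter handling above, here is a minimal userland sketch of the same pattern: a per-waiter lock, condition variable and "done" flag, with a completion routine that drains a list and wakes each waiter. It is only an illustration, assuming POSIX threads in place of the kernel primitives. The names waiter_t, waiter_init(), waiters_complete() and waiter_wait() are hypothetical and do not appear in the commit.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userland stand-in for zil_commit_waiter_t. */
typedef struct waiter {
	pthread_mutex_t	w_lock;
	pthread_cond_t	w_cv;
	bool		w_done;		/* set once the write is stable */
	int		w_error;	/* error recorded at completion */
	struct waiter	*w_next;	/* simple singly linked waiter list */
} waiter_t;

static void
waiter_init(waiter_t *w)
{
	pthread_mutex_init(&w->w_lock, NULL);
	pthread_cond_init(&w->w_cv, NULL);
	w->w_done = false;
	w->w_error = 0;
	w->w_next = NULL;
}

/*
 * Unlink every waiter from the list, record the error, mark it done and
 * wake it. Returns the number of waiters woken. The caller is assumed
 * to serialize access to the list itself.
 */
static int
waiters_complete(waiter_t **head, int error)
{
	waiter_t *w;
	int woken = 0;

	while ((w = *head) != NULL) {
		pthread_mutex_lock(&w->w_lock);
		*head = w->w_next;
		w->w_next = NULL;
		assert(!w->w_done);
		w->w_error = error;
		w->w_done = true;
		pthread_cond_broadcast(&w->w_cv);
		pthread_mutex_unlock(&w->w_lock);
		woken++;
	}
	return (woken);
}

/* Block until the waiter is marked done; return the recorded error. */
static int
waiter_wait(waiter_t *w)
{
	int error;

	pthread_mutex_lock(&w->w_lock);
	while (!w->w_done)
		pthread_cond_wait(&w->w_cv, &w->w_lock);
	error = w->w_error;
	pthread_mutex_unlock(&w->w_lock);
	return (error);
}
```

The flag is tested in a loop under the lock, so a waiter that arrives after the broadcast still sees w_done set and does not sleep.
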
>
> /*
> - * Function called when a log block write completes
> + * This is called when an lwb write completes. This means, this specific
> + * lwb was written to disk, and all dependent lwb have also been
> + * written to disk.
> + *
> + * At this point, a DKIOCFLUSHWRITECACHE command hasn't been issued to
> + * the VDEVs involved in writing out this specific lwb. The lwb will be
> + * "done" once zil_lwb_flush_vdevs_done() is called, which occurs in the
> + * zio completion callback for the lwb's root zio.
> */
> static void
> zil_lwb_write_done(zio_t *zio)
> {
> lwb_t *lwb = zio->io_private;
> + spa_t *spa = zio->io_spa;
> zilog_t *zilog = lwb->lwb_zilog;
> - dmu_tx_t *tx = lwb->lwb_tx;
> + avl_tree_t *t = &lwb->lwb_vdev_tree;
> + void *cookie = NULL;
> + zil_vdev_node_t *zv;
>
> + ASSERT3S(spa_config_held(spa, SCL_STATE, RW_READER), !=, 0);
> +
> ASSERT(BP_GET_COMPRESS(zio->io_bp) == ZIO_COMPRESS_OFF);
> ASSERT(BP_GET_TYPE(zio->io_bp) == DMU_OT_INTENT_LOG);
> ASSERT(BP_GET_LEVEL(zio->io_bp) == 0);
> @@ -880,58 +1059,115 @@ zil_lwb_write_done(zio_t *zio)
> ASSERT(!BP_IS_HOLE(zio->io_bp));
> ASSERT(BP_GET_FILL(zio->io_bp) == 0);
>
> - /*
> - * Ensure the lwb buffer pointer is cleared before releasing
> - * the txg. If we have had an allocation failure and
> - * the txg is waiting to sync then we want want zil_sync()
> - * to remove the lwb so that it's not picked up as the next new
> - * one in zil_commit_writer(). zil_sync() will only remove
> - * the lwb if lwb_buf is null.
> - */
> abd_put(zio->io_abd);
> - zio_buf_free(lwb->lwb_buf, lwb->lwb_sz);
> +
> + ASSERT3S(lwb->lwb_state, ==, LWB_STATE_ISSUED);
> +
> mutex_enter(&zilog->zl_lock);
> - lwb->lwb_buf = NULL;
> - lwb->lwb_tx = NULL;
> + lwb->lwb_write_zio = NULL;
> mutex_exit(&zilog->zl_lock);
>
> + if (avl_numnodes(t) == 0)
> + return;
> +
> /*
> - * Now that we've written this log block, we have a stable pointer
> - * to the next block in the chain, so it's OK to let the txg in
> - * which we allocated the next block sync.
> + * If there was an IO error, we're not going to call zio_flush()
> + * on these vdevs, so we simply empty the tree and free the
> + * nodes. We avoid calling zio_flush() since there isn't any
> + * good reason for doing so, after the lwb block failed to be
> + * written out.
> */
> - dmu_tx_commit(tx);
> + if (zio->io_error != 0) {
> + while ((zv = avl_destroy_nodes(t, &cookie)) != NULL)
> + kmem_free(zv, sizeof (*zv));
> + return;
> + }
> +
> + while ((zv = avl_destroy_nodes(t, &cookie)) != NULL) {
> + vdev_t *vd = vdev_lookup_top(spa, zv->zv_vdev);
> + if (vd != NULL)
> + zio_flush(lwb->lwb_root_zio, vd);
> + kmem_free(zv, sizeof (*zv));
> + }
> }
>
> /*
> - * Initialize the io for a log block.
> + * This function's purpose is to "open" an lwb such that it is ready to
> + * accept new itxs being committed to it. To do this, the lwb's zio
> + * structures are created, and linked to the lwb. This function is
> + * idempotent; if the passed in lwb has already been opened, this
> + * function is essentially a no-op.
> */
> static void
> -zil_lwb_write_init(zilog_t *zilog, lwb_t *lwb)
> +zil_lwb_write_open(zilog_t *zilog, lwb_t *lwb)
> {
> zbookmark_phys_t zb;
> zio_priority_t prio;
>
> + ASSERT(MUTEX_HELD(&zilog->zl_writer_lock));
> + ASSERT3P(lwb, !=, NULL);
> + EQUIV(lwb->lwb_root_zio == NULL, lwb->lwb_state == LWB_STATE_CLOSED);
> + EQUIV(lwb->lwb_root_zio != NULL, lwb->lwb_state == LWB_STATE_OPENED);
> +
> SET_BOOKMARK(&zb, lwb->lwb_blk.blk_cksum.zc_word[ZIL_ZC_OBJSET],
> ZB_ZIL_OBJECT, ZB_ZIL_LEVEL,
> lwb->lwb_blk.blk_cksum.zc_word[ZIL_ZC_SEQ]);
>
> - if (zilog->zl_root_zio == NULL) {
> - zilog->zl_root_zio = zio_root(zilog->zl_spa, NULL, NULL,
> - ZIO_FLAG_CANFAIL);
> - }
> - if (lwb->lwb_zio == NULL) {
> + if (lwb->lwb_root_zio == NULL) {
> abd_t *lwb_abd = abd_get_from_buf(lwb->lwb_buf,
> BP_GET_LSIZE(&lwb->lwb_blk));
> +
> if (!lwb->lwb_slog || zilog->zl_cur_used <= zil_slog_bulk)
> prio = ZIO_PRIORITY_SYNC_WRITE;
> else
> prio = ZIO_PRIORITY_ASYNC_WRITE;
> - lwb->lwb_zio = zio_rewrite(zilog->zl_root_zio, zilog->zl_spa,
> - 0, &lwb->lwb_blk, lwb_abd, BP_GET_LSIZE(&lwb->lwb_blk),
> - zil_lwb_write_done, lwb, prio,
> - ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE, &zb);
> +
> + lwb->lwb_root_zio = zio_root(zilog->zl_spa,
> + zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL);
> + ASSERT3P(lwb->lwb_root_zio, !=, NULL);
> +
> + lwb->lwb_write_zio = zio_rewrite(lwb->lwb_root_zio,
> + zilog->zl_spa, 0, &lwb->lwb_blk, lwb_abd,
> + BP_GET_LSIZE(&lwb->lwb_blk), zil_lwb_write_done, lwb,
> + prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE, &zb);
> + ASSERT3P(lwb->lwb_write_zio, !=, NULL);
> +
> + lwb->lwb_state = LWB_STATE_OPENED;
> +
> + mutex_enter(&zilog->zl_lock);
> +
> + /*
> + * The zilog's "zl_last_lwb_opened" field is used to
> + * build the lwb/zio dependency chain, which is used to
> + * preserve the ordering of lwb completions that is
> + * required by the semantics of the ZIL. Each new lwb
> + * zio becomes a parent of the "previous" lwb zio, such
> + * that the new lwb's zio cannot complete until the
> + * "previous" lwb's zio completes.
> + *
> + * This is required by the semantics of zil_commit();
> + * the commit waiters attached to the lwbs will be woken
> + * in the lwb zio's completion callback, so this zio
> + * dependency graph ensures the waiters are woken in the
> + * correct order (the same order the lwbs were created).
> + */
> + lwb_t *last_lwb_opened = zilog->zl_last_lwb_opened;
> + if (last_lwb_opened != NULL &&
> + last_lwb_opened->lwb_state != LWB_STATE_DONE) {
> + ASSERT(last_lwb_opened->lwb_state == LWB_STATE_OPENED ||
> + last_lwb_opened->lwb_state == LWB_STATE_ISSUED);
> + ASSERT3P(last_lwb_opened->lwb_root_zio, !=, NULL);
> + zio_add_child(lwb->lwb_root_zio,
> + last_lwb_opened->lwb_root_zio);
> + }
> + zilog->zl_last_lwb_opened = lwb;
> +
> + mutex_exit(&zilog->zl_lock);
> }
> +
> + ASSERT3P(lwb->lwb_root_zio, !=, NULL);
> + ASSERT3P(lwb->lwb_write_zio, !=, NULL);
> + ASSERT3S(lwb->lwb_state, ==, LWB_STATE_OPENED);
> }
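
The ordering argument in the comment above can be sketched with a toy model: each new node waits on the previously opened one, so completion callbacks run in creation order even when the underlying writes finish out of order. This is a simplified single-threaded illustration and not the zio implementation. The names node_t, node_open(), node_io_done() and completion_order are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	MAX_NODES	8

/* Hypothetical stand-in for an lwb together with its root zio. */
typedef struct node {
	int		n_id;
	bool		n_io_done;	/* this node's own write finished */
	bool		n_completed;	/* completion callback has run */
	struct node	*n_prev;	/* must complete before this node */
	struct node	*n_next;	/* waits on this node */
} node_t;

static int completion_order[MAX_NODES];
static int completion_count;

/* Chain a new node behind the last one opened (cf. zl_last_lwb_opened). */
static void
node_open(node_t *n, int id, node_t **last_opened)
{
	n->n_id = id;
	n->n_io_done = false;
	n->n_completed = false;
	n->n_prev = *last_opened;
	n->n_next = NULL;
	if (*last_opened != NULL)
		(*last_opened)->n_next = n;
	*last_opened = n;
}

/*
 * Record that a node's write finished, then run every completion that
 * is now unblocked, walking forward along the chain.
 */
static void
node_io_done(node_t *n)
{
	n->n_io_done = true;
	while (n != NULL && n->n_io_done && !n->n_completed &&
	    (n->n_prev == NULL || n->n_prev->n_completed)) {
		assert(completion_count < MAX_NODES);
		n->n_completed = true;
		completion_order[completion_count++] = n->n_id;
		n = n->n_next;
	}
}
```

If three nodes are opened as 0, 1, 2 and their writes finish in the order 2, 0, 1, nothing completes after the first write, node 0 completes after the second, and nodes 1 and 2 complete together after the third.
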
>
> /*
> @@ -953,7 +1189,7 @@ uint64_t zil_block_buckets[] = {
> * Calls are serialized.
> */
> static lwb_t *
> -zil_lwb_write_start(zilog_t *zilog, lwb_t *lwb, boolean_t last)
> +zil_lwb_write_issue(zilog_t *zilog, lwb_t *lwb)
> {
> lwb_t *nlwb = NULL;
> zil_chain_t *zilc;
> @@ -965,6 +1201,11 @@ zil_lwb_write_start(zilog_t *zilog, lwb_t *lwb, boolea
> int i, error;
>
> *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
Building world/kernel at r324015 fails with:
[...]
===> lib/libpam/modules/pam_login_access (obj)
--- cddl/lib__L ---
--- zil.o ---
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:2333:19: warning: implicit declaration of function 'cv_timedwait_sbt' is invalid in C99 [-Wimplicit-function-declaration]
                int wait_err = cv_timedwait_sbt(&zcw->zcw_cv,
                               ^
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:2335:8: error: use of undeclared identifier 'C_ABSOLUTE'
                    C_ABSOLUTE);
                    ^
--- lib__L ---
--- obj_subdir_lib/libpam/modules/pam_nologin ---
===> lib/libpam/modules/pam_nologin (obj)
--- cddl/lib__L ---
1 warning and 1 error generated.