zfs_trim_enabled destroys zio_free() performance

Sun Sep 13 12:55:14 UTC 2015

On 11/09/2015 17:07, Matthew Ahrens wrote:
> I discovered that when destroying a ZFS snapshot, we can end up using
> several seconds of CPU via this stack trace:
>
>                kernel`spinlock_exit+0x2d
>                kernel`taskqueue_enqueue+0x12c
>                zfs.ko`zio_issue_async+0x7c
>                zfs.ko`zio_execute+0x162
>                zfs.ko`dsl_scan_free_block_cb+0x15f
>                zfs.ko`bpobj_iterate_impl+0x25d
>                zfs.ko`bpobj_iterate_impl+0x46e
>                zfs.ko`dsl_scan_sync+0x152
>                zfs.ko`spa_sync+0x5c1
>                zfs.ko`txg_sync_thread+0x3a6
>                kernel`fork_exit+0x9a
>                kernel`0xffffffff80d0acbe
>               6558 ms
>
> This is not good for performance since, in addition to the CPU cost, it
> doesn't allow the sync thread to do anything else, and this is observable
> as periods where we don't do any write i/o to disk for several seconds.
>
> The problem is that when zfs_trim_enabled is set (which it is by default),
> zio_free_sync() always sets ZIO_STAGE_ISSUE_ASYNC, causing the free to be
> dispatched to a taskq.  Since each task completes very quickly, there is a
> large locking and context switching overhead -- we would be better off just
> processing the free in the caller's context.
>
> I'm not sure exactly why we need to go async when trim is enabled, but it
> seems like at least we should not bother going async if trim is not
> actually being used (e.g. with an all-spinning-disk pool).  It would also
> be worth investigating not going async even when trim is useful (e.g. on
> SSD-based pools).
>
> Here is the relevant code:
>
> zio_free_sync():
>          if (zfs_trim_enabled)
>                  stage |= ZIO_STAGE_ISSUE_ASYNC | ZIO_STAGE_VDEV_IO_START |
>                      ZIO_STAGE_VDEV_IO_ASSESS;
>          /*
>           * GANG and DEDUP blocks can induce a read (for the gang block
> header,
>           * or the DDT), so issue them asynchronously so that this thread is
>           * not tied up.
>           */
>          else if (BP_IS_GANG(bp) || BP_GET_DEDUP(bp))
>                  stage |= ZIO_STAGE_ISSUE_ASYNC;
TRIM requests are queued, combined and only actioned after time in the 
TRIM thread as they are quite expensive which why I believe it was 
thought async was required, however given all this will do is trigger a 
call to trim_map_free for leaf vdev's which will be either:
1. A no-op if vdev_notrim is set (spinning rust)
2. An insert into the trim AVL

The processing of the zio should always be quick I don't see why we 
couldn't execute it sync.

I've set a test going on my head box removing ZIO_STAGE_ISSUE_ASYNC to 
see if I get any strange behaviour.

     Regards
     Steve