N-way mirror read speedup in zfsonlinux

Steven Hartland killing at multiplay.co.uk
Sun Aug 18 03:08:12 UTC 2013


Thanks for the feedback, Justin; comments inline below.

----- Original Message ----- 
From: "Justin T. Gibbs" <gibbs at FreeBSD.org>
> First off, thanks for taking this on.  It looks like it will pay
> high dividends.
> 
> Here's my feedback:
> 
> 1) For the GEOM::rotating attribute, and struct disk's d_nonrotating
>   field, I think a rotational speed attribute would be more
>   generically useful.  For example, in Spectra's disk product, we
>   use the rotational speed to help group disks into related classes.
>   I would suggest using the SCSI/ATA conventions of 0 = unreported,
>   1 = non-rotating, N > 1 = reported RPM.

I like this, it provides more flexibility for choices moving forward,
so I've changed d_nonrotating to d_rotation_rate.
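
For illustration, the convention boils down to something like the
following (a minimal sketch; the helper names are mine, not the actual
GEOM or driver code):

    /*
     * Sketch only: interpret a d_rotation_rate style value using the
     * SCSI/ATA convention above.
     *   0     = rotation rate not reported
     *   1     = non-rotating media (e.g. SSD)
     *   N > 1 = nominal rotation rate in RPM
     */
    #include <stdbool.h>
    #include <stdint.h>

    static bool
    disk_is_nonrotating(uint16_t rotation_rate)
    {
            return (rotation_rate == 1);
    }

    static bool
    disk_rate_reported(uint16_t rotation_rate)
    {
            return (rotation_rate != 0);
    }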

> 2) do {} while() loops have their place, but only if they improve
>   code clarity and/or yield a significant performance benefit.  I
>   don't see either applying here.

I'd have to disagree: unless using it makes the code difficult to
understand, a do/while should be used when you know there will always
be at least one iteration of the loop, and hence there is no need to
test before performing the first iteration.

This is the case when calculating the load value, as a mirror must have
at least two vdevs.

The lowest_nr case always needs the comparison anyway, so that should
be a for loop; changed.
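
To make the distinction concrete, the shape being discussed is roughly
this (simplified names and an invented load weighting, not the actual
vdev_mirror code):

    #include <stdint.h>

    typedef struct mirror_child {
            int             mc_load;        /* load derived from pending I/O */
            uint64_t        mc_last_offset; /* end of the last issued read */
    } mirror_child_t;

    /*
     * A mirror map always has at least two children, so the load
     * calculation is guaranteed at least one iteration and a do/while
     * skips the redundant initial test.
     */
    static void
    mirror_calc_load(mirror_child_t *mc, int children, uint64_t zio_offset)
    {
            int c = 0;

            do {
                    /* Invented weighting: penalise a child whose last
                     * issued offset doesn't line up with this read. */
                    if (mc[c].mc_last_offset != zio_offset)
                            mc[c].mc_load++;
            } while (++c < children);
    }

    /*
     * Selecting the lowest loaded child needs a comparison per child
     * either way, so a plain for loop reads better here.
     */
    static int
    mirror_lowest_child(const mirror_child_t *mc, int children)
    {
            int c, lowest = 0;

            for (c = 1; c < children; c++)
                    if (mc[c].mc_load < mc[lowest].mc_load)
                            lowest = c;
            return (lowest);
    }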

> 3) Seeks do have a penalty for SSDs in the form of additional command
>   overhead.  If given the choice between SSDs of similar load, we
>   should prefer one with perfect locality in the hope that vdev_queue
>   will aggregate the I/O.  I have my eye on adding "unmapped ZIOs" for
>   use in vdev_queue so that we can avoid the data copy and aggregate
>   beyond the current 128k limit.

Sounds sensible, especially as it gives a more quantifiable reason for
choosing one non-rotating device over another, instead of relying on
random choice.

It does however produce a noticeable drop in performance for the mixed
drive case, 284MB/s down to 279MB/s in my setup here, so I've documented
this in the comments.

I need to confirm it doesn't cause any generic performance drops, but
first I need to switch my test rig round to all SSDs to do that.
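
As a rough sketch of the idea, a non-rotating child only gets a
locality bonus when the offsets line up exactly, so vdev_queue has a
chance to aggregate (illustrative values and names, not the patch
itself):

    #include <stdint.h>

    static int
    mirror_child_weight(int load, int nonrotating, uint64_t last_offset,
        uint64_t zio_offset)
    {
            /*
             * Illustrative only: seeks still cost command overhead on
             * SSDs, so given similar load prefer the child with perfect
             * locality in the hope vdev_queue aggregates the I/O.
             */
            if (nonrotating && last_offset == zio_offset)
                    return (load - 1);      /* small locality bonus */
            return (load);
    }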

> 4) mm_replacing can be deceiving.  In a patch we have yet to upstream,
>   we actually renamed it to mm_resilvering and set it by interrogating
>   the DSL to see if a resilver task is active.  Alan Somers' comment in
>   the code describes part of the problem:
> 
> /*
> * If we are resilvering, then we should handle scrub reads
> * differently; we shouldn't issue them to the resilvering
> * device because it might not have those blocks.
> *
> * We are resilvering iff:
> * 1) We are a replacing vdev (ie our name is "replacing-1" or
> *    "spare-1" or something like that), and
> * 2) The pool is currently being resilvered.
> *
> * We cannot simply check vd->vdev_resilvering, because that
> * variable has never been correctly calculated.
> *
> * Nor can we just check our vdev_ops; there are cases (such as
> * when a user types "zpool replace pool odev spare_dev" and
> * spare_dev is in the spare list, or when a spare device is
> * automatically used to replace a DEGRADED device) when
> * resilvering is complete but both the original vdev and the
> * spare vdev remain in the pool.  That behavior is intentional.
> * It helps implement the policy that a spare should be
> * automatically removed from the pool after the user replaces
> * the device that originally failed.
> */
> 
>   Further complicating things is that when a scrub or resilver
>   operation is active, not all reads will be "scrub reads".  So
>   if we want to be 100% correct here, we want to select the least
>   loaded vdev that does not contain the block being read in its
>   DTL, unless the read is a SCRUB read and resilvering is not
>   active.

Removing the mm_replacing checks resulted in zero measurable performance
difference in my tests, so given that, plus your description, I see no
real reason to keep it.

I tried making vdev_mirror_load return INT_MAX when
mm_vdev_resilver_txg != 0 (i.e. when resilvering), but that actually
resulted in slightly worse performance in the resilvering case, so I've
noted that in a comment and removed it.
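
For clarity, the experiment was roughly of this shape (illustrative
names, not the actual patch):

    #include <limits.h>
    #include <stdint.h>

    /*
     * Illustrative only: penalise a child that is the target of an
     * active resilver by returning the worst possible load, steering
     * reads away from it.  In practice this measured slightly slower
     * during resilver, so it was dropped.
     */
    static int
    mirror_child_load(int base_load, uint64_t resilver_txg)
    {
            if (resilver_txg != 0)
                    return (INT_MAX);
            return (base_load);
    }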

> 5) Even if we don't use load to determine the best device to use,
>   we still want to update the last offset issued so that future
>   calculations have correct information.

Good catch, changed.
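
In other words, the bookkeeping now happens unconditionally, along
these lines (illustrative, not the actual code):

    #include <stdint.h>

    /*
     * Illustrative only: even when the child was chosen without
     * consulting load (e.g. due to DTL constraints), record where this
     * read ends so future load/locality calculations see accurate
     * history.
     */
    static void
    mirror_note_last_offset(uint64_t *last_offset, uint64_t zio_offset,
        uint64_t zio_size)
    {
            *last_offset = zio_offset + zio_size;
    }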

> 6) Can't vdev_queue maintain vq_lastoffset on its own?  Having it
>   in vdev_queue also avoids any future bugs in updating the wrong
>   device (e.g. updating mm_preferred when mm_preferred isn't the
>   only device read from or isn't read from at all due to the state
>   of its DTL).

I actually tried this before, but it produces significantly worse
results: 236MB/s instead of 286MB/s on my 3-disk setup. I suspect
that's because doing it there means it happens too late to have the
full effect.

One of my initial attempts also used the tree queue directly instead
of storing a separate value, but that too produced poorer results.

> Attached are updated patches (only for files I changed *after*
> applying your patch) for my progress so far in trying to address
> 2-6.  I haven't done more than compile and boot test them so far,
> and it's way past my bedtime tonight.  But I will perform more
> testing tomorrow and perhaps try to more fully address #4 and tune
> the parameters for #3.

Updated patch attached.

    Regards
    Steve

-------------- next part --------------
A non-text attachment was scrubbed...
Name: zfs-mirror-load.patch
Type: application/octet-stream
Size: 16537 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/zfs-devel/attachments/20130818/eda72807/attachment.obj>

