N-way mirror read speedup in zfsonlinux
Justin T. Gibbs
gibbs at FreeBSD.org
Sat Aug 17 07:22:30 UTC 2013
Hi Steve,
First off, thanks for taking this on. It looks like it will pay
high dividends.
Here's my feedback:
1) For the GEOM::rotating attribute, and struct disk's d_nonrotating
field, I think a rotational speed attribute would be more
generically useful. For example, in Spectra's disk product, we
use the rotational speed to help group disks into related classes.
I would suggest using the SCSI/ATA conventions of 0 = unreported,
1 = non-rotating, N > 1 = reported RPM.
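As a rough sketch of what that could look like (the field name
d_rotation_rate and the macro names are illustrative, not from the
patch):

	/*
	 * Hypothetical replacement for d_nonrotating, using the
	 * SCSI Block Device Characteristics / ATA IDENTIFY encoding.
	 */
	#define	DISK_RR_UNKNOWN		0	/* device did not report */
	#define	DISK_RR_NON_ROTATING	1	/* solid state, etc. */
	/* Values > 1 are the nominal rotation rate in RPM, e.g. 7200. */

	static inline int
	disk_is_nonrotating(uint16_t rotation_rate)
	{

		return (rotation_rate == DISK_RR_NON_ROTATING);
	}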
2) do {} while() loops have their place, but only if they improve
code clarity and/or yield a significant performance benefit. I
don't see either applying here.
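For reference, the two loop shapes in question (bodies elided; only
the control flow differs):

	/* Straightforward forward iteration. */
	for (c = 0; c < mm->mm_children; c++) {
		/* ... examine child c ... */
	}

	/* do {} while () iterating backwards from the last child. */
	c = mm->mm_children;
	do {
		c--;
		/* ... examine child c ... */
	} while (c > 0);

If the repeated load of mm->mm_children is really a concern, caching
it in a local variable keeps the first form and recovers most of the
micro-optimization.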
3) Seeks do have a penalty for SSDs in the form of additional command
overhead. If given the choice between SSDs of similar load, we
should prefer one with perfect locality in the hope that vdev_queue
will aggregate the I/O. I have my eye on adding "unmapped ZIOs" for
use in vdev_queue so that we can avoid the data copy and aggregate
beyond the current 128k limit.
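Roughly what I have in mind for the SSD case (the load helper, the
last-offset helper, and the locality bonus are illustrative names, not
anything in the current patch):

	/*
	 * Sketch: bias child selection toward the child whose last
	 * issued I/O ends exactly where this read begins, so that
	 * vdev_queue has a chance to aggregate the two.
	 */
	best_child = -1;
	for (c = 0; c < mm->mm_children; c++) {
		load = mirror_child_load(mm, c);	/* hypothetical */
		if (mirror_last_offset(mm, c) == zio->io_offset)
			load -= zfs_vdev_mirror_locality_bonus;
		if (best_child == -1 || load < best_load) {
			best_child = c;
			best_load = load;
		}
	}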
4) mm_replacing can be deceiving. In a patch we have yet to upstream,
we actually renamed it to mm_resilvering and set it by interrogating
the DSL to see if a resilver task is active. Alan Somers's comment in
the code describes part of the problem:
/*
 * If we are resilvering, then we should handle scrub reads
 * differently; we shouldn't issue them to the resilvering
 * device because it might not have those blocks.
 *
 * We are resilvering iff:
 * 1) We are a replacing vdev (i.e. our name is "replacing-1" or
 *    "spare-1" or something like that), and
 * 2) The pool is currently being resilvered.
 *
 * We cannot simply check vd->vdev_resilvering, because that
 * variable has never been correctly calculated.
 *
 * Nor can we just check our vdev_ops; there are cases (such as
 * when a user types "zpool replace pool odev spare_dev" and
 * spare_dev is in the spare list, or when a spare device is
 * automatically used to replace a DEGRADED device) when
 * resilvering is complete but both the original vdev and the
 * spare vdev remain in the pool. That behavior is intentional.
 * It helps implement the policy that a spare should be
 * automatically removed from the pool after the user replaces
 * the device that originally failed.
 */
Further complicating things is that when a scrub or resilver
operation is active, not all reads will be "scrub reads". So
if we want to be 100% correct here, we want to select the least
loaded vdev that does not contain the block being read in its
DTL, unless the read is a SCRUB read and resilvering is not
active.
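Put as code, the eligibility test would look something like this
(vdev_dtl_contains() is the existing DTL query; the wrapper itself is
only a sketch):

	/*
	 * Can this mirror child usefully service the read?
	 */
	static boolean_t
	mirror_child_readable(vdev_t *vd, zio_t *zio, boolean_t resilvering)
	{

		/*
		 * A scrub read while no resilver is active may go to
		 * any child.
		 */
		if ((zio->io_flags & ZIO_FLAG_SCRUB) && !resilvering)
			return (B_TRUE);

		/* Otherwise skip children that may be missing the block. */
		return (!vdev_dtl_contains(vd, DTL_MISSING, zio->io_txg, 1));
	}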
5) Even if we don't use load to determine the best device to use,
we still want to update the last offset issued so that future
calculations have correct information.
6) Can't vdev_queue maintain vq_lastoffset on its own? Having it
in vdev_queue also avoids any future bugs in updating the wrong
device (e.g. updating mm_preferred when mm_preferred isn't the
only device read from or isn't read from at all due to the state
of its DTL).
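Something along these lines, where only vdev_queue ever writes the
field (the helper name and hook point are assumptions on my part):

	/*
	 * Record the next expected offset for a vdev whenever an I/O
	 * is issued through its queue, e.g. called from vdev_queue_io().
	 */
	static inline void
	vdev_queue_register_lastoffset(vdev_t *vd, zio_t *zio)
	{

		vd->vdev_queue.vq_lastoffset = zio->io_offset + zio->io_size;
	}

The mirror code would then only ever read vq_lastoffset when scoring
locality, so it cannot update the wrong child by accident.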
Attached are updated patches (only for files I changed *after*
applying your patch) for my progress so far in trying to address
2-6. I haven't done more than compile and boot test them so far,
and it's way past my bedtime tonight. But I will perform more
testing tomorrow and perhaps try to more fully address #4 and tune
the parameters for #3.
--
Justin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vdev_mirror.diffs
Type: application/octet-stream
Size: 6586 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/zfs-devel/attachments/20130817/8dc6f016/attachment.obj>
-------------- next part --------------
On Aug 16, 2013, at 10:17 AM, Steven Hartland <killing at multiplay.co.uk> wrote:
> Barring any more feedback I intend to commit this soon, so please shout
> if there's anything outstanding ;-)
>
> ----- Original Message ----- From: "Steven Hartland"
>> ----- Original Message ----- From: "Matthew Ahrens" <mahrens at delphix.com>
>>> The platform-independent code in the patch looks good to me with one
>>> exception:
>>> mm->mm_children = c--;
>>>
>>> It's non-obvious why c should be decremented *here*, as opposed to as part
>>> of the loop logic. If there's some trick going on with the loop then it's
>>> lost on me.
>> This just configures 'c' so the loop starts with the last child; I could
>> split the c-- onto its own line if people prefer?
>>> Does it matter that we go through the children in reverse
>>> order? Is there something wrong with the existing loop logic:
>>> for (c = 0; c < mm->mm_children; c++) {
>> Order is not important, but using a do {} while () provides a small optimisation
>> which avoids the first loop test entirely and, on subsequent iterations,
>> compares against a constant rather than a value loaded through a struct pointer.
>
> Regards
> Steve
>