Slow resilvering with mirrored ZIL

Steven Hartland killing at multiplay.co.uk
Thu Jul 4 23:21:24 UTC 2013


----- Original Message ----- 
From: "Jeremy Chadwick" <jdc at koitsu.org>


> On Thu, Jul 04, 2013 at 08:41:18PM +0100, Steven Hartland wrote:
>> ----- Original Message ----- From: "Jeremy Chadwick"
>> <jdc at koitsu.org>
>> ...
>> >I believe -- but I need someone else to chime in here with confirmation,
>> >particularly someone who is familiar with ZFS's internals -- once your
>> >pool is ashift 12, you can do a disk replacement ***without*** having to
>> >do the gnop procedure (because the pool itself is already using ashift
>> >12).  But again, I need someone to confirm that.
>> 
>> Close, the ashift is a property of the vdev and not the entire pool, so
>> if you're adding a new vdev to the pool at least one of the devices in
>> said vdev needs to report 4k sectors, either natively or via the gnop
>> workaround.
> 
> I just looked at zdb -C -- you're right.  I was visually parsing the
> ashift line/location as being part of the pool section, not the vdev
> section.  Thank you for correcting me.
> 
> And if I'm reading what you've said correctly, re: that only one device
> in the pool needs to have the gnop workaround for the vdev to end up
> using ashift=12, then that means "zpool replace" doesn't actually need
> the administrator to do the gnop trick on a disk replacement -- because
> the more I thought about that, the more I realised that would open up a
> can of worms that would be painful to deal with if the pool was in use
> or the system could not be rebooted (I can explain this if asked).

Correct, once the vdev is created its ashift is fixed, so replaces
don't need any gnop. The same goes for pool creation: only one of the
devices in the vdev needs to have gnop, even for, say, a 6 disk RAIDZ.
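In other words, something along these lines at creation time (disk
names here are purely illustrative):

  # expose one disk as a 4k-sector device via gnop
  gnop create -S 4096 /dev/ada1
  # build the vdev using the .nop device; its ashift of 12 is now fixed
  zpool create tank mirror /dev/ada1.nop /dev/ada2
  # (the .nop provider can be destroyed after an export/import; the
  #  ashift sticks with the vdev)
  # later replacements inherit the vdev's ashift, no gnop required
  zpool replace tank ada2 ada3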
 
>> Note our ZFS code doesn't currently recognise FreeBSD 4K quirks; this is
>> something I have a patch for but want to enhance before committing.
> 
> This topic has come up before but I'll ask it again: is there some
> reason ashift still defaults to 9 / can't be increased to 12 ?

In my patch you can configure the default "desired" ashift (i.e. the
minimum), and it defaults to 12. There was concern about the overhead
this adds for non-4k disks, especially when dealing with small files on
non-compressed volumes, which is why I'm still working on being able to
easily override the ashift at creation time.
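If anyone wants to double-check what a given vdev actually ended up
with, zdb will show it (pool name just an example; 12 is what you'd
hope to see on 4k disks):

  zdb -C tank | grep ashift
          ashift: 12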

> I'm not sure of the impact in situations like "I had a vdev made long
> ago (ashift 9), then I added a new vdev to the pool (ashift 12) and now
> ZFS is threatening to murder my children..."  :-)

I don't believe it cares, but it could cause odd performance issues due
to the unbalanced nature of the pool; then again, if you're adding a
vdev to an existing pool you'll already have that ;-)

>> >These SSDs need a full Secure Erase done to them.  In stable/9 you can
>> >do this through camcontrol, otherwise you need to use Linux (there are
>> >live CD/DVD distros that can do this for you) or the vendor's native
>> >utilities (in Windows usually).
>> 
>> When adding a new device to ZFS it will attempt to do a full TRIM, so
>> this isn't 100% necessary, but as some disks still get extra benefits
>> from this it's still good if you want best performance.
> 
> Ah, I wasn't aware of that, thanks.  :-)
> 
> But it would also be doing a TRIM of the LBA ranges associated with each
> partition, rather than the entire SSD.
> 
> Meaning, in the example I gave (re: leaving untouched/unpartitioned
> space at the end of the drive for wear levelling), this would result in
> the untouched/unpartitioned space never being TRIM'd (by anything), thus
> the FTL map would still have references to those LBA ranges.  That'd be
> potentially 30% of LBA ranges in the FTL (depending on past I/O of
> course -- no way to know), and the only thing that would work that out
> is the SSD's GC (which is known to kill performance if/when it kicks
> in).

Correct, which is one of the reasons a full secure erase is a good idea :)
 
> The situation would be different if the OP was using the entire SSD for
> ZFS (i.e. no partitioning), in that case yeah, a full TRIM would do the
> trick.

Yes and no, depending on the disk. It's been noted that TRIM, even a
full-disk TRIM, doesn't result in the same performance restoration as a
secure erase; another reason to still do a secure erase if you can.
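For reference, on stable/9 the secure erase can be driven from
camcontrol along these lines (device name and password are placeholders,
and it wipes the entire drive, so triple-check the target):

  # set a temporary user password, then issue the ATA secure erase
  camcontrol security ada3 -U user -s temppass -e temppass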

> Overall though, Secure Erase is probably wiser in this situation given
> that it's a one-time deal before putting the partitioned SSDs into their
> roles.  He's using log devices, so once those are in place you gotta
> stick with 'em.
> 
> Hmm, that gives me an idea actually -- if gpart(8) itself had a flag to
> induce TRIM for the LBA range of whatever was just created (gpart
> create) or added (gpart add).  That way you could actually induce TRIM
> on those LBA ranges rather than rely on the FS to do it, or have to put
> faith into the SSD's GC (I rarely do :P).  In the OP's case he could
> then make a freebsd-zfs partition filling up the remaining 30% with the
> flag to TRIM it, then when that was done immediately delete the
> partition.  Hmm, not sure if what I'm saying makes sense or not, or if
> that's even a task/role gpart(8) should have...

You mean like the following PR, which is on my list for when I get some
free time:
http://www.freebsd.org/cgi/query-pr.cgi?pr=175943

>> ...
>> >Next topic...
>> >
>> >I would strongly recommend you not use 1 SSD for both log and cache.
>> >I understand your thought process here: "if the SSD dies, the log
>> >devices are mirrored so I'm okay, and the cache is throw-away anyway".
>> 
>> While not ideal it still gives a significant boost against no SLOG, so
>> if that's what HW you have to work with, don't discount the benefit it
>> will bring.
> 
> Sure, the advantage of no seek times due to NAND plays a big role, but
> some of these drives don't particularly perform well when used with a
> larger I/O queue depth.  The OCZ he has is okay, but the Intel drive --
> despite performing well for something of such low capacity (it's on par
> with that of the older X25-M G2 160GB drives) -- still has that capacity
> concern aspect, re: wear levelling needing 30% or so.  The drive costs
> US$130.
> 
> Now consider this: the Samsung 840 256GB (not the Pro) costs US$173
> and will give you 2x the performance of that Intel drive -- and more
> importantly, 12x the capacity (that means 30% for wear levelling is
> hardly a concern).  The 840 also performs significantly better at higher
> queue depths.  I'm just saying that for about US$40 more you get
> something that is by far better and will last you longer.  Low-capacity
> SSDs, even if SLC, are incredibly niche and I'm still not sure what
> demographic they're catering to.

Absolutely. Also factor in that TRIM on SandForce disks is very slow;
so much so that big deletes can easily become a significant performance
bottleneck, so TRIM isn't always the silver bullet, so to speak.
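As an aside, if you want to see how much TRIM a box is actually doing,
builds with the ZFS TRIM support export counters you can watch (exact
sysctl names may differ between versions):

  sysctl kstat.zfs.misc.zio_trim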

> I'm making a lot of assumptions about his I/O workload too, of course.
> I myself tend to stay away from cache/log devices for the time being
> given that my workloads don't necessitate them.  Persistent cache (yeah
> I know it's on the todo list) would interest me since the MCH on my
> board is maxed out at 8GB.

To give a concrete example, which may well be of use to others: we had
a mysql box here with dual 60GB SSD L2ARCs which, after continuous
increases in query write traffic, ended up totally IO saturated.

As a test we removed the L2ARC devices and repartitioned them into a
10GB SLOG and a 40GB L2ARC each, and the machine was utterly
transformed: constant 100% disk IO dropped to 10%, as the SLOGs soaked
up the sync writes from mysql.
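For anyone wanting to do the same, the reshuffle was roughly along
these lines (device and label names are illustrative, and this assumes
a mirrored SLOG with the cache striped across both disks):

  # drop the old whole-disk cache devices from the pool
  zpool remove tank da1 da2
  # repartition each SSD: ~10GB for the SLOG, ~40GB for the L2ARC
  gpart create -s gpt da1
  gpart add -t freebsd-zfs -a 4k -s 10G -l slog0 da1
  gpart add -t freebsd-zfs -a 4k -s 40G -l cache0 da1
  # (repeat for da2 with labels slog1 / cache1)
  zpool add tank log mirror gpt/slog0 gpt/slog1
  zpool add tank cache gpt/cache0 gpt/cache1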
    
    Regards
    Steve



