Slow resilvering with mirrored ZIL

Daniel Kalchev daniel at digsys.bg
Fri Jul 5 08:02:04 UTC 2013


On 04.07.13 23:28, Jeremy Chadwick wrote:
>
> I'm not sure of the impact in situations like "I had a vdev made long
> ago (ashift 9), then I added a new vdev to the pool (ashift 12) and now
> ZFS is threatening to murder my children..."  :-)

Such a situation led me to spend a few months recreating/reshuffling 
some 40TB of snapshots -- mostly because I was too lazy to build a new 
system to copy to and the old one didn't have enough spare slots... To 
make things more interesting, I had made the ashift=9 vdev on 4k-sector 
drives and the ashift=12 vdev on 512-byte-sector drives...

Which brings up the question of whether it is possible to roll back 
such a new vdev addition easily -- errors happen...
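
For reference, a quick way to check what you are about to mix, before 
the "zpool add" -- a rough sketch only, pool and device names here are 
placeholders and the exact zdb output format differs between versions:

    # ashift of the existing top-level vdevs
    zdb -C tank | grep ashift

    # sector size the new disk actually reports
    diskinfo -v /dev/ada4 | grep sectorsize

As far as I know there is no rollback: "zpool remove" only handles log, 
cache and spare devices, so once a data vdev is in, the only way out is 
send/receive into a rebuilt pool, as above.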

> But it would also be doing a TRIM of the LBA ranges associated with each
> partition, rather than the entire SSD.
>
> Meaning, in the example I gave (re: leaving untouched/unpartitioned
> space at the end of the drive for wear levelling), this would result in
> the untouched/unpartitioned space never being TRIM'd (by anything), thus
> the FTL map would still have references to those LBA ranges.  That'd be
> potentially 30% of LBA ranges in the FTL (depending on past I/O of
> course -- no way to know), and the only thing that would work that out
> is the SSD's GC (which is known to kill performance if/when it kicks
> in).

This assumes some knowledge of how SSD drives operate, which might be 
true for one model/maker and not for another.
No doubt, starting with a clean drive is best. That can be achieved by 
adding the entire drive to ZFS and then removing it -- a cheap way to 
get a "Secure Erase"-like effect on FreeBSD. Then go on with 
partitioning...

> Hmm, that gives me an idea actually -- if gpart(8) itself had a flag to
> induce TRIM for the LBA range of whatever was just created (gpart
> create) or added (gpart add).  That way you could actually induce TRIM
> on those LBA ranges rather than rely on the FS to do it, or have to put
> faith into the SSD's GC (I rarely do :P).  In the OP's case he could
> then make a freebsd-zfs partition filling up the remaining 30% with the
> flag to TRIM it, then when that was done immediately delete the
> partition.  Hmm, not sure if what I'm saying makes sense or not, or if
> that's even a task/role gpart(8) should have...

Not a bad idea. Really. :)
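
Until gpart grows such a flag, something close can be had with the same 
trick as above, confined to the spare space: put a throwaway pool on 
just that partition so TRIM-on-init covers only that LBA range, then 
tear it down. Sketch only -- the label, device and partition index are 
made up:

    # partition covering the spare ~30% at the end of the SSD
    gpart add -t freebsd-zfs -a 1m -l spare0 da1

    # throwaway pool on it, then destroy it and drop the partition
    zpool create scratch gpt/spare0
    zpool destroy scratch
    gpart delete -i 2 da1    # whatever index the partition got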

>
>> ...
>>> Next topic...
>>>
>>> I would strongly recommend you not use 1 SSD for both log and cache.
>>> I understand your thought process here: "if the SSD dies, the log
>>> devices are mirrored so I'm okay, and the cache is throw-away anyway".
>> While not ideal it still gives a significant boost against no SLOG, so
>> if thats what HW you have to work with, don't discount the benefit it
>> will bring.
> Sure, the advantage of no seek times due to NAND plays a big role, but
> some of these drives don't particularly perform well when used with a
> larger I/O queue depth.

If we are talking about the SLOG, there are no seeks. The SLOG is 
written sequentially. You *can* use a spinning drive as a SLOG and you 
*will* see a noticeable performance boost in doing so.

The L2ARC, on the other hand, is designed with no-seek SSDs in mind, as 
it will do many small and scattered reads. Writes to it are still 
sequential, I believe.
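
For completeness, the commands themselves are trivial -- the pool name 
and GPT labels below are made up:

    # mirrored SLOG on two small partitions, one per SSD
    zpool add tank log mirror gpt/slog0 gpt/slog1

    # L2ARC cannot be mirrored anyway; just add it
    zpool add tank cache gpt/l2arc0

    # both log and cache devices can be removed again at any time
    zpool remove tank gpt/l2arc0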


> Now consider this: the Samsung 840 256GB (not the Pro) costs US$173
> and will give you 2x the performance of that Intel drive -- and more
> importantly, 12x the capacity (that means 30% for wear levelling is
> hardly a concern).  The 840 also performs significantly better at higher
> queue depths.  I'm just saying that for about US$40 more you get
> something that is by far better and will last you longer.  Low-capacity
> SSDs, even if SLC, are incredibly niche and I'm still not sure what
> demographic they're catering to.

The non-Pro 840 is hardly a match for any SLC SSD. Remember, SLC is all 
about endurance. It is orders of magnitude more enduring than the TLC 
flash used in that cheap consumer drive. IOPS and interface speed are a 
different matter -- and may not be the concern here.

Nevertheless, I have recently begun to view SSDs used for SLOG/L2ARC as 
consumables... however, no matter how I run the numbers, the enterprise 
drives always win by a big margin...
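
The calculation is nothing more than price divided by rated write 
endurance. The figures below are purely illustrative placeholders, not 
real drive specs:

    # USD per TB of rated write endurance (all numbers hypothetical)
    echo "scale=2; 173 / 70" | bc      # consumer TLC: price / TBW
    echo "scale=2; 700 / 3500" | bc    # enterprise SLC: price / TBW

With anything like those ratios, the "cheap" drive ends up several 
times more expensive per byte actually written.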

> I'm making a lot of assumptions about his I/O workload too, of course.
> I myself tend to stay away from cache/log devices for the time being
> given that my workloads don't necessitate them.  Persistent cache (yeah
> I know it's on the todo list) would interest me since the MCH on my
> board is maxed out at 8GB.

In short... be careful. :) Don't be tempted to add too large an L2ARC 
with only 8GB of RAM. :)
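
The arithmetic behind that warning, with assumed numbers -- the 
per-record header size differs between ZFS versions (something around 
200 bytes is the usual ballpark) and the 8KB average record size is 
just a guess about the workload:

    # ARC memory consumed by L2ARC headers, back of the envelope:
    # 200GB of L2ARC / 8KB average record * ~200 bytes of header each
    echo "scale=1; 200 * 1024 * 1024 / 8 * 200 / 1024 / 1024 / 1024" | bc

That is nearly 5GB of the 8GB of RAM gone just to track the cache, 
leaving very little for the ARC proper.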

Daniel

