Slow resilvering with mirrored ZIL

Jeremy Chadwick jdc at koitsu.org
Thu Jul 4 20:28:34 UTC 2013


On Thu, Jul 04, 2013 at 08:41:18PM +0100, Steven Hartland wrote:
> ----- Original Message ----- From: "Jeremy Chadwick"
> <jdc at koitsu.org>
> ...
> >I believe -- but I need someone else to chime in here with confirmation,
> >particularly someone who is familiar with ZFS's internals -- once your
> >pool is ashift 12, you can do a disk replacement ***without*** having to
> >do the gnop procedure (because the pool itself is already using ashift
> >12).  But again, I need someone to confirm that.
> 
> Close, the ashift is a property of the vdev and not the entire pool, so
> if you're adding a new vdev to the pool at least one of the devices in
> said vdev needs to report 4k sectors, either natively or via the gnop
> workaround.

I just looked at zdb -C -- you're right.  I was visually parsing the
ashift line/location as being part of the pool section, not the vdev
section.  Thank you for correcting me.
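
For anyone following along at home, a quick way to check is something
like this (the pool name "tank" is just an example):

  zdb -C tank | grep ashift

Each top-level vdev in the output carries its own ashift value.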

And if I'm reading what you've said correctly -- that only one device
in the new vdev needs the gnop workaround for that vdev to end up using
ashift=12 -- then that means "zpool replace" doesn't actually need the
administrator to do the gnop trick on a disk replacement.  The more I
thought about it, the more I realised the alternative would open up a
can of worms that would be painful to deal with if the pool was in use
or the system could not be rebooted (I can explain this if asked).
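
For the archives, the gnop dance at vdev creation time looks roughly
like this (device names are just examples), and if the above reading is
right, a later replacement into that ashift=12 vdev is a plain replace:

  gnop create -S 4096 /dev/ada0
  zpool create tank mirror ada0.nop ada1
  zpool export tank
  gnop destroy ada0.nop
  zpool import tank

  # later, replacing a failed disk -- no gnop needed
  zpool replace tank ada1 ada2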

> Note our ZFS code doesn't currently recognise FreeBSD 4K quirks; this
> is something I have a patch for but want to enhance before committing.

This topic has come up before, but I'll ask it again: is there some
reason ashift still defaults to 9 / can't be increased to 12?

I'm not sure of the impact in situations like "I had a vdev made long
ago (ashift 9), then I added a new vdev to the pool (ashift 12) and now
ZFS is threatening to murder my children..."  :-)

> >Next topic.
> ...
> >Combine this fact with the fact that 9.1-RELEASE does not support TRIM
> >on ZFS, and you now have SSDs which are probably beat to hell and back.
> >
> >You really need to be running stable/9 if you want to use SSDs with ZFS.
> >I cannot stress this enough.  I will not bend on this fact.  I do not
> >care if what people have are SLC rather than MLC or TLC -- it doesn't
> >matter.  TRIM on ZFS is a downright necessity for long-term reliability
> >of an SSD.  Anyway...
> 
> stable/8 also has TRIM support now.

Thanks -- didn't know that was MFC'd that far back.  And thank you for
your work on that; it's something I've been looking forward to for a
long time now, as you know, and I really do appreciate it.

> >These SSDs need a full Secure Erase done to them.  In stable/9 you can
> >do this through camcontrol, otherwise you need to use Linux (there are
> >live CD/DVD distros that can do this for you) or the vendor's native
> >utilities (in Windows usually).
> 
> When adding a new device to ZFS it will attempt to do a full TRIM, so
> this isn't 100% necessary, but as some disks still get extra benefits
> from this it's still good if you want best performance.

Ah, I wasn't aware of that, thanks.  :-)

But that would only TRIM the LBA ranges associated with each partition
handed to ZFS, rather than the entire SSD.

Meaning, in the example I gave (re: leaving untouched/unpartitioned
space at the end of the drive for wear levelling), the
untouched/unpartitioned space would never be TRIM'd (by anything), thus
the FTL map would still have references to those LBA ranges.  That'd be
potentially 30% of LBA ranges in the FTL (depending on past I/O of
course -- no way to know), and the only thing that would reclaim them is
the SSD's GC (which is known to kill performance if/when it kicks in).

The situation would be different if the OP was using the entire SSD for
ZFS (i.e. no partitioning); in that case, yeah, a full TRIM would do the
trick.
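
To make the partitioned scenario concrete, the layout I'm describing
looks something like this (device name and sizes purely illustrative):

  gpart create -s gpt ada4
  gpart add -t freebsd-zfs -a 1m -s 80g -l slog0 ada4

Everything past that partition stays unpartitioned, so nothing ever
issues a TRIM for it.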

Overall though, Secure Erase is probably wiser in this situation given
that it's a one-time deal before putting the partitioned SSDs into their
roles.  He's using log devices, so once those are in place you gotta
stick with 'em.
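
For reference, the stable/9 camcontrol version of that is along these
lines (device name and password are arbitrary examples; make sure the
drive isn't security-frozen first):

  camcontrol security ada4                      # check security state
  camcontrol security ada4 -U user -s erasepw   # set a temporary password
  camcontrol security ada4 -U user -e erasepw   # SECURITY ERASE UNIT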

Hmm, that TRIM-on-add behaviour gives me an idea actually -- what if
gpart(8) itself had a flag to induce TRIM for the LBA range of whatever
was just created (gpart create) or added (gpart add)?  That way you
could induce TRIM on those LBA ranges yourself rather than rely on the
FS to do it, or have to put faith in the SSD's GC (I rarely do :P).  In
the OP's case he could then make a freebsd-zfs partition filling up the
remaining 30% with the flag to TRIM it, then immediately delete the
partition once the TRIM finished.  Hmm, not sure if what I'm saying
makes sense or not, or if that's even a task/role gpart(8) should
have...
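
Until something like that exists, the closest manual approximation I
can think of leans on the TRIM-on-add behaviour you mentioned above
(labels and the partition index below are purely illustrative):

  gpart add -t freebsd-zfs -l scratch ada4   # fill the leftover ~30%
  zpool create scratchpool gpt/scratch       # full TRIM happens on add
  zpool destroy scratchpool
  gpart delete -i 2 ada4                     # drop the scratch partition

Clunky, but it gets those LBAs out of the FTL without trusting the GC.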

> ...
> >Next topic...
> >
> >I would strongly recommend you not use 1 SSD for both log and cache.
> >I understand your thought process here: "if the SSD dies, the log
> >devices are mirrored so I'm okay, and the cache is throw-away anyway".
> 
> While not ideal it still gives a significant boost against no SLOG, so
> if that's what HW you have to work with, don't discount the benefit it
> will bring.

Sure, the advantage of no seek times due to NAND plays a big role, but
some of these drives don't perform particularly well at larger I/O queue
depths.  The OCZ he has is okay, but the Intel drive -- despite
performing well for something of such low capacity (it's on par with the
older X25-M G2 160GB drives) -- still has the capacity concern, re: wear
levelling needing 30% or so.  The drive costs US$130.

Now consider this: the Samsung 840 256GB (not the Pro) costs US$173
and will give you 2x the performance of that Intel drive -- and more
importantly, 12x the capacity (that means 30% for wear levelling is
hardly a concern).  The 840 also performs significantly better at higher
queue depths.  I'm just saying that for about US$40 more you get
something that is by far better and will last you longer.  Low-capacity
SSDs, even if SLC, are incredibly niche and I'm still not sure what
demographic they're catering to.

I'm making a lot of assumptions about his I/O workload too, of course.
I myself tend to stay away from cache/log devices for the time being
given that my workloads don't necessitate them.  Persistent cache (yeah
I know it's on the todo list) would interest me since the MCH on my
board is maxed out at 8GB.

> ...
> >>nas# smartctl -a ada3
> >>ada3: Unable to detect device type
> >
> >My fault -- the syntax here is wrong, I should have been more clear:
> >
> >smartctl -a /dev/ada{0,5}
> >
> >Also, please update your ports tree and install smartmontools 6.1.
> >There are improvements there pertaining to SSDs that are relevant.
> 
> Also don't forget to update the disk DB using update-smart-drivedb.

Yeah, that too.  The stock drivedb.h that comes with 6.1 should have
both his drive models correctly supported, if my memory serves me right.
I don't follow the drivedb.h commits (at one point I did).
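
For completeness, from ports that's roughly (assuming the usual default
paths):

  cd /usr/ports/sysutils/smartmontools && make install clean
  /usr/local/sbin/update-smart-drivedb
  smartctl -a /dev/ada0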

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |


