Slow resilvering with mirrored ZIL

Daniel Kalchev daniel at digsys.bg
Fri Jul 5 07:43:37 UTC 2013


On 04.07.13 22:12, Jeremy Chadwick wrote:
> I believe -- but I need someone else to chime in here with confirmation,
> particularly someone who is familiar with ZFS's internals -- once your
> pool is ashift 12, you can do a disk replacement ***without*** having to
> do the gnop procedure (because the pool itself is already using ashift
> 12).  But again, I need someone to confirm that.

I do not in any way claim to know the ZFS internals well, but I can 
confirm this:

Once you have a ZFS vdev that is 4k aligned (ashift=12), you can replace 
drives in it and the vdev will stay 4k aligned. In ZFS, the alignment is 
per-vdev: not per device, not per zpool. When creating a new vdev, ZFS 
looks for the largest sector size reported by the underlying storage and 
uses that. This is why you only need to apply the gnop trick to just 
one of the drives. Once the vdev is created, it pretty much does not 
care what the underlying storage reports.
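
For reference, a minimal sketch of the gnop trick at pool creation time 
(device names are hypothetical; only one member needs the 4k nop 
provider):

    # create a transparent provider that reports 4k sectors
    gnop create -S 4096 ada0

    # build the mirror using the .nop device; ZFS picks the largest
    # reported sector size for the whole vdev, so ashift becomes 12
    zpool create tank mirror ada0.nop ada1

    # verify the vdev alignment
    zdb tank | grep ashift

After an export, "gnop destroy ada0.nop" and a re-import, the vdev 
keeps ashift=12; the nop provider is only needed once.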

> On these drives there are ways to work around this issue -- it
> specifically involves disabling drive-level APM.  To do so, you have to
> initiate a specific ATA CDB to the drive using "camcontrol cmd", and
> this has to be done every time the system reboots.  There is one
> drawback to disabling APM as well: the drives run hotter.

There is a way to do this with smartmontools as well, either with 
smartctl or smartd (which is a wise thing to run anyway). Look at the 
-g and -s options with the apm sub-option. Sometimes, for example when 
you have ATA devices connected through SAS backplanes and HBAs, you 
can't send them these commands via camcontrol.
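
A minimal sketch with smartctl (the device name is hypothetical; -g 
reads the setting, -s changes it):

    # show the current APM level
    smartctl -g apm /dev/ada1

    # disable drive-level APM entirely; like the camcontrol approach,
    # this may need re-applying after a reboot/power cycle
    smartctl -s apm,off /dev/ada1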

> These SSDs need a full Secure Erase done to them.  In stable/9 you can
> do this through camcontrol, otherwise you need to use Linux (there are
> live CD/DVD distros that can do this for you) or the vendor's native
> utilities (in Windows usually).

ZFS in stable/9 actually does a full TRIM of a device when you attach 
it, which can be observed/confirmed via the TRIM statistics counters. 
You don't need to use any external utilities.
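
Assuming the stable/9 TRIM statistics are exposed under the usual 
kstat sysctl tree, a quick way to watch them:

    # bytes and requests trimmed so far, plus unsupported/failed counts
    sysctl kstat.zfs.misc.zio_trim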

> UNDERSTAND: THIS IS NOT THE SAME AS A "DISK FORMAT" OR "ZEROING THE
> DISK".  In fact, dd if=/dev/zero to zero an SSD would be the worst
> possible thing you could do to it.  Secure Erase clears the entire FTL
> and resets the wear levelling matrix (that's just what I call it) back
> to factory defaults, so you end up with out-of-the-box performance:
> there's no more LBA-to-NAND-cell map entries in the FTL (which are
> usually what are responsible for slowdown).

I do not believe Secure Erase does what you propose. It more or less 
just does a full-device TRIM. Resetting things to factory defaults would 
not make any vendor happy, because they base their SSD warranties on the 
wear level. Anyway, if you know of a way to trick this, I am all ears :)
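
For what it's worth, if you do want to run an ATA Secure Erase from 
stable/9, a sketch with camcontrol (device name and password are 
hypothetical; the drive must not be security-frozen by the BIOS):

    # set a user password, then issue the security erase with it
    camcontrol security ada1 -U user -s erasepw -e erasepw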

> Your Intel drive is very very small, and in fact I wouldn't even bother
> to use this drive -- it means you'd only be able to use roughly 14GB of
> it (at most) for data, and leave the remaining 6GB unallocated/unused
> solely for wear levelling.

A small SLC flash based drive might be worth more than a large MLC 
based drive... Just saying.
The SLOG rarely fills the drive, and if you use TRIM, you should be safe.
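
One way to combine both points (keep the SLOG small and leave spare 
area for wear levelling) is to give ZFS only a small partition; device 
name and sizes below are illustrative:

    # small 4k-aligned partition for the SLOG; the rest of the SSD
    # stays unallocated as spare area for wear levelling
    gpart create -s gpt ada2
    gpart add -t freebsd-zfs -a 4k -s 8G -l slog0 ada2
    zpool add tank log gpt/slog0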

> What you're not taking into consideration is how log and cache devices
> bottleneck ZFS, in addition to the fact that SATA is not like SAS when
> it comes to simultaneous R/W.  That poor OCZ drive...

With a proper setup, there is really no bottleneck. For the cache 
device, it is advisable to set

vfs.zfs.l2arc_norw=0

as otherwise data will not be read from the L2ARC while something is 
being written to it. This is problematic with metadata; for other data, 
you just don't get the performance you could get from having an SSD.
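
This is a run-time sysctl on FreeBSD, so a sketch of setting it 
(persisting it via /etc/sysctl.conf is an assumption about your setup):

    # allow L2ARC reads while writes to it are in flight
    sysctl vfs.zfs.l2arc_norw=0

    # to persist across reboots, add to /etc/sysctl.conf:
    #   vfs.zfs.l2arc_norw=0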

As for mixing SLOG and L2ARC... I have always thought it is a bad idea 
to do so. There are two reasons to have a SLOG:

1. To reduce latency. By combining SLOG and L2ARC on the same device 
you might not have enough IOPS to keep latency low, and consumer grade 
SSDs tend not to have consistent latency anyway. Some newer drives are 
promising, for example the OCZ Vector, or better yet the Intel 
S3500/S3700.

2. To reduce ZFS pool fragmentation. This is very important and often 
very much overlooked. If you want ZFS to perform well, you are better 
off having a separate log device, even if it is on a rotating disk (you 
only lose the low latency!). ZFS pool fragmentation can become a problem 
for long-lived pools.

Mirroring the SLOG is just a safeguard against losing the last few 
seconds of really important writes. But if you can afford it, just do it.
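
A minimal sketch of adding a mirrored log and a separate cache device 
(pool and device names are hypothetical):

    # mirrored SLOG out of two small SSD partitions
    zpool add tank log mirror gpt/slog0 gpt/slog1

    # L2ARC on its own device; cache devices cannot be mirrored,
    # and losing one is harmless
    zpool add tank cache gpt/l2arc0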

Considering the small size of this pool, however, I do not believe using 
one SSD for both SLOG and L2ARC will be a serious bottleneck, unless 
real-life observation says otherwise.



Daniel

