getting to 4K disk blocks in ZFS

Stefan Esser se at freebsd.org
Wed Sep 10 07:49:00 UTC 2014


On 10.09.2014 at 08:46, Aristedes Maniatis wrote:
> As we all know, it is important to ensure that modern disks are set
> up properly with the correct block size. Everything is good if all
> the disks and the pool are "ashift=9" (512 byte blocks). But as soon
> as one new drive requires 4k blocks, performance drops through the
> floor of the entire pool.
>
>
> In order to upgrade there appear to be two separate things that must
> be done for a ZFS pool.
>
> 1. Create partitions on 4K boundaries. This is simple with the
> "-a 4k" option in gpart, and it isn't hard to remove disks one at a
> time from a pool, reformat them on the right boundaries and put them
> back. Hopefully you've left a few spare bytes on the disk to ensure
> that your partition doesn't get smaller when you reinsert it to the
> pool.
>
> 2. Create a brand new pool which has ashift=12 and zfs send|receive
> all the data over.
>
>
> I guess I don't understand enough about zpool to know why the pool
> itself has a block size, since I understood ZFS to have variable
> stripe widths.

I'm not a ZFS internals expert, just a long-time user, but I'll try
to answer your questions.

ZFS is based on a copy-on-write paradigm, which ensures that no data
is ever overwritten in place. All writes go to new, blank blocks, and
only after the last reference to an "old" block is lost (when no TXG
or snapshot refers to it any longer) is the old block freed and put
back on the free block map.

ZFS uses variable block sizes by breaking large blocks down into
smaller fragments, as suits the data to be stored. The largest block
size is configurable (128 KByte by default) and the smallest fragment
is the sector size (i.e. 512 or 4096 bytes), as configured by "ashift".
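
As an illustration (device and pool names below are only examples),
this is roughly how you can check what a drive reports and which
ashift a pool's vdevs were created with on FreeBSD:

  # Physical vs. logical sector size as reported by the drive
  # (many 4K drives still report 512 byte logical sectors):
  diskinfo -v /dev/ada0 | egrep 'sectorsize|stripesize'
  camcontrol identify ada0 | grep 'sector size'

  # The ashift each vdev was created with ("tank" is just an example):
  zdb -C tank | grep ashift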

The problem with 4K sector disks that report 512 byte sectors is that
ZFS still assumes no data is ever overwritten in place, while the
disk drive does exactly that behind the scenes. ZFS thinks it can
atomically write 512 bytes, but the drive reads the full 4K physical
sector, places the 512 bytes of new data within that sector in the
drive's cache, and then writes the 4K of data back in one go.

The cost is not only the latency of this read-modify-write sequence,
but also that an elementary ZFS assumption is violated: data in the
other (logical) 512 byte sectors of the physical 4 KByte sector can
be lost if that write operation fails, resulting in data loss in
files that merely happen to share the physical sector with the one
that received the write.

This may never hit you, but ZFS is built on the assumption that it
cannot happen at all, which is no longer true for 4 KB drives that
are used with ashift=9.

> The problem with step 2 is that you need to have enough hard disks
> spare to create a whole new pool and throw away the old disks. Plus
> a disk controller with lots of spare ports. Plus the ability to take
> the system offline for hours or days while the migration happens.
>
> One way to reduce this slightly is to create a new pool with reduced
> redundancy. For example, create a RAIDZ2 with two fake disks, then
> off-line those disks.

Both methods are dangerous! Studies have found that the risk of
another disk failing during resilvering is substantial. That was the
reason for introducing RAIDZ groups with higher redundancy (raidz2,
raidz3).

With 1) you have to copy the data multiple times, and the load could
lead to the loss of one of the source drives (and since you are in
the process of overwriting the drive that provided redundancy, you
lose your pool that way).

Copying to a degraded pool as you describe in 2) is a possibility
(and I've done it once). You should make sure that all source data
remains available until the "new" pool has successfully resilvered
with the fake disks replaced. You could do this by moving the
redundant disks from the old pool to the new pool (i.e. degrading the
old pool, after all data has been copied, and using the freed drives
to complete the new pool); see the sketch below. But this assumes
that the drive technologies match - I'll soon go from 4*2TB to 3*4TB
(raidz1 in both cases), since I had 2 of the 2TB drives fail over the
course of the last year (replaced under warranty).
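
For reference, a rough sketch of that degraded-pool migration on
FreeBSD (all pool, label and device names here are made up, the
redundancy level is only an example, and it assumes the new disks are
already partitioned 4K-aligned and the ashift handling described
further below has been taken care of):

  # Sparse file standing in for the missing disk; size it exactly
  # like the real partitions, or the later replace may fail because
  # the replacement looks too small:
  truncate -s `diskinfo /dev/gpt/new0 | awk '{print $3}'` /tmp/fake0

  # Create the new pool degraded from the start (-f, because the
  # vdev mixes a file with real disks), then offline the fake disk:
  zpool create -f newpool raidz2 gpt/new0 gpt/new1 gpt/new2 /tmp/fake0
  zpool offline newpool /tmp/fake0
  rm /tmp/fake0

  # Copy everything over via a recursive snapshot:
  zfs snapshot -r oldpool@migrate
  zfs send -R oldpool@migrate | zfs receive -F newpool

  # Only after verifying the copy: take a redundant disk out of the
  # old pool, repartition it 4K-aligned and let it resilver in place
  # of the fake disk:
  zpool offline oldpool gpt/old3
  gpart destroy -F ada3
  gpart create -s gpt ada3
  gpart add -t freebsd-zfs -a 4k -l new3 ada3
  zpool replace newpool /tmp/fake0 gpt/new3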

> So, given how much this problem sucks (it is extremely easy to add
> a 4K disk by mistake as a replacement for a failed disk), and how
> painful the workaround is... will ZFS ever gain the ability to change
> block size for the pool? Or is this so deep in the internals of ZFS
> it is as likely as being able to dynamically add disks to an existing
> zvol in the "never going to happen" basket?

You can add a 4 KB physical drive that emulates 512 byte sectors
(nearly all drives do) to an ashift=9 ZFS pool, but performance
will suffer and you'll be violating a ZFS assumption as explained
above.
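
For example (device and label names made up), replacing a failed
member with such a 512-byte-emulating drive is just an ordinary
replace; ZFS will not refuse it, it will merely be slow:

  # New drive, partitioned 4K-aligned as suggested above, then
  # swapped in for the failed member of the ashift=9 pool:
  gpart create -s gpt ada4
  gpart add -t freebsd-zfs -a 4k -l spare0 ada4
  zpool replace tank gpt/failed0 gpt/spare0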

> And secondly, is it also bad to have ashift 9 disks inside a ashift
> 12 pool? That is, do we need to replace all our disks in one go and
> forever keep big sticky labels on each disk so we never mix them?

The ashift parameter is per pool, not per disk. You can have a
drive with emulated 512 byte sectors in an ashift=9 pool, but
you cannot change the ashift value of a pool after creation.
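
For completeness, a sketch of forcing ashift=12 when creating a new
pool on a reasonably recent FreeBSD (names and sizes are only
examples; older systems without this sysctl can instead use the
gnop(8) trick of layering a 4096 byte sector provider over the
partitions for the "zpool create"):

  # 4K-aligned partition, slightly smaller than the disk so a future
  # replacement drive cannot turn out to be "too small" (the size
  # here is just an example for a 4 TB drive):
  gpart create -s gpt ada2
  gpart add -t freebsd-zfs -a 4k -s 3725g -l disk0 ada2

  # Make ZFS use at least 4K sectors for newly created vdevs:
  sysctl vfs.zfs.min_auto_ashift=12

  # Create the pool and verify the result:
  zpool create tank raidz1 gpt/disk0 gpt/disk1 gpt/disk2
  zdb -C tank | grep ashift      # should report ashift: 12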

Regards, Stefan

