getting to 4K disk blocks in ZFS

Thu Sep 11 02:26:43 UTC 2014

> On Sep 10, 2014, at 12:46 AM, Aristedes Maniatis <ari at ish.com.au> wrote:
> 
> As we all know, it is important to ensure that modern disks are set up properly with the correct block size. Everything is good if all the disks and the pool are "ashift=9" (512 byte blocks). But as soon as one new drive requires 4k blocks, the performance of the entire pool drops through the floor.
> 
> 
> In order to upgrade there appear to be two separate things that must be done for a ZFS pool.
> 
> 1. Create partitions on 4K boundaries. This is simple with the "-a 4k" option in gpart, and it isn't hard to remove disks one at a time from a pool, reformat them on the right boundaries and put them back. Hopefully you've left a few spare bytes on the disk to ensure that your partition doesn't get smaller when you reinsert it into the pool.
> 
> 2. Create a brand new pool which has ashift=12 and zfs send|receive all the data over.
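
For reference, the arithmetic behind both steps is small enough to check by hand. The sketch below assumes ashift is the base-2 logarithm of the smallest block ZFS writes to a vdev, and that a partition counts as aligned when its starting byte offset is a multiple of 4096. The function names are made up for illustration, and the sector numbers are only examples of what "gpart show" might report.

```shell
#!/bin/sh
# Illustrative helpers only; the names are hypothetical.

# Smallest block size in bytes for a given ashift (2 to the power ashift).
block_size() { echo $((1 << $1)); }

# Succeeds when a partition start is 4K-aligned.
# $1 = first sector of the partition, $2 = logical sector size in bytes.
is_4k_aligned() { [ $(( ($1 * $2) % 4096 )) -eq 0 ]; }

block_size 9     # prints 512
block_size 12    # prints 4096

# Sector 34 starts at byte 17408, which is not a multiple of 4096.
is_4k_aligned 34 512 && echo "sector 34: aligned" || echo "sector 34: misaligned"
# Sector 40 starts at byte 20480, which is 5 * 4096.
is_4k_aligned 40 512 && echo "sector 40: aligned" || echo "sector 40: misaligned"
```

With gpart, something along the lines of "gpart add -t freebsd-zfs -a 4k ada1" (device name hypothetical) should round the start of the new partition to such a boundary.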
> 
> 
> I guess I don't understand enough about zpool to know why the pool itself has a block size, since I understood ZFS to have variable stripe widths.
> 
> The problem with step 2 is that you need to have enough hard disks spare to create a whole new pool and throw away the old disks. Plus a disk controller with lots of spare ports. Plus the ability to take the system offline for hours or days while the migration happens.
> 
> One way to reduce this slightly is to create a new pool with reduced redundancy. For example, create a RAIDZ2 with two fake disks, then offline those disks.
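
As a rough sketch of the reduced-redundancy idea quoted above: the script below only prints a plan, it does not run anything. The pool name, the gpt/new* labels, the file paths and the 2T size are all placeholders, not anything from the original setup. It assumes the placeholder files are at least as large as the real disks, and that the system has already been told to use ashift=12 for new vdevs, which is platform-specific (for example "sysctl vfs.zfs.min_auto_ashift=12" on FreeBSD versions that have it, or "-o ashift=12" on zpool create where that is supported).

```shell
#!/bin/sh
# Sketch only: prints the plan instead of running it.
# NEW, the gpt/new* labels, the /tmp paths and the 2T size are hypothetical.
NEW=${NEW:-tank4k}

degraded_pool_plan() {
    cat <<EOF
# Sparse files stand in for the two disks still busy in the old pool.
truncate -s 2T /tmp/fake0 /tmp/fake1
# -f because zpool normally refuses to mix files and disks in one vdev.
zpool create -f ${NEW} raidz2 gpt/new0 gpt/new1 gpt/new2 gpt/new3 /tmp/fake0 /tmp/fake1
# Take the placeholders offline before any data lands on them.
zpool offline ${NEW} /tmp/fake0
zpool offline ${NEW} /tmp/fake1
# Confirm the vdev really got ashift 12 before copying anything.
zdb -C ${NEW} | grep ashift
EOF
}

degraded_pool_plan
```

Once the old pool has been destroyed, its disks can take the place of the offline placeholders with "zpool replace". Until that resilver finishes, the new RAIDZ2 has no redundancy left at all.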

Lots of good info in other responses; I just wanted to address this part of your message.

It should be a given that good backups are a requirement before you start any of this. _Especially_ if you have to destroy the old pool in order to provide redundancy for the new pool.

I have done this ashift conversion and it was a bit of a nail-biting experience, as you anticipated. The one suggestion I have for improving on the above is to use snapshots to minimize the downtime. Take a snapshot and send a full initial copy of the pool during off-peak hours (if any); then you only need to take the system down long enough to send a "final" incremental snapshot.
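
A minimal sketch of that two-phase copy, assuming the old pool is called tank and the new one tank4k (both names, and the snapshot names, are placeholders). The script prints the commands rather than running them, so the flags can be checked against your own zfs manual page before anything touches real data.

```shell
#!/bin/sh
# Sketch only: prints the commands rather than running them.
# OLD, NEW and the snapshot names are hypothetical placeholders.
OLD=${OLD:-tank}
NEW=${NEW:-tank4k}

migration_plan() {
    cat <<EOF
# Phase 1, system still in service: full copy from a recursive snapshot.
zfs snapshot -r ${OLD}@migrate1
zfs send -R ${OLD}@migrate1 | zfs receive -F ${NEW}
# Phase 2, services stopped: send only what changed since migrate1.
zfs snapshot -r ${OLD}@migrate2
zfs send -R -i ${OLD}@migrate1 ${OLD}@migrate2 | zfs receive -F ${NEW}
EOF
}

migration_plan
```

The first send can take hours or days without affecting service. The downtime window only has to cover the second, incremental send, which should be roughly proportional to the amount of data written since the first snapshot.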

JN