getting to 4K disk blocks in ZFS

Johan Hendriks joh.hendriks at gmail.com
Thu Sep 11 07:12:10 UTC 2014


Op 10-09-14 om 09:48 schreef Stefan Esser:
> Am 10.09.2014 um 08:46 schrieb Aristedes Maniatis:
>> As we all know, it is important to ensure that modern disks are set
>> up properly with the correct block size. Everything is good if all
>> the disks and the pool are "ashift=9" (512 byte blocks). But as soon
>> as one new drive requires 4k blocks, the performance of the entire
>> pool drops through the floor.
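
(A quick way to see what you currently have is to dump the cached pool
configuration and look at the per-vdev ashift values - the pool name
"tank" below is just a placeholder:

    zdb -C tank | grep ashift

ashift=9 means 512 byte allocation units, ashift=12 means 4 KByte. If
zdb cannot find the pool, pointing it at the cache file with
-U /boot/zfs/zpool.cache usually helps.)
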
>>
>>
>> In order to upgrade there appear to be two separate things that must
>> be done for a ZFS pool.
>>
>> 1. Create partitions on 4K boundaries. This is simple with the
>> "-a 4k" option in gpart, and it isn't hard to remove disks one at a
>> time from a pool, reformat them on the right boundaries and put them
>> back. Hopefully you've left a few spare bytes on the disk to ensure
>> that your partition doesn't get smaller when you reinsert it into
>> the pool.
>>
>> 2. Create a brand new pool which has ashift=12 and zfs send|receive
>> all the data over.
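
For reference, a rough sketch of both steps as I understand them; the
pool names, the disk "ada1" and the GPT labels are made up, so adapt
them to your own layout and try it on scratch disks first:

    # Step 1: redo one disk with 4K-aligned partitions, then resilver.
    zpool offline tank gpt/disk1
    gpart destroy -F ada1
    gpart create -s gpt ada1
    gpart add -t freebsd-zfs -a 4k -l disk1 ada1
    zpool replace tank gpt/disk1
    # wait until "zpool status tank" shows the resilver has completed

    # Step 2: build a new pool that gets ashift=12 and copy the data.
    sysctl vfs.zfs.min_auto_ashift=12   # recent FreeBSD: force 4K allocation
    zpool create newtank raidz2 gpt/new0 gpt/new1 gpt/new2 gpt/new3
    zfs snapshot -r tank@migrate
    zfs send -R tank@migrate | zfs receive -F newtank
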
>>
>>
>> I guess I don't understand enough about zpool to know why the pool
>> itself has a block size, since I understood ZFS to have variable
>> stripe widths.
> I'm not a ZFS internals expert, just a long-time user, but I'll try to
> answer your questions.
>
> ZFS is based on a copy-on-write paradigm, which ensures that no data
> is ever overwritten in place. All writes go to new blank blocks, and
> only after the last reference to an "old" block is lost (when no TXG
> or snapshot references it any more) is the old block freed and put
> back on the free block map.
>
> ZFS uses variable block sizes by breaking large blocks down into
> smaller fragments as suitable for the data to be stored. The largest
> block size is configurable (128 KByte by default) and the smallest
> fragment is the sector size (i.e. 512 or 4096 bytes), as configured
> by "ashift".
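
The 128 KByte maximum mentioned above is the per-dataset "recordsize"
property; a quick example, with a made-up dataset name:

    zfs get recordsize tank/data
    zfs set recordsize=16K tank/data   # only affects newly written blocks
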
>
> The problem with 4K sector disks that report 512 byte sectors is that
> ZFS still assumes that no data is overwritten in place, while the disk
> drive does exactly that behind the scenes. ZFS thinks it can atomically
> write 512 bytes, but the drive reads 4K, places the 512 bytes of data
> within that 4K physical sector in the drive's cache, and then writes
> back the 4K of data in one go.
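
On FreeBSD you can usually see what a drive really reports with
diskinfo (device name made up):

    diskinfo -v ada1 | egrep 'sectorsize|stripesize'
    # sectorsize: logical sector size, typically 512
    # stripesize: physical sector size, 4096 on most "512e" drives

Beware that some drives report a stripesize of 0 or 512 even though
they are physically 4K internally.
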
>
> The cost is not only the latency of this read-modify-write sequence,
> but also that an elementary ZFS assumption is violated: data that
> lives in other (logical) 512 byte sectors of the physical 4 KByte
> sector can be lost if that write operation fails, resulting in loss
> of data in files that merely happen to share the physical sector with
> the one that received the write.
>
> This may never hit you, but ZFS is built on the assumption that it
> cannot happen at all, which is no longer true with 4KB drives that
> are used with ashift=9.
>
>> The problem with step 2 is that you need to have enough hard disks
>> spare to create a whole new pool and throw away the old disks. Plus
>> a disk controller with lots of spare ports. Plus the ability to take
>> the system offline for hours or days while the migration happens.
>>
>> One way to reduce this slightly is to create a new pool with reduced
>> redundancy. For example, create a RAIDZ2 with two fake disks, then
>> offline those disks.
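
A rough sketch of that fake-disk trick, with made-up names (the sparse
files only need to exist long enough for zpool create to label them):

    truncate -s 2T /tmp/fake0 /tmp/fake1   # at least as large as the real disks
    sysctl vfs.zfs.min_auto_ashift=12      # recent FreeBSD: get ashift=12
    zpool create newtank raidz2 gpt/new0 gpt/new1 /tmp/fake0 /tmp/fake1
    zpool offline newtank /tmp/fake0       # degrade the pool on purpose,
    zpool offline newtank /tmp/fake1       # so the fakes are never written
    rm /tmp/fake0 /tmp/fake1
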
> Both methods are dangerous! Studies have found that the risk of
> another disk failure during resilvering is substantial. That was
> the reason for introducing RAIDZ levels with higher redundancy
> (raidz2, raidz3).
>
> With 1) you have to copy the data multiple times, and the load
> could lead to the loss of one of the source drives (and since you
> are in the process of overwriting the drive that provided the
> redundancy, you lose your pool that way).
>
> The copying to a degraded pool that you describe in 2) is a
> possibility (and I've done it, once). You should make sure that
> all source data is still available until the "new" pool has
> successfully resilvered with the fake disks replaced. You could do
> this by moving the redundant disks from the old pool to the new pool
> (i.e. degrading the old pool, after all data has been copied, and
> using its redundant drives to complete the new pool). But this
> assumes that the technologies of the drives match - I'll soon go
> from 4*2TB to 3*4TB (raidz1 in both cases), since I had 2 of the
> 2TB drives fail over the course of last year (replaced under
> warranty).
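
In command form that last step might look roughly like this (again
with made-up names, and only after the send/receive has completed and
been verified):

    zpool offline tank gpt/old3          # give up one redundant old drive
    gpart destroy -F ada3                # repartition it 4K-aligned
    gpart create -s gpt ada3
    gpart add -t freebsd-zfs -a 4k -l new2 ada3
    zpool replace newtank /tmp/fake0 gpt/new2
    # let the resilver finish before touching any further drive
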
>
>> So, given how much this problem sucks (it is extremely easy to add
>> a 4K disk by mistake as a replacement for a failed disk), and how
>> painful the workaround is... will ZFS ever gain the ability to change
>> the block size of a pool? Or is this so deep in the internals of ZFS
>> that it is as likely as being able to dynamically add disks to an
>> existing vdev, i.e. in the "never going to happen" basket?
> You can add a 4 KB physical drive that emulates 512 byte sectors
> (nearly all drives do) to an ashift=9 ZFS pool, but performance
> will suffer and you'll be violating a ZFS assumption as explained
> above.
>
>> And secondly, is it also bad to have ashift=9 disks inside an
>> ashift=12 pool? That is, do we need to replace all our disks in one
>> go and forever keep big sticky labels on each disk so we never mix
>> them?
> The ashift parameter is per pool, not per disk. You can have a
> drive with emulated 512 byte sectors in an ashift=9 pool, but
> you cannot change the ashift value of a pool after creation.
The ashift parameter is per vdev, not per pool; each vdev gets its
ashift when it is created, and it cannot be changed afterwards.
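
On a reasonably recent FreeBSD you can at least control the ashift of
every vdev you create or add from now on (existing vdevs keep the
ashift they were created with), e.g.:

    sysctl vfs.zfs.min_auto_ashift=12   # new vdevs get ashift >= 12
    zdb -C tank | grep ashift           # zdb reports one ashift per top-level vdev
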


> Regards, Stefan


