AF (4096 byte sector) drives: Can you mix/match in a ZFS pool?

Daniel Kalchev daniel at digsys.bg
Wed Oct 12 18:38:14 UTC 2011


On Oct 12, 2011, at 20:29 , Jeremy Chadwick wrote:

>> The gnop trick is used not because you will ask a 512-byte sector
>> drive to write 8 sectors with one I/O, but because you may ask an
>> 4096-byte sector drive to write only 512 bytes -- which for the
>> drive means it has to read 4096 bytes, modify 512 of these bytes and
>> write back 4096 bytes.
> 
> If I'm reading this correctly, you're effectively stating ashift
> actually just defines (or helps in calculating) an LBA offset for the
> start of the pool-related data on that device?  "ashift" seems like a
> badly-named term/variable for what this does, but oh well.

ashift defines the minimum block size of the vdev. The name is fine, I believe, as it describes how one gets a power-of-2 size (by shifting 1 left that many times) :-)
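The shift-to-block-size relationship can be seen with plain shell arithmetic:

```shell
# ashift is an exponent: the vdev's minimum block size is 2^ashift,
# i.e. 1 shifted left by ashift bits.
echo $((1 << 9))    # ashift=9  -> 512-byte blocks
echo $((1 << 12))   # ashift=12 -> 4096-byte blocks
```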

>> The proper way to handle this is to create your zpool with 4096-byte
>> alignment, that is, for the time being by using the above gnop
>> 'hack'.
> 
> ...which brings into question why this is needed at all, meaning, why
> the ZFS code cannot be changed to default to an ashift value that's
> calculated as 12 (or equivalent) regardless of 512-byte or 4096-byte
> sector drives.

Currently the ZFS block size ranges from 512 bytes to 128 kilobytes. That corresponds to an ashift of 9. With an ashift of 12, the minimum block size is effectively 4k while the maximum is still 128k.


> How was this addressed on Solaris/OpenSolaris?
> 

I don't think they do.

>> There should be no implications to having one vdev with 512 byte
>> alignment and another with 4096 byte alignment. ZFS is smart enough
>> to issue minimum of 512 byte writes to the former and 4096 bytes to
>> the latter thus not creating any bottleneck.
> 
> How does ZFS determine this?  I was under the impression that this
> behaviour was determined by (or "assisted by") shift.

ZFS has a piece of data to write, say a 20 kbyte block. Say you have 4 vdevs: two with ashift=9 (512 bytes) and two with ashift=12 (4096 bytes). All other factors ignored (equal-size vdevs, equally full, etc.), it has to write a minimum of 9 KB (512+512+4096+4096) -- apparently ZFS wants to fill all vdevs equally, so it will likely issue one 4k write to vdev1, one 4k to vdev2, two 512b writes to vdev3 and two 512b to vdev4.

If, for example, it had 16k to write, it would write one 4k I/O to each of the 4k vdevs and 4 x 512b I/Os (or a single 4k write, depending on the layering abstraction) to the 512b vdevs.

So yes, it is assisted by ashift.
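The per-vdev rounding above can be sketched in shell; `roundup` is a hypothetical helper for illustration, not ZFS code:

```shell
# roundup REQUEST MINBLOCK: bytes a vdev must actually write for a
# REQUEST-byte write, given a minimum block size of MINBLOCK (2^ashift).
roundup() { echo $(( ( ($1 + $2 - 1) / $2 ) * $2 )); }

roundup 512 4096    # a 512b request on an ashift=12 vdev costs 4096 bytes
roundup 512 512     # the same request on an ashift=9 vdev costs only 512
roundup 16384 4096  # a 16k write fits 4k blocks exactly: 16384
```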

But, for the time being, you need to assist ZFS in creating the vdevs with the proper ashift value. This is because today's 4k drives lie and report 512b sectors. As mentioned, there are patches for FreeBSD to 'discover' this behavior. Another approach is via gnop -- that works only at vdev creation time. I haven't seen anything like this for Solaris.
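For reference, the gnop sequence on FreeBSD looks roughly like this (device and pool names are examples only; run as root):

```shell
# Create a transparent gnop provider that advertises 4096-byte sectors
gnop create -S 4096 /dev/ada0
# Build the pool on the .nop device so ZFS picks ashift=12
zpool create tank /dev/ada0.nop
# The ashift is recorded in the vdev label, so gnop is no longer needed
zpool export tank
gnop destroy /dev/ada0.nop
zpool import tank
# Verify the chosen value (should show ashift: 12)
zdb -C tank | grep ashift
```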

Daniel


More information about the freebsd-fs mailing list