ZFS RAIDz space lost to parity WAS: raid5 vs. ZFS raidz

Paul Kraus paul at kraus-haus.org
Wed Aug 6 23:30:55 UTC 2014


On Aug 6, 2014, at 1:56, Scott Bennett <bennett at sdf.org> wrote:

> Arthur Chance <freebsd at qeng-ho.org> wrote:

>> Quite right. If you have N disks in a RAIDZx configuration, the fraction 
>> used for data is (N-x)/N and the fraction for parity is x/N. There's 
>> always overhead for the file system bookkeeping of course, but that's 
>> not specific to ZFS or RAID.

But ZFS does NOT use fixed-width stripes across the devices in the RAIDz<n> vdev. The stripe size changes based on the number of devices and the size of the write operation. ZFS adds parity and padding to make the data fit across the devices.

>     I wonder if what varies is the amount of space taken up by the
> checksums.  If there's a checksum for each block, then the block size
> would change the fraction of the space lost to checksums, and the parity
> for the checksums would thus also change.  Enough to matter?  Maybe.

Nope, the size of the checksum does NOT vary with the vdev configuration.

Going back to Matt’s blog again (and I agree that his use of the term “n-sector block” is confusing).

http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/

Read the blog, don’t just look at the charts :-) My summary is below and may help folks to better understand Matt’s text.

According to the blog (and I trust Matt in this regard), RAIDz does NOT calculate parity per stripe across devices, but on a write-by-write basis. Matt linked to a descriptive chart: http://blog.delphix.com/matt/files/2014/06/RAIDZ.png … The chart assumes a 5-device RAIDz1. Each color is a different write operation (remember that ZFS is copy-on-write, so every write is a new write; no modifying existing data on disk).

The orange write consists of 8 data blocks and 2 parity blocks. Assuming 512B disk blocks, that is 4KB of data and 1KB of parity. This is a 4KB write operation.

The yellow write is a 1.5KB write (3 data blocks) and 1 parity.

The green is the same as the yellow, just aligned differently.

Note that NOT all columns (drives) are involved in every write (and later read) operation.

The brown write is one data block (512B) and one parity.

The light purple write is 14 data blocks (7KB) and 4 parity.
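
All of those per-write parity counts follow the same pattern: one parity sector for every row of (width - nparity) data sectors the write occupies. Here is a throwaway Python sketch of that pattern (the parity_sectors() helper is my own illustration, not anything out of the ZFS code), assuming the chart's 5-device RAIDz1 and 512B sectors:

from math import ceil

def parity_sectors(data_sectors, width=5, nparity=1):
    # One parity sector per row of (width - nparity) data sectors.
    rows = ceil(data_sectors / (width - nparity))
    return rows * nparity

print(parity_sectors(8))   # orange: 8 data sectors  -> 2 parity
print(parity_sectors(3))   # yellow: 3 data sectors  -> 1 parity
print(parity_sectors(1))   # brown:  1 data sector   -> 1 parity
print(parity_sectors(14))  # purple: 14 data sectors -> 4 parity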

Quoting directly from Matt:

A 11-sector block will use 1 parity + 4 data + 1 parity + 4 data + 1 parity + 3 data (e.g. the blue block in rows 9-12). Note that if there are several blocks sharing what would traditionally be thought of as a single “stripe”, there will be multiple parity blocks in the “stripe”.

RAID-Z also requires that each allocation be a multiple of (p+1), so that when it is freed it does not leave a free segment which is too small to be used (i.e. too small to fit even a single sector of data plus p parity sectors – e.g. the light blue block at left in rows 8-9 with 1 parity + 2 data + 1 padding). Therefore, RAID-Z requires a bit more space for parity and overhead than RAID-4/5/6.
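
Putting the two rules together (parity per row of data, then padding so every allocation is a multiple of p+1), the allocation math works out roughly as in the sketch below. raidz_asize_sectors() is just a name I made up for the illustration; as I understand it this loosely mirrors the asize calculation in the RAIDZ vdev code, but treat it as a sketch, not the implementation:

from math import ceil

def raidz_asize_sectors(data_sectors, width, nparity):
    # nparity parity sectors per row of (width - nparity) data sectors...
    parity = nparity * ceil(data_sectors / (width - nparity))
    total = data_sectors + parity
    # ...then pad the allocation up to a multiple of (nparity + 1).
    pad = -total % (nparity + 1)
    return data_sectors, parity, pad

# Matt's examples, on a 5-device RAIDz1:
print(raidz_asize_sectors(11, 5, 1))  # (11, 3, 0) -- the 11-sector block: 3 parity, no padding
print(raidz_asize_sectors(2, 5, 1))   # (2, 1, 1)  -- the light blue block: 1 parity + 1 padding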

This leads to the spreadsheet: https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674

The column down the left is the filesystem block size in disk sectors (512B sectors), so it goes from a 0.5KB to a 128KB filesystem block size (recordsize is the maximum you set when you tune the ZFS dataset; ZFS can and will write less than full records).

The row across the top is the number of devices in the RAIDz1 vdev (see the other sheets in the workbook for RAIDz2 and RAIDz3).

Keep in mind that the left column is also the size of the data you are writing. If you are using a database with an 8KB recordsize (16 disk sectors) and you have 6 devices per vdev, then you will lose 20% of the raw space to parity (plus additional space for checksums and metadata). The chart further down (rows 29 through 37) shows the same data, but only for the power-of-2 increments.
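
For that 8KB-recordsize, 6-device RAIDz1 example, the same arithmetic (again my own back-of-the-envelope, using the rule from the sketch above) reproduces the 20% figure:

from math import ceil

data = 16                          # 8KB record = 16 x 512B data sectors
parity = 1 * ceil(data / (6 - 1))  # 6-device RAIDz1: 4 parity sectors, one per row of 5 data
total = data + parity              # 20 sectors, already a multiple of 2, so no padding
print(parity / total)              # 0.2 -> 20% of the allocation goes to parity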

So, as Matt says, the more devices you add to a RAIDz vdev, the more net capacity you will have, at the expense of performance. Quoting Matt’s opening:

TL;DR: Choose a RAID-Z stripe width based on your IOPS needs and the amount of space you are willing to devote to parity information. If you need more IOPS, use fewer disks per stripe. If you need more usable space, use more disks per stripe. Trying to optimize your RAID-Z stripe width based on exact numbers is irrelevant in nearly all cases.

and his summary at the end:

The strongest valid recommendation based on exact fitting of blocks into stripes is the following: If you are using RAID-Z with 512-byte sector devices with recordsize=4K or 8K and compression=off (but you probably want compression=lz4): use at least 5 disks with RAIDZ1; use at least 6 disks with RAIDZ2; and use at least 11 disks with RAIDZ3.

Note that you would ONLY use recordsize = 4KB or 8KB if you knew that your workload was ONLY 4 or 8 KB blocks of data (a database).

and finally:

To summarize: Use RAID-Z. Not too wide. Enable compression.

--
Paul Kraus
paul at kraus-haus.org


