gvinum raid5 vs. ZFS raidz

Wed Aug 6 10:11:55 UTC 2014

On 06/08/2014 06:56, Scott Bennett wrote:
> Arthur Chance <freebsd at qeng-ho.org> wrote:
>> On 02/08/2014 11:25, Warren Block wrote:
>>> On Sat, 2 Aug 2014, Scott Bennett wrote:
>>>>      On Tue, 29 Jul 2014 12:01:36 -0400 Paul Kraus <paul at kraus-haus.org>
>>>
>>>>> ZFS parity is handled slightly differently than for traditional
>>>>> raid-5 (as well as the striping of data / parity blocks). So you
>>>>> cannot just count on losing 1, 2, or 3 drives worth of space to
>>>>> parity. See Matt Ahrens's blog entry here
>>>>> http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for
>>>>> (probably) more data on this than you want :-) And here
>>>>> https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674
>>>>> is his spreadsheet that relates space lost due to parity to number of
>>>>> drives in raidz vdev and data block size (yes, the amount of space
>>>>> lost to parity varies with data block, not configured filesystem
>>>>> block size!). There is a separate tab for each of RAIDz1, RAIDz2, and
>>>>> RAIDz3.
>>>>>
>>>> Anyway, using lynx(1), it is very hard to make any sense of the
>>>> spreadsheet.
>>>
>>> Even with a graphical browser, let's say that spreadsheet is not a paragon
>>> of clarity.  It's not clear what "block size in sectors" means in that
>>> context.  Filesystem blocks, presumably, but are sectors physical or
>>> virtual disk blocks, 512 or 4K?  What is that number when using a
>>> standard configuration of a disk with 4K sectors and ashift=12?  It
>>> could be 1, or 8, or maybe something else.
>>>
>>> As I read it, RAIDZ2 with five disks uses somewhere between 67% and 40%
>>> of the data space for redundancy.  The first seems unlikely, but I can't
>>> tell.  Better labels or rearrangement would help.
>>>
>>> A second chart with no labels at all follows the first.  It has only the
>>> power-of-two values in the "block size in sectors" column.  A
>>> restatement of the first one... but it's not clear why.
>>>
>>> My previous understanding was that RAIDZ2 with five disks would leave
>>> 60% of the capacity for data.
>>
>> Quite right. If you have N disks in a RAIDZx configuration, the fraction
>> used for data is (N-x)/N and the fraction for parity is x/N. There's
>> always overhead for the file system bookkeeping of course, but that's
>> not specific to ZFS or RAID.
>
>       I wonder if what varies is the amount of space taken up by the
> checksums.  If there's a checksum for each block, then the block size
> would change the fraction of the space lost to checksums, and the parity
> for the checksums would thus also change.  Enough to matter?  Maybe.

I'm not a file system guru, but my (high level) understanding is as 
follows. Corrections from anyone more knowledgeable welcome.

1. UFS and ZFS both use tree structures to represent files, with the 
data stored at the leaves and bookkeeping stored in the higher nodes. 
Therefore the overhead scales as the log of the data size, which is a 
negligible fraction for any sufficiently large amount of data.

2. UFS doesn't have data checksums; it relies purely on the hardware 
checksums. (This is the area I'm least certain of.)

3. ZFS keeps its checksums in a Merkle tree 
(http://en.wikipedia.org/wiki/Merkle_tree) so the checksums are held in 
the bookkeeping blocks, not in the data blocks. This simply changes the 
constant multiplier in front of the logarithm for the overhead. Also, I 
believe ZFS doesn't use fixed size data blocks, but aggregates writes 
into blocks of up to 128K.
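
To make point 3 concrete, here is a minimal Python sketch of the idea 
that each bookkeeping block holds the checksums of its children, so 
the data blocks themselves carry no checksum. It is only an 
illustration, not ZFS's on-disk format: the helper names are made up, 
SHA-256 is an arbitrary stand-in for whatever checksum the pool uses, 
and the fan-out of 2 is chosen to keep the example small.

```python
import hashlib

CHECKSUM_SIZE = hashlib.sha256().digest_size  # 32 bytes


def checksum(block):
    """Checksum of one block (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(block).digest()


def build_parent(children):
    """A toy bookkeeping block: the concatenated checksums of its children."""
    return b"".join(checksum(child) for child in children)


def verify_child(parent, index, child):
    """Check a child block against the checksum stored in its parent."""
    start = index * CHECKSUM_SIZE
    expected = parent[start:start + CHECKSUM_SIZE]
    return checksum(child) == expected


# Leaves hold only data; checksums live one level up.
data_blocks = [b"block0", b"block1", b"block2", b"block3"]
level1 = [build_parent(data_blocks[0:2]), build_parent(data_blocks[2:4])]
root = build_parent(level1)

print(verify_child(level1[0], 1, b"block1"))     # intact data block
print(verify_child(level1[0], 1, b"bl0ck1"))     # corrupted data block
print(verify_child(root, 0, level1[0]))          # bookkeeping block itself
```

Because every level is checked by the level above it, a damaged 
bookkeeping block is detected the same way as a damaged data block.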

Personally, I don't worry about the overheads of checksumming as the 
cost of the parity stripe(s) in raidz is dominant. It's a cost well 
worth paying though - I have a 3 disk raidz1 pool and a disk went bad 
within 3 months of building it (the manufacturer turned out to be having 
a few problems at the time) but I didn't lose a byte.
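
For anyone who wants to plug in their own numbers, the (N-x)/N rule 
quoted above can be sketched in a few lines of Python. This is only 
the simple whole-disk model; it deliberately ignores the 
block-size-dependent effects described in the blog entry and 
spreadsheet mentioned earlier in the thread, and the function name is 
made up for illustration.

```python
from fractions import Fraction


def raidz_fractions(n_disks, parity):
    """Return (data_fraction, parity_fraction) under the simple (N-x)/N model.

    n_disks is N, the number of disks in the raidz vdev; parity is x
    (1, 2 or 3 for raidz1, raidz2, raidz3).
    """
    if parity not in (1, 2, 3):
        raise ValueError("parity must be 1, 2 or 3")
    if n_disks <= parity:
        raise ValueError("need more disks than parity devices")
    data = Fraction(n_disks - parity, n_disks)
    return data, 1 - data


for n, x in [(5, 2), (3, 1)]:
    data, par = raidz_fractions(n, x)
    print(f"raidz{x} on {n} disks: data {data}, parity {par}")
```

For five disks in raidz2 this gives 3/5 for data, i.e. the 60% figure 
above, and for a three-disk raidz1 it gives 2/3 data and 1/3 parity.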