gvinum raid5 vs. ZFS raidz

Tue Jul 29 16:01:46 UTC 2014

On Jul 29, 2014, at 4:27, Scott Bennett <bennett at sdf.org> wrote:

>     I want to set up a couple of software-based RAID devices across
> identically sized partitions on several disks.  At first I thought that
> gvinum's raid5 would be the way to go, but now that I have finally found
> and read some information about raidz, I am unsure which to choose.  My
> current, and possibly wrong, understanding about the two methods' most
> important features (to me, at least) can be summarized as follows.

Disclaimer: I have experience with ZFS but not with your other alternative (gvinum).

https://www.listbox.com/subscribe/?listname=zfs@lists.illumos.org

> 		raid5					raidz
> 
> Has parity checking, but any parity		Has parity checking *and*
> errors identified are assumed to be		frequently spaced checksums

ZFS checksums all data for errors. If there is redundancy (mirror, raid, copies > 1) ZFS will transparently repair damaged data (but increment the “checksum” error count so you can know via the zpool status command that you *are* hitting errors).
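A rough sketch of the idea, not of how ZFS actually lays things out on disk: keep a checksum apart from the data, verify every copy on read, and rewrite any copy that fails from one that passes. The function names are made up for illustration, and SHA-256 from Python's hashlib is just a convenient stand-in for whatever checksum the pool is configured to use.

```python
import hashlib


def checksum(block: bytes) -> bytes:
    # SHA-256 is an illustrative stand-in, not a claim about the pool's setting.
    return hashlib.sha256(block).digest()


def read_with_repair(copies, expected):
    """Return (data, repaired_count).

    `copies` is a list of bytearrays standing in for the redundant copies
    (e.g. the two sides of a mirror). The first copy whose checksum matches
    `expected` is treated as good; every copy that differs from it is
    overwritten in place and counted, much like a checksum error counter.
    """
    good = None
    for c in copies:
        if checksum(bytes(c)) == expected:
            good = bytes(c)
            break
    if good is None:
        # Without a verifiable copy the error is detected but cannot be repaired.
        raise IOError("every copy failed its checksum")
    repaired = 0
    for c in copies:
        if bytes(c) != good:
            c[:] = good
            repaired += 1
    return good, repaired


if __name__ == "__main__":
    original = b"important data " * 100
    stored = checksum(original)
    side_a, side_b = bytearray(original), bytearray(original)
    side_a[10] ^= 0x01  # a single silent bit flip the drive never reports
    data, repaired = read_with_repair([side_a, side_b], stored)
    print(data == original, repaired)
```

With only one copy and no redundancy, the same check still detects the bad block; it just has nothing to repair it from.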

<snip>

> Can be expanded by the addition of more		Can only be expanded by
> spindles via a "gvinum grow" operation.		replacing all components with
> 						larger components.  The number

All ZFS devices are derived from what are called top-level vdevs (virtual devices). The data is striped across all of the top-level vdevs. Each vdev may be composed of a single drive, a mirror, or a raidz (z1, z2, or z3). So you can create a mixed zpool (not recommended for a variety of reasons) with a different type of vdev for each vdev. Apart from replacing every drive in a single vdev with a larger one and then growing the vdev to fill the new drives, the way to expand any ZFS zpool is to add additional vdevs. So you can create a zpool with one raidz1 vdev and then later add a second raidz1 vdev. Or, more commonly, start with a mirror vdev and then add a second, third, fourth (etc.) mirror vdev.
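To make the two tiers concrete, here is a small Python model of a pool as a list of top-level vdevs. The class and function names are invented for this sketch, and the capacity arithmetic is deliberately idealized (whole drives lost to parity, no metadata or padding), so treat the numbers as an upper bound rather than what zpool would report.

```python
from dataclasses import dataclass

PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


@dataclass
class Vdev:
    kind: str       # "single", "mirror", "raidz1", "raidz2", or "raidz3"
    disks: int      # number of drives in this vdev
    disk_tb: float  # size of each drive in TB (assumed equal within the vdev)

    def usable_tb(self) -> float:
        if self.kind in ("single", "mirror"):
            # A mirror holds one drive's worth of data however wide it is.
            return self.disk_tb
        # Idealized: lose exactly `parity` drives' worth of space.
        return (self.disks - PARITY[self.kind]) * self.disk_tb


def pool_tb(vdevs) -> float:
    # Data is striped across top-level vdevs, so capacities simply add up.
    return sum(v.usable_tb() for v in vdevs)


if __name__ == "__main__":
    pool = [Vdev("mirror", 2, 1.0)]
    print(pool_tb(pool))
    pool.append(Vdev("mirror", 2, 1.0))  # growing the pool = adding a vdev
    print(pool_tb(pool))
```

Growing the pool is just appending another vdev to the list; nothing inside an existing vdev changes.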

It is this two-tier structure that is one of ZFS's strengths. It is also a feature that is not well understood.

<snip>

> Does not support migration to any other		Does not support migration
> RAID levels or their equivalents.		between raidz levels, even by

Correct. Once you have created a vdev, that vdev must remain the same type. You can add mirrors to a mirror vdev, but you cannot add drives to a raidz1, raidz2, or raidz3 vdev or change its raid level.

<snip>

> Does not support additional parity		Supports one (raidz2) or two
> dimensions a la RAID6.				(raidz3) additional parity

ZFS parity is handled slightly differently than for traditional raid-5 (as is the striping of data / parity blocks). So you cannot just count on losing 1, 2, or 3 drives' worth of space to parity. See Matt Ahrens's blog entry here http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for (probably) more data on this than you want :-) And here https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674 is his spreadsheet that relates space lost due to parity to number of drives in the raidz vdev and data block size (yes, the amount of space lost to parity varies with data block size, not configured filesystem block size!). There is a separate tab for each of RAIDz1, RAIDz2, and RAIDz3.
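As a sketch of why the overhead depends on block size, the following Python reproduces the allocation rule as I understand it from that blog post: each block gets parity sectors for every row of data sectors it occupies, and the total is then padded up to a multiple of (parity + 1) sectors. The function names and the 4096-byte sector size are assumptions for illustration; check the spreadsheet for authoritative numbers.

```python
def raidz_alloc_sectors(data_sectors: int, disks: int, parity: int) -> int:
    """Sectors allocated for one block on a raidz vdev (data + parity + padding)."""
    if data_sectors < 1 or disks <= parity:
        raise ValueError("need at least one data sector and more disks than parity")
    data_per_row = disks - parity
    rows = -(-data_sectors // data_per_row)  # ceiling division
    total = data_sectors + parity * rows     # each row carries `parity` parity sectors
    padding = (-total) % (parity + 1)        # round up to a multiple of parity + 1
    return total + padding


def parity_overhead(block_bytes: int, disks: int, parity: int,
                    sector_bytes: int = 4096) -> float:
    """Fraction of the allocated space that is not user data."""
    data = -(-block_bytes // sector_bytes)
    alloc = raidz_alloc_sectors(data, disks, parity)
    return (alloc - data) / alloc


if __name__ == "__main__":
    # 5-drive raidz1, 4K sectors: compare a 128K block with a 4K block.
    for size in (128 * 1024, 8 * 1024, 4 * 1024):
        print(size, parity_overhead(size, disks=5, parity=1))
```

Under these assumptions a 128K block on a 5-drive raidz1 loses one fifth of its allocation to parity, while a single-sector 4K block loses half, which is the block-size effect the spreadsheet tabulates.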

<snip>

> Fast performance because each block		Slower performance because each
> is on a separate spindle from the		block is spread across all
> from the previous and next blocks.		spindles a la RAID3, so many
> 						simultaneous I/O operations are
> 						required for each block.

ZFS performance is never that simple, as I/O is requested from the drives in parallel. Unless you are saturating the controller you should be able to keep all the drives busy at once. Also note that ZFS does NOT suffer the RAID-5 read-modify-write penalty on writes, as every write is a new write to disk (there is no modification of existing disk blocks). This is referred to as Copy On Write (COW).

> 				-----------------------
>     I hoped to start with a minimal number of components and eventually
> add more components to increase the space available in the raid5 or raidz
> devices.  Increasing their sizes that way would also increase the total
> percentage of space in the devices devoted to data rather than parity, as
> well as improving the performance enhancement of the striping.  For various
> reasons, having to replace all component spindles with larger-capacity
> components is not a viable method of increasing the size of the raid5 or
> raidz devices in my case.  That would appear to rule out raidz.

Yup.

>     OTOH, the very large-capacity drives available in the last two or
> three years appear not to be very reliable(*) compared to older drives of
> 1 TB or smaller capacities.  gvinum's raid5 appears not to offer good
> protection against, nor any repair of, damaged data blocks.

Yup. Unless you use ZFS, plan on suffering silent data corruption due to the uncorrectable (and undetectable by the drive) error rate of large drives. All drives suffer uncorrectable errors: read errors that the drive itself does not realize are errors. With traditional filesystems this bad data is returned to the OS; in some cases it may cause a filesystem panic, and in others it is simply passed on to the application as bad data. This is one of the HUGE benefits of ZFS: it catches those errors.

<snip>

>   Thanks to three failed external drives and
> apparently not fully reliable replacements, compounded by a bad ports
> update two or three months ago, I have no functioning X11 and no space
> set up any longer in which to build ports to fix the X11 problem, so I
> really want to get the disk situation settled ASAP.  Trying to keep track
> of everything using only syscons and window(1) is wearing my patience
> awfully thin.

My home server is ZFS only. I have 2 drives mirrored for the OS and 5 drives in a raidz2 for data with one hot spare. I have suffered 3 drive failures (all Seagate), two of which took the four drives in my external enclosure offline (damn SATA port multipliers). I have had NO data loss or corruption!

I started like you, wanting to have some drives and add more later. I started with a pair of 1TB drives mirrored, then added a second pair to double my capacity. The problem with 2-way mirrors is that the MTTDL (Mean Time To Data Loss) is much lower than with RAIDz2, with similar cost in space for a 4 disk configuration. After I had a drive fail in the mirror configuration, I ordered a replacement and crossed my fingers that the other half of *that* mirror would not fail (the pairs of drives in the mirrors were the same make / model bought at the same time … not a good bet for reliability). When I got the replacement drive(s) I took some time and rebuilt my configuration to better handle growth and reliability by going from a 4 disk 2-way mirror configuration to a 5 disk RAIDz2. I went from net about 2TB to net about 3TB capacity and a hot spare.

If being able to easily grow capacity is the primary goal I would go with a 2-way mirror configuration and always include a hot spare (so that *when* a drive fails it immediately starts resilvering (the ZFS term for syncing) the vdev). Then you can simply add pairs of drives to add capacity. Just make sure that the hot spare is at least as large as the largest drive in use. When you buy drives, always buy from as many different manufacturers and models as you can. I just bought four 2TB drives for my backup server. One is a WD, the other 3 are HGST, but they are four different model drives, so they did not come off the same production line in the same week as each other. If I could have I would have gotten four different manufacturers. I also only buy server class (rated for 24x7 operation with 5 year warranty) drives. The additional cost has been offset by the savings due to being able to have a failed drive replaced under warranty.

> (*) [Last year I got two defective 3 TB drives in a row from Seagate.

Wow, the only time I have seen that kind of failure rate was buying from Newegg when they were packing them badly.

> I ended up settling for a 2 TB Seagate that is still running fine AFAIK.
> While that process was going on, I bought three 2 TB Seagate drives in
> external cases with USB 3.0 interfaces, two of which failed outright
> after about 12 months and have been replaced with two refurbished drives
> under warranty.

Yup, they all replace failed drives with refurb.

As a side note, I have had 6 Seagate ES.2 or ES.3 drives, 2 HGST UltraStar drives, and 2 WD RE4 in service on my home server. I have had 3 of the Seagates fail (and one of the Seagate replacements has failed, still under warranty). I have not had any HGST or WD drives fail (and they both have better performance than the Seagates). This does not mean that I do not buy Seagate drives. I spread my purchases around, keeping to the 24x7 5 year warranty drives, and follow up when I have a failure.

>  While waiting for those replacements to arrive, I bought
> a 2 TB Samsung drive in an external case with a USB 3.0 interface.  I
> discovered by chance that copying very large files to these drives is an
> error-prone process.

I would suspect a problem in the USB 3.0 layer, but that is just a guess.

>  A roughly 1.1 TB file on the one surviving external
> Seagate drive from last year's purchase of three, when copied to the
> Samsung drive, showed no I/O errors during the copy operation.  However,
> a comparison check using "cmp -l -z originalfile copyoforiginal" shows
> quite a few places where the contents don't match.

ZFS would not tolerate those kinds of errors. On reading the file ZFS would know via the checksum that the file was bad.

>  The same procedure
> applied to one of the refurbished Seagates gives similar results, although
> the locations and numbers of differing bytes are different from those on
> the Samsung drive.  The same procedure applied to the other refurbished
> drive resulted in a good copy the first time, but a later repetition ended
> up with a copied file that differed from the original by a single bit in
> each of two widely separated places in the files.  These problems have
> raised the priority of a self-healing RAID device in my mind.

Self healing RAID will be of little help… See more below

>     I have to say that these are new experiences to me.  The disk drives,
> controllers, etc. that I grew up with all had parity checking in the hardware,
> including the data encoded on the disks, so single-bit errors anywhere in
> the process showed up as hardware I/O errors instantly.  If the errors were
> not eliminated during a limited number of retries, they ended up as permanent
> I/O errors that a human would have to resolve at some point.

What controllers and drives? I have never seen a drive that does NOT have uncorrectable errors (these are undetectable by the drive). I have also never seen a controller that checksums the data. The controllers rely on the drive to report errors. If the drive does not report an error, then the controller trusts the data.

The big difference is that with drives under 1TB the odds of running into an uncorrectable error over the life of the drive are very, very small. The uncorrectable error rate does NOT scale down as the drives scale up. It has been stable at 1 in 10^14 bits read (for cheap drives) to 1 in 10^15 bits read (for good drives) for over the past 10 years (when I started looking at that drive spec). So if the rate is not changing and the total amount of data written / read over the life of the drive is going up by, in some cases, orders of magnitude, the real world occurrence of such errors is increasing.
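A back-of-the-envelope version of that argument in Python, taking the spec to mean one uncorrectable error per 10^14 (or 10^15) bits read and treating errors as independent, which is a simplifying assumption of mine rather than anything a drive vendor promises. The function names are made up for the sketch.

```python
import math


def expected_errors(bytes_read: float, bits_per_error: float) -> float:
    """Expected number of uncorrectable errors when reading `bytes_read` bytes."""
    return bytes_read * 8 / bits_per_error


def prob_at_least_one(bytes_read: float, bits_per_error: float) -> float:
    """Chance of hitting one or more errors, using a Poisson approximation."""
    return -math.expm1(-expected_errors(bytes_read, bits_per_error))


if __name__ == "__main__":
    TB = 1e12  # decimal terabyte, as drive vendors count
    for size_tb in (0.5, 1, 2, 4):
        for rate in (1e14, 1e15):
            print(size_tb, rate,
                  expected_errors(size_tb * TB, rate),
                  prob_at_least_one(size_tb * TB, rate))
```

Under these assumptions one full read of a 1 TB drive at the 10^14 rate expects about 0.08 errors, and the expectation grows linearly with the amount of data read, so both bigger drives and repeated full reads (scrubs, backups, rebuilds) push the odds up.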

>     FWIW, I also discovered that I cannot run two such multi-hour-long
> copy operations in parallel using two separate pairs of drives.  Running
> them together seems to go okay for a while, but eventually always results
> in a panic.  This is on 9.2-STABLE (r264339).  I know that that is not up
> to date, but I can't do anything about that until my disk hardware situation
> is settled.]

I have had mixed luck with large copy operations via USB on FreeBSD 9.x. Under 9.1 I have found it to be completely unreliable. With 9.2 I have managed without too many errors. USB really does not seem to be a good transport for large quantities of data at fast rates. See my rant on USB hubs here: http://pk1048.com/usb-beware/

--
Paul Kraus
paul at kraus-haus.org