gvinum raid5 vs. ZFS raidz

Scott Bennett bennett at sdf.org
Sat Aug 2 06:22:19 UTC 2014


     On Tue, 29 Jul 2014 12:01:36 -0400 Paul Kraus <paul at kraus-haus.org>
wrote:
>On Jul 29, 2014, at 4:27, Scott Bennett <bennett at sdf.org> wrote:
>
>>     I want to set up a couple of software-based RAID devices across
>> identically sized partitions on several disks.  At first I thought that
>> gvinum's raid5 would be the way to go, but now that I have finally found
>> and read some information about raidz, I am unsure which to choose.  My
>> current, and possibly wrong, understanding about the two methods' most
>> important features (to me, at least) can be summarized as follows.
>
>Disclaimer, I have experience with ZFS but not your other alternative.

     Okay, I appreciate the ZFS info anyway.  Maybe someone with gvinum
experience will weigh in at some point.
>
>https://www.listbox.com/subscribe/?listname=zfs@lists.illumos.org

     Thanks.  I'll check into it.
>
>> 		raid5					raidz
>> 
>> Has parity checking, but any parity		Has parity checking *and*
>> errors identified are assumed to be		frequently spaced checksums
>
>ZFS checksums all data for errors. If there is redundancy (mirror, raid, copies > 1) ZFS will transparently repair damaged data (but increment the "checksum" error count so you can know via the zpool status command that you *are* hitting errors).
>
><snip>
>
>> Can be expanded by the addition of more		Can only be expanded by
>> spindles via a "gvinum grow" operation.		replacing all components with
>> 						larger components.  The number
>
>All ZFS devices are derived from what are called top level vdevs (virtual devices). The data is striped across all of the top level vdevs. Each vdev may be composed of a single drive, a mirror, or a raidz (z1, z2, or z3). So you can create a mixed zpool (not recommended for a variety of reasons) with a different type of vdev for each vdev. The way to expand any ZFS zpool is to add additional vdevs (beyond replacing all drives in a single vdev and then growing to fill the new drives). So you can create a zpool with one raidz1 vdev and then later add a second raidz1 vdev. Or more commonly, start with a mirror vdev and then add a second, third, fourth (etc.) mirror vdev.

     [Ouch.  Trying to edit a response into entire paragraphs on single lines
is a drag.]
>
>It is this two tier structure that is one of ZFS's strengths. It is also a feature that is not well understood.
>
     I understood that, but apparently I didn't express it well enough
in my comparison table.  Thanks, though, for the confirmation of what
I wrote.  GEOM devices can be built upon other GEOM devices, too, as
can gvinum devices within some constraints.
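     For my own notes, that two-tier arrangement seems to boil down to
something like the sketch below.  I have not actually run these commands
yet, and the pool name and daN device names are just placeholders:

        # create a pool whose single top-level vdev is a 3-disk raidz1
        zpool create tank raidz1 da1 da2 da3

        # later, stripe a second raidz1 vdev into the pool to grow it
        zpool add tank raidz1 da4 da5 da6

        # the layout, plus per-device READ/WRITE/CKSUM error counters,
        # shows up in:
        zpool status tank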

><snip>
>
>> Does not support migration to any other		Does not support migration
>> RAID levels or their equivalents.		between raidz levels, even by
>
>Correct. Once you have created a vdev, that vdev must remain the same type. You can add mirrors to a mirror vdev, but you cannot add drives to, or change the raid level of, raidz1, raidz2, or raidz3 vdevs.

     Too bad.  Increasing the raidz level ought to be not much more
difficult than growing the raidz device by adding more spindles.  Doing
the latter ought to be no more difficult than doing it with gvinum's
stripe or raid5 devices.  Perhaps the ZFS developers will eventually
implement these capabilities.  (A side thought:  gstripe and graid3
devices ought also to be expandable in this manner, although the resulting
number of graid3 components would still need to be 2^n + 1.)
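     For reference, then, the only way to grow an existing raidz vdev
would be the replace-and-expand route mentioned above.  As I understand
it (untested on my end, device names hypothetical), that looks roughly
like:

        # let the pool grow once every device in the vdev is larger
        zpool set autoexpand=on tank

        # replace each disk in turn, letting the resilver finish
        # before starting on the next one
        zpool replace tank da1 da11
        zpool status tank       # repeat for da2, da3, ... when done

        # if autoexpand was off during the replacements, expand by hand
        zpool online -e tank da11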
>
><snip>
>
>> Does not support additional parity		Supports one (raidz2) or two
>> dimensions a la RAID6.				(raidz3) additional parity
>
>ZFS parity is handled slightly differently than for traditional raid-5 (as well as the striping of data / parity blocks). So you cannot just count on losing 1, 2, or 3 drives worth of space to parity. See Matt Ahrens's blog entry here http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for (probably) more data on this than you want :-) And here https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674 is his spreadsheet that relates space lost due to parity to number of drives in raidz vdev and data block size (yes, the amount of space lost to parity varies with data block size, not configured filesystem block size!). There is a separate tab for each of RAIDz1, RAIDz2, and RAIDz3.
>
     Yes, I had found both of those by following links from the ZFS material
at the freebsd.org web site.  However, lynx(1) is the only web browser I can
use at present because X11 was screwed on my system by an update that changed
the ABI for the server and various loadable modules, but did not update the
keyboard driver module or the pointing device driver module.  If I start X up,
it rejects those two driver modules due to the incompatible ABIs, so I have
no further influence on the system short of an ACPI shutdown triggered by
pushing the power button briefly.  Until I get the disk situation settled,
I have no easy way to rebuild X11.  Anyway, using lynx(1), it is very hard
to make any sense of the spreadsheet.
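     Since the spreadsheet is hopeless in lynx(1), here is the rough
model I pieced together from the blog posting instead.  This is only my
reading of the allocation rule (each record's data sectors get parity
sectors for every row of at most disks-minus-parity data sectors, and
the total is padded to a multiple of parity+1), so treat the numbers as
illustrative, not authoritative:

        # raidz_overhead disks parity recordsize sectorsize
        raidz_overhead() {
            awk -v n="$1" -v p="$2" -v rec="$3" -v sec="$4" 'BEGIN {
                d = int((rec + sec - 1) / sec)              # data sectors
                par = int((d + (n - p) - 1) / (n - p)) * p  # parity sectors
                tot = d + par
                pad = tot % (p + 1)
                if (pad != 0) tot += (p + 1) - pad          # padding sectors
                printf "raidz%d, %d disks, %dK records: %.1f%% parity+padding\n",
                       p, n, rec / 1024, 100 * (tot - d) / tot
            }'
        }
        raidz_overhead 4 2 131072 4096   # 4-disk raidz2, 128K recs, 4K sectors
        raidz_overhead 6 2 131072 4096   # 6-disk raidz2 for comparison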

><snip>
>
>> Fast performance because each block		Slower performance because each
>> is on a separate spindle from		block is spread across all
>> the previous and next blocks.		spindles a la RAID3, so many
>> 						simultaneous I/O operations are
>> 						required for each block.
>
>ZFS performance is never that simple as I/O is requested from the drive in parallel. Unless you are saturating the controller you should be able to keep all the drives busy at once. Also note that ZFS does NOT suffer the RAID-5 read-modify-write penalty on writes as every write is a new write to disk (there is no modification of existing disk blocks), this is referred to as being Copy On Write (COW).
>
     Again, your use of single-line paragraphs makes it tough to respond to
your several points in-line.
     The information that I read on-line said that each raidz data block is
distributed across all devices in the raidzN device, just like in RAID3 or
RAID4.  That means that, whether reading or writing one data block, *all* of
the drives require a read or a write, not just one as would be the case in
RAID5.  So, to read or write m data blocks, a raidzN device needs an I/O
operation on every spindle for each block, not just m I/O operations in
total.  That was the point I was making
in the table entry above, i.e., ZFS raidz, like RAID3 and RAID4, is many
times as I/O-intensive as RAID5.  In essence, reading or writing 100 data
blocks from a raidz is, at best, no faster than reading 100 blocks from a
single drive.  At worst, there will be bus conflicts leading to overruns and
full rotation delays in the process of gathering all the fragments in a
block, thus performing even slower than a single drive.  I.e., raidzN offers
no speed advantage to using multiple spindles, just like RAID3/RAID4.  In
other words, the data are not really striped but rather distributed in
parallel.  So I guess the question is, was what I read about raidz incorrect,
i.e., are individual data blocks *not* divided into a fragment on each and
every spindle minus the raidz level (number of parity dimensions)?
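     To put a number on the point I was trying to make (and assuming,
for the sake of argument, that my reading is right and every block
really does touch every data spindle):

        # crude random-read comparison for one device group; the 100
        # IOPS per drive is just a placeholder figure, and caching,
        # queue depth, etc. are ignored entirely
        awk 'BEGIN {
            disks = 5; iops = 100
            printf "raidz-style (block spans all data disks): ~%d blocks/s\n", iops
            printf "RAID5-style (one disk per block):  ~%d blocks/s\n", disks * iops
        }'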
>> 				-----------------------
>>     I hoped to start with a minimal number of components and eventually
>> add more components to increase the space available in the raid5 or raidz
>> devices.  Increasing their sizes that way would also increase the total
>> percentage of space in the devices devoted to data rather than parity, as
>> well as improving the performance enhancement of the striping.  For various
>> reasons, having to replace all component spindles with larger-capacity
>> components is not a viable method of increasing the size of the raid5 or
>> raidz devices in my case.  That would appear to rule out raidz.
>
>Yup.

     Bummer.  Oh, well.
>
>>     OTOH, the very large-capacity drives available in the last two or
>> three years appear not to be very reliable(*) compared to older drives of
>> 1 TB or smaller capacities.  gvinum's raid5 appears not to offer good
>> protection against, nor any repair of, damaged data blocks.
>
>Yup. Unless you use ZFS, plan on suffering silent data corruption due to the uncorrectable (and undetectable by the drive) error rate off of large drives. All drives suffer uncorrectable errors, read errors that the drive itself does not realize are errors. With traditional filesystems this bad data is returned to the OS and in some cases may cause a filesystem panic and in others just bad data returned to the application. This is one of the HUGE benefits of ZFS, it catches those errors.
>
     I think you've convinced me right there.  Although RAID[3456] offers
protection against drive failures, it offers no protection against silent
data corruption, which seems to be common on the large-capacity drives on
the market for the last three or four years.
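     As I understand it, the way to make ZFS sweep for that kind of rot
on a schedule, rather than waiting for a bad block to be read in the
normal course of things, is a periodic scrub.  The commands below are
the standard ones; the pool name is a placeholder:

        zpool scrub tank        # read and verify every block in the pool
        zpool status -v tank    # progress, per-device CKSUM counts, and
                                # the names of any files found damaged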

><snip>
>
>>   Thanks to three failed external drives and
>> apparently not fully reliable replacements, compounded by a bad ports
>> update two or three months ago, I have no functioning X11 and no space
>> set up any longer in which to build ports to fix the X11 problem, so I
>> really want to get the disk situation settled ASAP.  Trying to keep track
>> of everything using only syscons and window(1) is wearing my patience
>> awfully thin.
>
>My home server is ZFS only and I have 2 drives mirrored for the OS and 5 drives in a raidz2 for data with one hot spare. I have suffered 3 drive failures (all Seagate), two of which took the four drives in my external enclosure offline (damn sata port multipliers). I have had NO data loss or corruption!

     Bravo, then.  Looks like ZFS raidz is what I need.  Unfortunately,
I only have four drives available for the raidz at present, so it looks
like I'll need to save up for at least one additional drive and probably
two for a raidz2 that doesn't sacrifice an unacceptably high fraction of
the total space to parity blocks. :-(  On my "budget" (ha!) that could be 
several months or more, by which time three of the four I currently have
will be out of warranty.  I suppose more failures could also occur during
that time.  Sigh.
>
>I started like you, wanting to have some drives and add more later. I started with a pair of 1TB drives mirrored, then added a second pair to double my capacity. The problem with 2-way mirrors is that the MTTDL (Mean Time To Data Loss) is much lower than with RAIDz2, with similar cost in spec for a 4 disk configuration. After I had a drive fail in the mirror configuration, I ordered a replacement and crossed my fingers that the other half of *that* mirror would not fail (the pairs of drives in the mirrors were the same make / model bought at the same time, not a good bet for reliability). When I got the replacement drive(s) I took some time and rebuilt my configuration to better handle growth and reliability by going from a 4 disk 2-way mirror configuration to a 5 disk RAIDz2. I went from net about 2TB to net about 3TB capacity and a hot spare.

     Yeah, the mirrors never did look to me to be as good an option either.
>
>If being able to easily grow capacity is the primary goal I would go with a 2-way mirror configuration and always include a hot spare (so that *when* a drive fails it immediately starts resilvering (the ZFS term for syncing) the vdev). Then you can simply add pairs of drives to add capacity. Just make sure that the hot spare is at least as large as the largest drive in use. When you buy drives, always buy from as many different manufacturers and models as you can. I just bought four 2TB drives for my backup server. One is a WD, the other 3 are HGST but they are four different model drives, so that they did not come off the same production line on the same week as each other. If I could have I would have gotten four different manufacturers. I also only buy server class (rated for 24x7 operation with 5 year warranty) drives. The additional cost has been offset by the savings due to being able to have a failed drive replaced under warranty.

     I'm not familiar with HGST, but I will look into their products.
Where does one find the server-class drives for sale?  What sort of
price difference is there between the server-class and the ordinary
drives?
     And yes, I did run across the silly term used in ZFS for rebuilding
a drive's contents. :-}
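     If I do eventually end up with a drive to spare, registering it as
a hot spare looks simple enough (untested here, device names
hypothetical):

        zpool add tank spare da9    # da9 sits idle until it is needed
        zpool status tank           # spares get their own section
        # if the spare does not kick in on its own after a failure,
        # it can be swapped in for the dead drive by hand:
        zpool replace tank da3 da9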
>
>> (*) [Last year I got two defective 3 TB drives in a row from Seagate.
>
>Wow, the only time I have seen that kind of failure rate was buying from Newegg when they were packing them badly.

     At that time, the shop that was getting them for me (to put into a
third-party case with certain interfaces I needed at the time) told me
that, of the 3 TB Seagate drives they had gotten for their own use and
also for sale to customers who wanted them, only roughly 50% survived past
their first 30 days of use, and that none of the Western Digital 3 TB
drives had survived that long.  I concluded that the 3 TB drives were not
yet ready for prime time and should not have been marketed as early as
they were.  That was the reason for my insisting upon a 2 TB Seagate to
fill the third-party case.
>
>> I ended up settling for a 2 TB Seagate that is still running fine AFAIK.
>> While that process was going on, I bought three 2 TB Seagate drives in
>> external cases with USB 3.0 interfaces, two of which failed outright
>> after about 12 months and have been replaced with two refurbished drives
>> under warranty.
>
>Yup, they all replace failed drives with refurb.
>
>As a side note, I have had 6 Seagate ES.2 or ES.3 drives, 2 HGST UltraStar drives, and 2 WD RE4 drives in service on my home server. I have had 3 of the Seagates fail (and one of the Seagate replacements has failed, still under warranty). I have not had any HGST or WD drives fail (and they both have better performance than the Seagates). This does not mean that I do not buy Seagate drives. I spread my purchases around, keeping to the 24x7 5 year warranty drives, and follow up when I have a failure.

     I had a WD 1 TB drive fail last year.  It was just over three years
old at the time.
>
>>  While waiting for those replacements to arrive, I bought
>> a 2 TB Samsung drive in an external case with a USB 3.0 interface.  I
>> discovered by chance that copying very large files to these drives is an
>> error-prone process.
>
>I would suspect the USB 3.0 layer problem, but that is just a guess.

     There has been no evidence to support that conjecture so far.  What
the guy at Samsung/Seagate (they appear to be the same company now) told
me was that what I described did not mean that the drive was bad, but
instead was a common event with large-capacity drives.  He seemed to
think that the problems were associated with long-running series of write
operations, though he had no explanation for that.  It seems to me that
such errors being considered "normal" for these newer, larger-capacity
drives indicates the adoption of a drastically lowered standard of quality,
as compared to just a few years ago.  And if that is to be the way of disks
from now on, then self-correcting file systems will soon become the only
acceptable file systems for production use outside of scratch areas.
>
>>  A roughly 1.1 TB file on the one surviving external
>> Seagate drive from last year's purchase of three, when copied to the
>> Samsung drive, showed no I/O errors during the copy operation.  However,
>> a comparison check using "cmp -l -z originalfile copyoforiginal" shows
>> quite a few places where the contents don't match.
>
>ZFS would not tolerate those kinds of errors. On reading the file ZFS would know via the checksum that the file was bad.

     And ZFS would attempt to rewrite the bad block(s) with the correct
contents?  If so, would it then read back what it had written to make
sure the errors had, in fact, been corrected on the disk(s)?
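     (For anyone following along, the hand-checking I keep referring to
is nothing fancier than the two commands below; sha256(1) over both
copies would catch the same mismatches, just without reporting the byte
offsets.)

        # -l lists every differing byte; -z checks the sizes match first
        cmp -l -z originalfile copyoforiginal

        # or, for a quicker pass/fail answer:
        sha256 originalfile copyoforiginal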
>
>>  The same procedure
>> applied to one of the refurbished Seagates gives similar results, although
>> the locations and numbers of differing bytes are different from those on
>> the Samsung drive.  The same procedure applied to the other refurbished
>> drive resulted in a good copy the first time, but a later repetition ended
>> up with a copied file that differed from the original by a single bit in
>> each of two widely separated places in the files.  These problems have
>> raised the priority of a self-healing RAID device in my mind.
>
>Self healing RAID will be of little help? See more below

     Why would it be of little help?  What you wrote here seems to suggest
that it would be very helpful, at least for dealing with the kind of trouble
that caused me to start this thread.  Was the above just a typo of some kind?
>
>>     I have to say that these are new experiences to me.  The disk drives,
>> controllers, etc. that I grew up with all had parity checking in the hardware,
>> including the data encoded on the disks, so single-bit errors anywhere in
>> the process showed up as hardware I/O errors instantly.  If the errors were
>> not eliminated during a limited number of retries, they ended up as permanent
>> I/O errors that a human would have to resolve at some point.
>
>What controllers and drives? I have never seen a drive that does NOT have uncorrectable errors (these are undetectable by the drive). I have also never seen a controller that checksums the data. The controllers rely on the drive to report errors. If the drive does not report an error, then the controller trusts the data.

     Hmm... [scratches head a moment]  Well, IBM 1311, 2305, 2311, 2314,
3330, 3350, 3380, third-party equivalents of those, DEC RA80, Harris disks
(model numbers forgotten), HP disks (numbers also forgotten), Prime disks
(ditto).  Maybe some others that escape me now.  Tape drives until the
early 1990s that I worked with were all 9-track, so each byte was written
across 8 data tracks and 1 parity track.  Then we got a cartridge-based
system, and I *think* it may have been 10-track (i.e., 2 parity tracks).
Those computers and I/O subsystems and media had a parity bit for each byte
from the CPU and memory all the way out to the oxide on the media.  Anytime
odd parity was broken, the hardware detected it and passed an indication of
the error back to the operating system.
>
>The big difference is that with drives under 1TB the odds of running into an uncorrectable error over the life of the drive are very, very small. The uncorrectable error rate does NOT scale down as the drives scale up. It has been stable at 1 in 10^14 bits (for cheap drives) to 1 in 10^15 bits (for good drives) for the past 10 years (when I started looking at that drive spec). So if the rate is not changing and the total amount of data written / read over the life of the drive has gone up by, in some cases, orders of magnitude, the real world occurrence of such errors is increasing.

     Interesting.  I wondered whether that was all there was to it, rather
than the rate per gigabyte increasing due to the increased recording
density at the larger capacities.  I do think that newer drives have a
shorter MTBF than the drives of a decade ago, however.  I know that none
of the ones I've seen fail had served anywhere near the 300,000+ hours that the
manufacturers were citing as MTBF values for their products.  One
"feature" of many newer drives is an automatic spindown whenever the drive
has been inactive for a short time.  The heating/cooling cycles that result
from these "energy-saving" or "standby" responses look to me like probable
culprits for drive failures.
     The spindowns also mean that such a drive cannot have a paging/swapping
area on it because the kernel will not wait five to ten seconds (while the
drive spins up) for a pagein to complete.  Instead, it will log an error
message on the console and will terminate the process that needed the page.
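     Going back to the error-rate figure for a moment: out of curiosity,
I worked out what that unchanged spec implies for one of these big
drives.  Treating the errors as independent (a rough model, nothing
more), one full pass over a 3 TB drive at the 1-in-10^14-bit rate gives:

        awk 'BEGIN {
            bits = 3e12 * 8             # one full read of a 3 TB drive
            rate = 1e-14                # cheap-drive unrecoverable-error rate
            printf "expected errors: %.2f   P(at least one): %.0f%%\n",
                   bits * rate, 100 * (1 - exp(-bits * rate))
        }'

i.e., about one chance in five of hitting at least one unrecoverable
read error on a single complete pass.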
>
>>     FWIW, I also discovered that I cannot run two such multi-hour-long
>> copy operations in parallel using two separate pairs of drives.  Running
>> them together seems to go okay for a while, but eventually always results
>> in a panic.  This is on 9.2-STABLE (r264339).  I know that that is not up
>> to date, but I can't do anything about that until my disk hardware situation
>> is settled.]
>
>I have had mixed luck with large copy operations via USB on FreeBSD 9.x. Under 9.1 I have found it to be completely unreliable. With 9.2 I have managed without too many errors. USB really does not seem to be a good transport for large quantities of data at fast rates. See my rant on USB hubs here: http://pk1048.com/usb-beware/

     I was referring to kernel panics, not I/O errors.  These very long
copy operations all complete normally when run serially.  The panics
occur only when I run two such copies in parallel.
>
     I took a look at that link.  I've had good luck with Dynex USB 2.0
hubs, both powered and unpowered, but I've only bought their 4-port hubs,
not the 7-ports.  One of mine recently failed after at least five years
of service, possibly as long as seven years.
     However, the only hard drive I currently have connected via USB 2.0
is my oldest external drive, an 80 GB WD drive in an iomega case, and I
have yet to see any problems with it after nearly ten years of mostly
around-the-clock service.  The drives showing the errors I've described
in this thread are all connected via Connectland 4-port USB 3.0 hubs.
     I have some other ZFS questions, but this posting is very long
already, so I'll post them in a separate thread.
     Well, thank you very much for your reply.  I appreciate the helpful
information and perspectives from your actual experiences.  There are
some capabilities that I would very much like to see added to ZFS in the
future, but I think I can live with what it can already do right now, at
least for a few years.  The protection against data corruption, especially
of the silent type, is something I really, really want, and none of the
standard RAID versions seems to offer it, so I guess I'll have to go with
raidz and deal with the performance hit and the lack of a "grow" command
for raidz for now.


                                  Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet:   bennett at sdf.org   *xor*   bennett at freeshell.org  *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."                                               *
*    -- Gov. John Hancock, New York Journal, 28 January 1790         *
**********************************************************************

