jjohnstone.nospamfreebsd at tridentusa.com
Wed Apr 27 16:39:48 UTC 2016
On 4/19/2016 8:59 PM, Kevin P. Neal wrote:
> "Disk failures in the real world:
> What does an MTTF of 1,000,000 hours mean to you?"
> by Bianca Schroeder and Garth A. Gibson
> In FAST'07: 5th USENIX Conference on File and Storage Technologies, San Jose, CA, Feb. 14-16, 2007
> See especially section 5 starting on page 10: "Statistical properties of
> disk failures"
> There. Data has been provided. Just look at that paper. Now, let's see
> your real, hard data that drive failures in an array are reliably independent.
> And remember that the plural of "anecdote" is not "data".
This thread was so long that picking this one to reply to was somewhat
arbitrary.
I don't have any "data" as such, just observations based on my
experience. My company is a third-party maintenance provider. Nearly
all our customers have name-brand hardware from (in alphabetical order,
not percentage-wise) Dell, HP, IBM, Sun and various storage vendors.
Nearly all are server and storage systems, not desktops or laptops. We
have customers with just a few systems and customers with many systems.
Nearly all drives are in redundant configurations under hardware RAID
or SAN controllers. Many systems have just a single mirrorset and there
are systems with several shelves.
Our customers dictate when we replace drives after a failure. The
customers with a few systems don't have rigorous, well-defined policies
that establish when a drive replacement is to take place. The customers
with many systems do have such policies. Nearly all customers are fine
with waiting up to a week to do a drive replacement. This includes
customers that are uptime critical during normal business hours and
could do drive replacements during the week but instead feel it is safer
to wait until a Friday night to do the swap. I'm relating this to give
some perspective about the relative urgency of replacement after a failure.
When a customer opens a call for a drive replacement, for example in a
drive shelf that has 24 drives, we don't typically see another call for
a replacement in that same shelf within any noticeable period of time.
My definition of noticeable is within the next several months. This is
true for systems with just a single mirrorset as well.
There is an opinion that the rebuild after a replacement increases
"load" enough to cause a second drive failure. The term
"drive failure" here really refers to the detection of the
failure. In a light-usage scenario with little I/O, the second
drive failure that's detected may actually have occurred long before the
first failure was detected. As others have said, this is the basis for
a "patrol read" type of function that hardware RAID controllers have
been capable of for many years.
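To make the idea concrete, here is a minimal sketch of what a patrol
read does, in Python. The Disk class and the injected latent bad
sectors are made up purely for illustration; real controllers do this
in firmware against physical media, not in application code.

```python
# Hedged sketch of the "patrol read" idea: a background pass that reads
# every block so latent media errors are found early, rather than being
# discovered for the first time during a rebuild.

class Disk:
    def __init__(self, blocks, latent_bad=()):
        self.blocks = blocks
        self.latent_bad = set(latent_bad)  # bad sectors nobody has read yet

    def read(self, block):
        # Returns False to signal a media error on that block.
        return block not in self.latent_bad

def patrol_read(disk):
    """Scan every block; return the latent errors discovered."""
    return [b for b in range(disk.blocks) if not disk.read(b)]

disk = Disk(blocks=1000, latent_bad=[42, 617])
print("latent errors found:", patrol_read(disk))  # [42, 617]
```

The point of the sketch is only that the errors at blocks 42 and 617
existed before the scan ran; the scan merely moved their detection
earlier, which is exactly what a patrol read or a ZFS scrub buys you.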
It is a very rare occurrence that any of our customers experience data
loss with redundant disk configurations. In the few cases where there
has been loss, in almost all cases there was no automated software
detection of the errors. Drives were handled by hardware RAID or in SAN
systems but without any notification implemented. The drive error
lights were likely to have been lit but unnoticed for weeks, possibly
months before the data loss occurred.
Disk drives have a circuit board and are made to rotate and seek.
Compared to what drive manufacturers consider normal operating
characteristics for a drive, a rebuild does not cause electrical or
mechanical stress that is unusual or in any way excessive.
If a drive failure is detected during a rebuild it's much more likely
that the failure occurred prior to the rebuild and was detected during
the rebuild. Again, this is the basis for the patrol read idea and
things like a ZFS scrub.
RAID 10 is considered best practice because it can tolerate multiple
drive failures, which matters given drives' unrecoverable read error
rates. That holds even for new drives, and without regard to the
similarity of drives from the same batch.
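A back-of-the-envelope calculation shows why URE rates drive this
recommendation. The numbers below are assumptions for illustration, not
measurements: a spec-sheet URE rate of 1 error per 1e14 bits (a common
consumer-drive figure) and an 8 TB drive read end to end, as a
single-parity rebuild would do.

```python
# Rough odds of hitting at least one unrecoverable read error (URE)
# while reading an entire drive, e.g. during a RAID 5 rebuild.
# Assumed inputs, taken from typical spec sheets, not from measurement.
URE_RATE = 1e-14      # assumed errors per bit read
DRIVE_BYTES = 8e12    # assumed 8 TB drive
bits_read = DRIVE_BYTES * 8

# Treat bit errors as independent -- a simplification, and note this
# thread itself argues failures are often NOT independent.
p_clean = (1.0 - URE_RATE) ** bits_read
p_at_least_one = 1.0 - p_clean
print(f"P(>=1 URE during full-drive read) ~ {p_at_least_one:.2f}")
```

Under those assumptions the chance of at least one URE on a full-drive
read is close to a coin flip, which is why configurations that survive
a second error (RAID 10, RAID 6, mirrored vdevs) are preferred for
large modern drives.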