jjohnstone.nospamfreebsd at tridentusa.com
Wed Apr 27 16:39:48 UTC 2016
On 4/19/2016 8:59 PM, Kevin P. Neal wrote:
> "Disk failures in the real world:
> What does an MTTF of 1,000,000 hours mean to you?"
> by Bianca Schroeder and Garth A. Gibson
> In FAST'07: 5th USENIX Conference on File and Storage Technologies, San Jose, CA, Feb. 14-16, 2007
> See especially section 5 starting on page 10: "Statistical properties of
> disk failures"
> There. Data has been provided. Just look at that paper. Now, let's see
> your real, hard data that drive failures in an array are reliably independent.
> And remember that the plural of "anecdote" is not "data".
This thread was so long that picking this one to reply to was somewhat
arbitrary.
I don't have any "data" as such, just observations based on my
experience. My company is a third-party maintenance provider. Nearly
all our customers have name-brand hardware from (in alphabetical order,
not percentage-wise) Dell, HP, IBM, Sun and various storage vendors.
Nearly all are server and storage systems, not desktops or laptops. We
have customers with just a few systems and customers with many systems.
Nearly all drives are in redundant configurations under hardware RAID
or SAN controllers. Many systems have just a single mirrorset and there
are systems with several shelves.
Our customers dictate when we replace drives after a failure. The
customers with a few systems don't have rigorous, well-defined policies
that establish when a drive replacement is to take place. The customers
with many systems do have such policies. Nearly all customers are fine
with waiting up to a week to do a drive replacement. This includes
customers that are uptime critical during normal business hours and
could do drive replacements during the week but instead feel it is safer
to wait until a Friday night to do the swap. I'm relating this to give
some perspective about the relative urgency of replacement after a failure.
When a customer opens a call for a drive replacement, for example in a
drive shelf that has 24 drives, we don't typically see another call for
a replacement in that same shelf within any noticeable period of time.
My definition of noticeable is within the next several months. This is
true for systems with just a single mirrorset as well.
There is an opinion that the rebuild after a replacement increases
"load" enough to cause a second drive failure. The term
"drive failure" here really refers to the detection of the
failure. In a light-usage scenario with little I/O, the second
drive failure that's detected may actually have occurred long before the
first failure was detected. As others have said, this is the basis for
a "patrol read" type of function that hardware RAID controllers have
been capable of for many years.
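To make the idea concrete, here is a minimal sketch of what a patrol
read does, in Python. The Disk class and the injected latent bad
sectors are made up purely for illustration; real controllers do this
in firmware against physical media, not in application code.

```python
# Hedged sketch of the "patrol read" idea: a background pass that reads
# every block so latent media errors are found early, rather than being
# discovered for the first time during a rebuild.

class Disk:
    def __init__(self, blocks, latent_bad=()):
        self.blocks = blocks
        self.latent_bad = set(latent_bad)  # bad sectors nobody has read yet

    def read(self, block):
        # Returns False to signal a media error on that block.
        return block not in self.latent_bad

def patrol_read(disk):
    """Scan every block; return the latent errors discovered."""
    return [b for b in range(disk.blocks) if not disk.read(b)]

disk = Disk(blocks=1000, latent_bad=[42, 617])
print("latent errors found:", patrol_read(disk))  # [42, 617]
```

The point of the sketch is only that the errors at blocks 42 and 617
existed before the scan ran; the scan merely moved their detection
earlier, which is exactly what a patrol read or a ZFS scrub buys you.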
It is a very rare occurrence that any of our customers experience data
loss with redundant disk configurations. In the few cases where there
has been loss, in almost all cases there was no automated software
detection of the errors. Drives were handled by hardware RAID or in SAN
systems but without any notification implemented. The drive error
lights were likely to have been lit but unnoticed for weeks, possibly
months before the data loss occurred.
Disk drives have a circuit board and are made to rotate and seek.
Compared to what drive manufacturers consider normal operating
characteristics for a drive, a rebuild does not cause electrical or
mechanical stress that is unusual or in any way excessive.
If a drive failure is detected during a rebuild it's much more likely
that the failure occurred prior to the rebuild and was detected during
the rebuild. Again, this is the basis for the patrol read idea and
things like a ZFS scrub.
RAID 10 is considered best practice because it can tolerate multiple
drive failures, which matters given drives' unrecoverable read error
rates. That holds even for new drives, and without regard to the
similarity of drives from the same batch.
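A back-of-the-envelope calculation shows why URE rates drive this
recommendation. The numbers below are assumptions for illustration, not
measurements: a spec-sheet URE rate of 1 error per 1e14 bits (a common
consumer-drive figure) and an 8 TB drive read end to end, as a
single-parity rebuild would do.

```python
# Rough odds of hitting at least one unrecoverable read error (URE)
# while reading an entire drive, e.g. during a RAID 5 rebuild.
# Assumed inputs, taken from typical spec sheets, not from measurement.
URE_RATE = 1e-14      # assumed errors per bit read
DRIVE_BYTES = 8e12    # assumed 8 TB drive
bits_read = DRIVE_BYTES * 8

# Treat bit errors as independent -- a simplification, and note this
# thread itself argues failures are often NOT independent.
p_clean = (1.0 - URE_RATE) ** bits_read
p_at_least_one = 1.0 - p_clean
print(f"P(>=1 URE during full-drive read) ~ {p_at_least_one:.2f}")
```

Under those assumptions the chance of at least one URE on a full-drive
read is close to a coin flip, which is why configurations that survive
a second error (RAID 10, RAID 6, mirrored vdevs) are preferred for
large modern drives.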