RAID 1+0

Michael Powell nightrecon at hotmail.com
Tue Apr 19 19:51:30 UTC 2016


Steve O'Hara-Smith wrote:

> On Mon, 18 Apr 2016 17:05:22 -0500 (CDT)
> "Valeri Galtsev" <galtsev at kicp.uchicago.edu> wrote:
> 
>> Not correct. First of all, in most of the cases, failure of each of the
>> drives is an independent event
> 
> If only that were so. When the drives are as near identical as
> manufacturing can make them and have had very similar histories they can
> be expected to have very similar wear and be similarly close to failure at
> all times, which makes it likely that the load imposed by one failing will
> push another over.
> 

And the more of them you place in the same physical enclosure, the more
their vibration patterns, plus any platter skew away from perfectly
horizontal or perfectly vertical mounting, generate complex interference
patterns. The vibrational characteristics of the enclosure matter. In
airframe superstructure testing, vibration sensors (think seismology) are
scattered throughout the frame, and then something resembling a gun or an
air hammer is used to bang on a point in order to map out how the
resulting vibration flows through the airframe. (Not my field of endeavor;
something I learned from my dad.)

I'm certainly not qualified to debate probability theory. My experience is
anecdotal at best, but many sysadmins have witnessed various forms of drive
failure in RAID arrays. Most have noticed over the years that failures seem
to occur most often when all the drives come from the same manufacturing
batch run and lot number. After enough of these, a sysadmin will respond by
shuffling drives so they are not all from the same shipment, and, when one
does fail, by swapping it out ASAP before another goes and the whole array
is lost.

Another pattern is simple age. I've seen drives that had run for so many
years that everyone assumed they were OK. Power them down, and poof - just
like that they don't come back. I've had arrays where one drive failed and,
when the array was powered down, some of the other drives would not come
back up after power-on. The answer to this is a hot spare plus hot swap.
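The win from the spare is plain arithmetic. A tiny sketch, with every
figure invented purely for illustration:

    # Same caveat: all numbers below are hypothetical.

    notice_hours = 12   # hypothetical: until a human sees the alert
    swap_hours = 24     # hypothetical: to source and install a drive
    rebuild_hours = 8   # hypothetical: for the array to rebuild

    without_spare = notice_hours + swap_hours + rebuild_hours
    with_spare = rebuild_hours  # rebuild onto the spare starts at once

    print(f"degraded window without hot spare: {without_spare} hours")
    print(f"degraded window with hot spare:    {with_spare} hours")

The hot spare removes the human from the critical path; hot swap means you
never have to power the old drives down to replace the failed one.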

Anecdotal experience is no substitute for rigorous scientific proof. Most
sysadmins are not concerned with such things, but rather with keeping
servers running and data flowing, almost to the point of superstition.
Whatever works - use it.

-Mike



