Re: Raid 1+0

Tue Apr 19 12:10:53 UTC 2016

Why do drive failures come in pairs?

[The following is based on Linux experience when the largest drives were 
300GB - I think ZFS will do much better.]

Most of the drives we have claim an MTBF of 500,000 hours. That's about 2% 
per year. With three drives the chance of at least one failing is a little 
less than 6% (1-(1-.02)^3). Our experience is that such numbers are at 
least a reasonable approximation of reality (but see Schroeder and Gibson, 
2007).
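
As a rough check of that arithmetic, here is a small Python sketch. It 
assumes a constant failure rate and independent drives, which is what an 
MTBF figure implies but not necessarily how real drives behave. The 
function names are only for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760

def annual_failure_prob(mtbf_hours):
    """Chance that one drive fails within a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

def prob_any_failure(p_single, n_drives):
    """Chance that at least one of n independent drives fails."""
    return 1.0 - (1.0 - p_single) ** n_drives

p_mtbf = annual_failure_prob(500_000)   # a bit under 2% per drive per year
p_three = prob_any_failure(0.02, 3)     # a bit under 6% for three drives
print(f"per drive: {p_mtbf:.4f}, any of three: {p_three:.4f}")
```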

Suppose you have three drives in a RAID 5. If it takes 24 hours to replace 
and reconstruct a failed drive, one is tempted to calculate that the 
chance of one of the two surviving drives failing before full redundancy 
is established is about 2 x .02/365, or about one in ten thousand. The 
total probability of a double failure seems like it should be about 6 in 
a million per year.
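
The tempting calculation can be written out as a short Python sketch. It 
counts both surviving drives during the rebuild window and treats 
failures as independent; both are assumptions of the sketch, and the 
variable names are made up for illustration.

```python
P_DRIVE_YEAR = 0.02    # annual failure chance per drive, from the MTBF figure
N_DRIVES = 3
REBUILD_DAYS = 1.0     # assumed time to replace and reconstruct a drive

# Chance that some drive in the array fails during the year.
p_first = 1.0 - (1.0 - P_DRIVE_YEAR) ** N_DRIVES

# Chance that either surviving drive fails during the rebuild window.
p_second = (N_DRIVES - 1) * P_DRIVE_YEAR * REBUILD_DAYS / 365.0

# Naive estimate: a first failure followed by a second one during the rebuild.
p_double = p_first * p_second

print(f"first failure:                 {p_first:.4f}")
print(f"second failure during rebuild: {p_second:.2e}")
print(f"naive double failure per year: {p_double:.2e}")
```

The result comes out at roughly 6 in a million per year, which is the 
figure our actual experience contradicts.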

Our double failure rate is worse than that - many single drive failures 
are followed by a second drive failure before redundancy is 
re-established. This prevents rebuilding the array with a new drive 
replacing the original failed drive; however, you can probably recover 
most files if you stay in degraded mode and copy the files to a different 
location. It 
isn't that failures are correlated because drives are from the same batch, 
or the controller is at fault, or the environment is bad (common 
electrical spike or heat problem). The fault lies with the Linux md 
driver, which stops rebuilding parity after a drive failure at the first 
point it encounters an uncorrectable read error on the remaining "good" 
drives. Of course with two drives unavailable, there isn't an unambiguous 
reconstruction of the bad sector, so it might be best to go to the backups 
instead of continuing. At least that is apparently the reason for the 
decision.

Alternatively, if the first failed drive is still readable on that sector 
(even if it cannot read some other sectors), it should be possible to fully 
recover all the data with a high degree of confidence even if a second 
drive is failed later. Since that is far from an unusual situation (a 
drive will be failed for a single uncorrectable error even if further 
reads are possible on other sectors) it isn't clear to us why that isn't 
done. [Lack of a slot for the bad drive?] Even if that sector isn't 
readable, logging the bad block, writing something recognizable to the 
targets, and going on might be better than simply giving up.
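
To make the alternatives concrete, here is a toy Python model of a 
three-drive XOR-parity rebuild. It is only a sketch of the policies 
discussed above, not how md or any controller is actually implemented. 
The names (rebuild, BAD, MARKER) and the in-memory "drives" are invented 
for illustration.

```python
from functools import reduce

BAD = None                     # stands in for an uncorrectable read error
MARKER = b"\xde\xad\xbe\xef"   # recognizable filler for lost blocks (arbitrary)

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, as RAID 5 parity does."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def rebuild(survivors, failed=None, continue_on_error=False):
    """Reconstruct the missing drive stripe by stripe.

    survivors: the remaining "good" drives, each a list of blocks.
    failed:    optionally, the old failed drive, if it is still attached.
    Returns (rebuilt_blocks, bad_stripes).
    """
    rebuilt, bad_stripes = [], []
    for stripe, blocks in enumerate(zip(*survivors)):
        if all(b is not BAD for b in blocks):
            rebuilt.append(xor_blocks(blocks))
        elif failed is not None and failed[stripe] is not BAD:
            rebuilt.append(failed[stripe])   # old drive can still read this block
        elif continue_on_error:
            bad_stripes.append(stripe)       # log the bad stripe and go on
            rebuilt.append(MARKER)
        else:
            raise OSError(f"unreadable block in stripe {stripe}, rebuild aborted")
    return rebuilt, bad_stripes

# Toy array: four stripes of four bytes. XOR of any two drives gives the third.
drive_a = [bytes([i] * 4) for i in range(4)]
drive_b = [bytes([i + 16] * 4) for i in range(4)]
drive_c = [xor_blocks(pair) for pair in zip(drive_a, drive_b)]

# drive_c was failed for one bad block in stripe 0;
# drive_b has a latent bad sector in stripe 2.
old_c = list(drive_c)
old_c[0] = BAD
damaged_b = list(drive_b)
damaged_b[2] = BAD

# Policy 1: continue on error, losing only the affected stripe.
partial, lost = rebuild([drive_a, damaged_b], continue_on_error=True)
print("stripes lost without the old drive:", lost)

# Policy 2: consult the old failed drive for the stripe the survivors cannot read.
full, lost_with_old = rebuild([drive_a, damaged_b], failed=old_c)
print("fully recovered with the old drive:", full == drive_c)
```

With the default arguments (no old drive, no continue on error) the same 
call raises at stripe 2, which corresponds to the give-up behaviour 
described above.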

A single unreadable sector isn't unusual among the tens of millions of 
sectors on a modern drive. If the sector has never been written to, there 
is no occasion for the drive electronics or the OS to even know it is bad. 
If the OS tried to write to it, the drive would automatically remap the 
sector and no damage would be done - not even a log entry. But that one 
bad sector will render the entire array unrecoverable no matter where on 
the disk it is if one other drive has already been failed.

Let's repeat the reliability calculation with our new knowledge of the 
situation. In our experience perhaps half of drives have at least one 
unreadable sector in the first year. Again assume a 6 percent chance of a 
single failure. The chance of at least one of the remaining two drives 
having a bad sector is 75% (1-(1-.5)^2). So the RAID 5 failure rate is 
about 4.5%/year, which is .5% MORE than the 4% failure rate one would 
expect from a two drive RAID 0 with the same capacity. Alternatively, if 
you just had two drives with a partition on each and no RAID of any kind, 
the chance of a failure would still be 4%/year but only half the data loss 
per incident, which is considerably better than the RAID 5 can even hope 
for under the current reconstruction policy even with the most expensive 
hardware.
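
The revised comparison is easy to reproduce. The sketch below uses the 
same rough inputs as the text (2% annual drive failure, half of drives 
with at least one unreadable sector) and assumes independence throughout. 
It ignores the much smaller chance of a true second drive failure during 
the rebuild. Unrounded, the figures come out slightly below the 4.5% and 
4% quoted above, which use the rounded 6% and 4%.

```python
P_DRIVE_YEAR = 0.02   # annual chance that a drive fails outright
P_BAD_SECTOR = 0.5    # chance a drive has at least one unreadable sector (rough experience)

def at_least_one(p, n):
    """Chance that at least one of n independent events of probability p occurs."""
    return 1.0 - (1.0 - p) ** n

# Three-drive RAID 5: one drive fails, then a survivor turns out to have a bad sector.
p_raid5 = at_least_one(P_DRIVE_YEAR, 3) * at_least_one(P_BAD_SECTOR, 2)

# Two-drive RAID 0 of the same capacity: either drive failing loses everything.
p_raid0 = at_least_one(P_DRIVE_YEAR, 2)

print(f"RAID 5 (3 drives): {p_raid5:.2%} per year")
print(f"RAID 0 (2 drives): {p_raid0:.2%} per year")
```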

The 3ware controller has a "continue on error" rebuild policy available 
as an option in the array setup. But we would really like to know more 
about just what that means. What do the apparently similar RAID 
controllers from Mylex, LSI Logic and Adaptec do about this? A look at 
their web sites reveals no information. For some time now we have stuck 
with software RAID, because it renders the drives pretty much hardware 
independent and there doesn't appear to be much of a performance loss.

daniel feenberg