Re: Raid 1+0
Daniel Feenberg
feenberg at nber.org
Tue Apr 19 12:10:53 UTC 2016
Why do drive failures come in pairs?
[The following is based on Linux experience when the largest drives were
300GB - I think ZFS will do much better.]
Most of the drives we have claim an MTBF of 500,000 hours. That's about 2% per
year. With three drives the chance of at least one failing is a little
less than 6% (1-(1-.02)^3). Our experience is that such numbers are at
least a reasonable approximation of reality (but see Schroeder and
Gibson, 2007).
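Here is that arithmetic as a quick Python sketch; the 2% per drive-year
figure is the rounding of the quoted MTBF, and failures are assumed to be
independent:

    # Back-of-the-envelope drive reliability, assuming independent failures.
    # 500,000 hours MTBF works out to 8,760 / 500,000 ~= 1.75% per drive-year,
    # rounded up to 2% here.
    p_single = 0.02                           # annual failure probability per drive
    p_any_of_three = 1 - (1 - p_single) ** 3  # ~5.9%, "a little less than 6%"
    print(f"at least one of three drives fails: {p_any_of_three:.2%} per year")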
Suppose you have three drives in a RAID 5. If it takes 24 hours to replace
and reconstruct a failed drive, one is tempted to calculate that the
chance of one of the two remaining drives failing before full redundancy
is established is about 2 x .02/365, or roughly one in ten thousand. The
total probability of a double failure seems like it should be about 6 in a
million per year.
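The same naive calculation in Python, with the 24-hour rebuild window from
the text and the same independence assumption:

    # Naive RAID 5 double-failure estimate: the only exposure assumed is the
    # 24-hour window while the failed drive is replaced and reconstructed.
    p_single = 0.02                        # per drive, per year
    p_first = 1 - (1 - p_single) ** 3      # some drive in the array fails, ~5.9%
    p_second = 2 * p_single / 365          # either survivor fails within a day
    p_double = p_first * p_second          # ~6.4e-6, "about 6 in a million"
    print(f"naive double-failure rate: {p_double:.1e} per year")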
Our double failure rate is worse than that - many of the single drive
failures are followed by a second drive failure before redundancy is
established. This prevents rebuilding the array with a new drive replacing
the original failed drive, although you can probably recover most files if
you stay in degraded mode and copy them to a different location. It
isn't that the failures are correlated because the drives are from the same
batch, or the controller is at fault, or the environment is bad (a common
electrical spike or heat problem). The fault lies with the Linux md
driver, which, after a drive failure, stops rebuilding parity at the first
point it encounters an uncorrectable read error on the remaining "good"
drives. Of course, with two drives unavailable there isn't an unambiguous
reconstruction of the bad sector, so it might be best to go to the backups
instead of continuing. At least that is apparently the reason for the
decision.
Alternatively, if the first drive to be failed is still readable at that
sector (even if it cannot read some other sectors), it should be possible
to fully recover all the data with a high degree of confidence even if a
second drive is failed later. Since that is far from an unusual situation
(a drive will be failed for a single uncorrectable error even if further
reads are possible on other sectors), it isn't clear to us why that isn't
done. [Lack of a slot for the bad drive?] Even if that sector isn't
readable, logging the bad block, writing something recognizable to the
targets, and going on might be better than simply giving up.
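As a purely conceptual sketch (this is not the md driver's code; the block
size, disk contents, and helper names below are invented for illustration),
the difference between the two rebuild policies looks something like this:

    # Toy rebuild of the failed member of a 3-drive RAID 5 onto a spare,
    # contrasting "abort on first read error" with "continue on error".

    BLOCK = 4  # bytes per block in this toy example

    class UnreadableSector(Exception):
        pass

    def read_block(disk, n):
        if disk[n] is None:        # None stands in for an uncorrectable read error
            raise UnreadableSector(n)
        return disk[n]

    def rebuild(survivor_a, survivor_b, continue_on_error):
        """Reconstruct the missing member as the XOR of the two survivors."""
        spare, bad_blocks = [], []
        for n in range(len(survivor_a)):
            try:
                a, b = read_block(survivor_a, n), read_block(survivor_b, n)
                spare.append(bytes(x ^ y for x, y in zip(a, b)))
            except UnreadableSector:
                if not continue_on_error:
                    raise          # the behavior complained about above: give up
                bad_blocks.append(n)      # log it, write something recognizable
                spare.append(b"\xee" * BLOCK)
        return spare, bad_blocks

    # Two surviving members; block 2 of the second one is unreadable.
    good = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
    flaky = [b"1111", b"2222", None, b"4444"]

    spare, bad = rebuild(good, flaky, continue_on_error=True)
    print("rebuilt:", spare, "bad blocks logged:", bad)

With continue_on_error=True the rebuild completes and merely reports block 2
as lost; with False it gives up over a single unreadable block, which is the
situation described above.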
A single unreadable sector isn't unusual among the tens of millions of
sectors on a modern drive. If the sector has never been written to, there
is no occasion for the drive electronics or the OS to even know it is bad.
If the OS tried to write to it, the drive would automatically remap the
sector and no damage would be done - not even a log entry. But that one
bad sector will render the entire array unrecoverable, no matter where on
the disk it is, if another drive has already been failed.
Let's repeat the reliability calculation with our new knowledge of the
situation. In our experience perhaps half of drives have at least one
unreadable sector in the first year. Again assume a 6 percent chance of a
single failure. The chance of at least one of the remaining two drives
having a bad sector is 75% (1-(1-.5)^2). So the RAID 5 failure rate is
about 4.5%/year (.06 x .75), which is half a percentage point MORE than the
4% failure rate one would expect from a two-drive RAID 0 of the same
capacity. Alternatively, if you just had two drives with a partition on each
and no RAID of any kind, the chance of a failure would still be 4%/year, but
with only half the data lost per incident - considerably better than the
RAID 5 can hope for under the current reconstruction policy, even with the
most expensive hardware.
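The revised arithmetic in Python, using the figures assumed above (a 6%
chance of a first failure, and half of drives having at least one
unreadable sector):

    # Revised estimate: the rebuild fails if either surviving drive has even
    # one unreadable sector, under the stop-on-error policy described above.
    p_single = 0.02                    # annual failure probability per drive
    p_first = 0.06                     # some drive in the array is failed, per year
    p_bad_sector = 0.5                 # a drive has at least one unreadable sector

    p_rebuild_fails = 1 - (1 - p_bad_sector) ** 2   # 75%: a survivor has a bad sector
    p_raid5_loss = p_first * p_rebuild_fails        # 4.5% per year
    p_raid0_loss = 1 - (1 - p_single) ** 2          # ~4% per year, two-drive RAID 0

    print(f"RAID 5 with stop-on-error rebuild: {p_raid5_loss:.1%} per year")
    print(f"two-drive RAID 0:                  {p_raid0_loss:.1%} per year")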
The 3ware controller has a "continue on error" rebuild policy available
as an option in the array setup. But we would really like to know more
about just what that means. What do the apparently similar RAID
controllers from Mylex, LSI Logic and Adaptec do about this? A look at
their web sites reveals no information. For some time
now we have stuck with software RAID, because it renders the drives pretty
much hardware-independent and there doesn't appear to be much of a
performance loss.
daniel feenberg