galtsev at kicp.uchicago.edu
Tue Apr 19 16:52:21 UTC 2016
On Tue, April 19, 2016 11:16 am, Lowell Gilbert wrote:
> "Valeri Galtsev" <galtsev at kicp.uchicago.edu> writes:
>> Somebody with better knowledge of probability theory will correct me if
>> I'm wrong some place.
> Well, you are assuming that the probabilities of two drives failing
> are entirely independent of each other. The person to whom you are
> responding asserted that this is not the case. Neither of you
> presented any evidence directly to that point.
Correct, neither of us offered proof one way or the other. I, however,
cannot think of any physical mechanism by which the failure of one drive
would lead to the failure of another. That is why I assume the events
are (pretty much) independent.
What can cause drive failure?
1. Purely mechanical reasons: a head broke off, a large area of the platter
surface deteriorated, a new big scratch appeared on the platter, a dirt
particle left inside the drive by the manufacturer came unstuck and started
flying around,... None of these will suddenly affect the other drives
sitting in the same enclosure.
2. Electrical problems: drive electronics burned out,... I safely assume
that other drives, even those sitting on the same power lines, are unlikely
to be affected. It is usually a small brave piece of semiconductor that
burns out and saves everybody else on the same power lines, because the
short circuit behind it becomes disconnected. Burning away a piece of
semiconductor doesn't require an awful amount of power, and there is plenty
where it comes from to feed many drives.
So far I don't see any physical scenario by which the failure of one drive
can change the probability of failure of another drive. To prove the events
are not independent, one needs some physical mechanism responsible for that.
Without that, the events are independent in my opinion (exactly as I have
observed in my server room for over a decade and a half).
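
To put the independence assumption in numbers, here is a minimal Python
sketch. It is only an illustration: the 3% annual failure rate, the 24-hour
rebuild window, the 7 surviving drives, the constant failure rate, and the
function names are all assumptions made for the example, not measurements.
Under independence, the chance that another drive fails during the rebuild
is the same whether or not the first drive has just failed.

```python
import math


def prob_fail_within(hours: float, annual_failure_rate: float) -> float:
    """Probability that one drive fails within `hours`.

    Assumes a constant failure rate (exponential lifetime), which is a
    simplification chosen for this sketch.
    """
    rate_per_hour = -math.log(1.0 - annual_failure_rate) / (365 * 24)
    return 1.0 - math.exp(-rate_per_hour * hours)


def prob_any_of_n_fail(n: int, p: float) -> float:
    """Probability that at least one of n independent drives fails,
    when each fails with probability p."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical numbers: 3% annual failure rate, 24-hour rebuild,
    # 7 surviving drives in the array.
    p_one = prob_fail_within(24, 0.03)
    p_second = prob_any_of_n_fail(7, p_one)
    print(f"one given drive fails during the rebuild: {p_one:.6f}")
    print(f"any of 7 remaining drives fails:          {p_second:.6f}")
```

If failures were correlated, the second figure would have to be replaced by
a conditional probability larger than this, and that is exactly what a
physical mechanism linking the failures would have to provide.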
But if someone observes different things (or even observed them once) in
their server room, I really would like to know all the details. If they
prove me wrong, I will learn something and change my hardware policies to
make our equipment more reliable. I have been keen to get this information
whenever I came across these stories, but so far all "multiple failures"
stories boiled down to one failure that happened long ago, and another
failure that merely triggered the discovery of the older failure, not the
So, "double failure" stories with all the details are something that would
be great to study closely. But if someone can suggest (purely theoretically)
a physical mechanism by which one drive failure can induce (grossly increase
the probability of) another drive failure, it would be really great to hear.
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago