FreeBSD 6.x CVSUP today crashes with zero load ...
Dmitry Pryanishnikov
dmitry at atlantis.dp.ua
Mon Jun 26 23:44:41 UTC 2006
On Tue, 27 Jun 2006, M.Hirsch wrote:
>> If you're using hardware w/o ECC, it just can't tell whether an error is
>> present or absent. So ECC _is_ the way to detect (not mask) broken hardware.
>>
> Ok, thanks. I think I understand the meaning of ECC now.
> So, unlike my supplier claims, ECC is not supposed to help against hardware
> failures.
> But it is the way to detect them, right?
ECC stands for Error Checking and Correction. It's a hardware feature,
and its primary task is Checking (that is, detection) of errors. It just
happens that the number of additional bits which carry the check code is
sufficient to correct _any_ _single-bit_ data error (not mask it, but really
correct it), and to detect any double-bit and most multi-bit errors (w/o
correction).
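
To illustrate the correct-single / detect-double behaviour described above,
here is a minimal sketch in Python of an extended Hamming(8,4) code. It is
only a toy: real ECC memory does this in hardware on much wider words
(typically 64 data bits plus 8 check bits), and the 4-bit word size, the bit
layout, and the names encode/decode are assumptions chosen for illustration.

```python
# Toy SECDED code: 4 data bits + 3 Hamming parity bits + 1 overall parity bit.
# Layout (assumed for this sketch): index 0 = overall parity,
# indices 1, 2, 4 = Hamming parity bits, indices 3, 5, 6, 7 = data bits.
DATA_POS = (3, 5, 6, 7)
PARITY_POS = (1, 2, 4)


def encode(nibble):
    """Return an 8-bit codeword (list of 0/1) for a 4-bit value."""
    bits = [0] * 8
    for k, pos in enumerate(DATA_POS):
        bits[pos] = (nibble >> k) & 1
    # Each parity bit p makes the parity of all positions i with (i & p) even.
    for p in PARITY_POS:
        bits[p] = sum(bits[i] for i in range(1, 8) if i & p) % 2
    # The overall parity bit makes the whole codeword have even parity.
    bits[0] = sum(bits) % 2
    return bits


def decode(bits):
    """Return (status, value); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    # The syndrome is the XOR of the indices of all set bits in 1..7.
    # It is 0 for a valid codeword and equals the index of a single flipped bit.
    syndrome = 0
    for i in range(1, 8):
        if bits[i]:
            syndrome ^= i
    overall = sum(bits) % 2
    if overall:
        # Odd parity: treat as a single-bit error and flip the bit back.
        # (syndrome == 0 means the overall parity bit itself was hit.)
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        # Even parity but non-zero syndrome: two bits flipped.
        # This is detected, but the data cannot be recovered.
        return "uncorrectable", None
    else:
        status = "ok"
    value = sum(bits[pos] << k for k, pos in enumerate(DATA_POS))
    return status, value


word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1
print(decode(one_flip))

two_flips = list(word)
two_flips[3] ^= 1
two_flips[6] ^= 1
print(decode(two_flips))
```

With one flipped bit, decode reports "corrected" and returns the original
value; with two flipped bits it reports "uncorrectable" and returns no data.
Three or more flips can be mistaken for a single-bit error by a code this
small, which is why detection beyond two bits is not guaranteed.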
>> Intel's ECC-capable chipset allows it. But if we're speaking about a
>> production environment, such behaviour (abnormal termination on a
>> _corrected_ error) is unacceptable.
>
> "abnormal termination" is not only acceptable for me, it is what I am looking
> for.
> Make the node crash completely, so one of the others can take over its
> task(s).
Again, when a single-bit correction has happened, it's not fake; the result is
actually correct. Why panic the machine immediately if all the data is OK?
Sincerely, Dmitry
--
Atlantis ISP, System Administrator
e-mail: dmitry at atlantis.dp.ua
nic-hdl: LYNX-RIPE