FreeBSD 6.x CVSUP today crashes with zero load ...

Mon Jun 26 22:57:29 UTC 2006

On Tue, 27 Jun 2006, M.Hirsch wrote:
>> On Mon, 26 Jun 2006, M.Hirsch wrote:
>>> ECC is a way to mask broken hardware. I rather have my hardware fail 
>>> directly when it does first, so I can replace it _immediately_
>> 
>>
>>  You got it backwards. If your data has any value to you, then you don't 
>> 
> Nope, I am right on track.
> I do not want to lose any data. So I'd prefer an ECC error to raise a panic so 
> I can replace the hardware ASAP.

  When you wrote "ECC is a way to mask broken hardware", you were plain wrong.
If you're using hardware w/o ECC, it simply can't tell whether an error is
present or not. So ECC _is_ the way to detect (not mask) broken hardware.
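
  To make the point concrete, here is a toy sketch in Python. It is only an
illustration: it uses a Hamming(7,4) code plus one overall parity bit (SECDED)
on a 4-bit value, whereas real memory controllers work on much wider words,
and the names encode() and decode() are made up for this example. Without the
check bits, a flipped bit just yields another valid-looking value; with them,
a single-bit error is corrected _and reported_, and a double-bit error is
flagged as uncorrectable.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the parity of the whole word even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: exactly one bit flipped, the syndrome names it
        # (syndrome 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity but non-zero syndrome: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1                        # simulate a single-bit memory error
print(decode(one_flip))

two_flips = list(one_flip)
two_flips[2] ^= 1                       # a second error in the same word
print(decode(two_flips))
```

  Note that decode() reports "corrected" rather than silently fixing the word;
that status is what lets you log the event and schedule a replacement.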

  If you want the ECC logic to raise an NMI on corrected errors (as well as
uncorrectable ones), just set the appropriate bit in the control register;
every ECC-capable Intel chipset allows it. But in a production environment,
such behaviour (abnormal termination on a _corrected_ error) is unacceptable.

> Don't get me wrong, but tracking bugs in FreeBSD is quite more of an effort 
> than "just" acquiring a new box...

  I don't see the connection between this sentence and ECC (which is a
hardware option).

> Does the standard fs, UFS2, do "extra sanity checks", then?

  Ditto. And don't forget that _every_ data sector on an HDD _is_ protected
by a checksum (CRC/ECC), as are ATA data transfers in UDMA modes (CRC) and
data in the CPU cache (parity or ECC). An extra check gives extra reliability.
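
  A minimal sketch of what such a check buys you, assuming Python's zlib
CRC-32 as a stand-in (drives and the UDMA protocol use their own codes, and
store() and verify() are hypothetical helper names): the checksum is kept next
to the data, and any single flipped bit makes verification fail instead of
passing unnoticed.

```python
import zlib


def store(payload: bytes):
    """Return the payload together with its CRC-32, as if written to a sector."""
    return payload, zlib.crc32(payload)


def verify(payload: bytes, crc: int) -> bool:
    """Recompute the CRC-32 on read and compare it with the stored value."""
    return zlib.crc32(payload) == crc


data, crc = store(b"some block of file system data")

damaged = bytearray(data)
damaged[3] ^= 0x10                      # simulate one bit flipped in transit

print(verify(data, crc))                # intact data
print(verify(bytes(damaged), crc))      # damaged data
```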

Sincerely, Dmitry
-- 
Atlantis ISP, System Administrator
e-mail:  dmitry at atlantis.dp.ua
nic-hdl: LYNX-RIPE