FreeBSD 6.x CVSUP today crashes with zero load ...

M.Hirsch webmaster at hirsch.it
Mon Jun 26 23:07:39 UTC 2006


Dmitry Pryanishnikov wrote:

> When you wrote "ECC is a way to mask broken hardware", you were plain
> wrong. If you're using hardware w/o ECC, it just can't tell whether an
> error is present or absent. So ECC _is_ the way to detect (not mask)
> broken hardware.
>
Ok, thanks. I think I understand the meaning of ECC now.
So, contrary to what my supplier claims, ECC is not supposed to protect
against hardware failures.
But it is the way to detect them, right?
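
To make the detect-versus-mask distinction concrete for myself, here is a minimal sketch of a toy SECDED code (Hamming(7,4) plus one overall parity bit). This is only an illustration under my own assumptions: real ECC memory does this in hardware on much wider words (typically 64 data bits plus 8 check bits), and the names encode and decode are made up for this example. A single flipped bit is corrected and reported; two flipped bits are reported as uncorrectable. In both cases the error is detected rather than hidden.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    # Index 0 holds the overall parity bit; indexes 1..7 are Hamming positions.
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_mismatch = sum(code) % 2  # 1 if an odd number of bits flipped

    if syndrome == 0 and not overall_mismatch:
        status = "ok"
    elif overall_mismatch:
        # Single-bit error: the syndrome is the index of the bad bit
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)

single = list(word)
single[5] ^= 1          # simulate one flipped bit
print(decode(single))   # corrected, and the event is visible to the caller

double = list(word)
double[5] ^= 1          # simulate two flipped bits
double[6] ^= 1
print(decode(double))   # detected, but cannot be repaired
```

The "corrected" status is the interesting one for this thread: it is the event that a chipset can either log quietly or escalate, depending on how it is configured.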

> If you want the ECC corrector to raise an NMI on corrected errors (as
> well as uncorrectable ones), just set the appropriate bit in the
> control register - every ECC-capable Intel chipset allows it. But if
> we're speaking about a production environment, such behaviour
> (abnormal termination on a _corrected_ error) is unacceptable.

"Abnormal termination" is not only acceptable to me, it is what I am
looking for.
Make the node crash completely, so one of the others can take over its
task(s).

>> Don't get me wrong, but tracking bugs in FreeBSD is quite a bit more
>> of an effort than "just" acquiring a new box...
>
> I don't see the connection between this sentence and ECC (which is a
> hardware option).

What I wanted to say:
Looking for errors in the logs takes only a few seconds.
Finding out what caused them takes hours...
Acquiring a new box is only $29,95 ;) - that's like 30 minutes, if you
look at it from the business side. ... I would rather rent 100 boxes to
do the task of ten than employ 100 admins to find the "real" problem.

Thanks, Dmitry. I think I know what to look for now...

M.


More information about the freebsd-stable mailing list