FreeBSD machines dying, we've tried so much...
matt at atopia.net
Wed Jun 22 15:59:14 GMT 2005
>The vast majority of panics are hardware-related. It is rare nowadays
>for a usermode program to make the system panic. In particular you said
>the problem happens more under load. That really points even more to a
>hardware problem - bad CPU cache ram, bad ram, scsi termination, that
>sort of thing.
This is kind of going to be a blanket post to all the recent suggestions
to me. I appreciate suggestions :) Ted, sorry, my other posts had
dmesg and hardware specs, etc. I just couldn't remember the subject line
of that thread. I'll be more descriptive here.
We have two different servers crashing. Both are SMP, but on different
hardware. We have five FreeBSD servers in total, and only two are
affected. That is why I do not believe this is a hardware problem.
In any case, the machines are in a cold room where the temperature is
constantly maintained. 20 other servers in there are perfectly stable,
with no probs.
This particular machine that crashed last night while running portsdb
-uU is a Super Micro machine with hyperthreading disabled in the BIOS,
dual 3.06 GHz CPUs, and 4 GB of memory. We ran a memory test on orion
(the machine that crashed last night) a week or so ago, and it found
70,000 ECC errors. Those were fixed, and that machine had been stable
until last night. I've now disabled SMP support; we'll see if that keeps
it stable or not. portsdb -uU ran without problems after I disabled SMP.
As far as uranus, the other box (we keep a planet scheme for a certain
set of servers), we ran memtest86 and found no errors at all. That box
crashed about two days ago but has been stable since. It has not lasted
more than a week without doing a kernel trap and freezing.
It seems that both these servers have this problem. Out of the five
FreeBSD servers we have, these two are the ones with the highest load.
Maybe a higher load on the other three servers would cause the same
problem. I agree with you that this looks like a hardware problem, but
seeing it on more than one server, on two different architectures, and
only on our highest-load machines makes me reconsider.
If this is truly a bug in FreeBSD 5.4-RELEASE, maybe this is something
that has been fixed in -stable? I will compile a debug kernel today and
try to provide a trace of the problem. I'll do it on whichever server