Hardware or OS problem? System Crashing...

Anthony Atkielski atkielski.anthony at wanadoo.fr
Thu Jan 6 19:21:08 PST 2005


I had a very similar problem over the holidays. After a power failure
over a month ago, I noticed some anomalies in FreeBSD, but they were
very insidious and didn't seem like hardware (and the system was on a
UPS plus a surge protector, so I didn't think the power failure alone
could have done damage, unless the power cycled many times over a short
period).
I'd get strange faults in programs from time to time, usually some type
of memory fault--most often in Apache (since it uses most of the
processor time), but sometimes in system programs that had never given trouble
before. As time passed, the system would occasionally freeze, or I would
even get kernel panics. There never seemed to be any information left
behind that could help me find out why the system was crashing (fault
type, processes running, etc.), and error messages in logs were scarce.
(If there is a way to debug FreeBSD crashes without running a kernel
specifically set up for the purpose, I'd like to know what it is.)
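
For what it's worth, here is a minimal sketch of one approach. A stock
kernel can usually be told to write a crash dump to the swap partition
when it panics, and savecore(8) copies that dump out at the next boot.
The device name below is purely a placeholder (an assumption, not taken
from the system described here); substitute the swap partition that
swapinfo reports. How much can be read from the dump afterwards will
still depend on whether kernel debugging symbols are available.

```shell
# /etc/rc.conf fragment (sketch only; /dev/ad0s1b is a hypothetical swap device)
dumpdev="/dev/ad0s1b"   # device the kernel writes the crash dump to on panic
dumpdir="/var/crash"    # directory where savecore(8) stores the dump at next boot
```

After a panic and reboot, the saved dump should show up under /var/crash
as vmcore.N, alongside a small info file describing the panic.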

Anyway, I suspected a virus--I had seen a virus infection on the Web
server, but it had apparently never been activated because the firewall
prevented it from "calling home." FreeBSD had never faulted before, so
I ruled out the OS (it would not _suddenly_ develop a bug). I
reinstalled everything just to see. It wasn't until I reinstalled and
upgraded to FreeBSD 5.3 and got even more frequent mystery crashes that
I felt sure that hardware was causing a problem.

It turned out that (I think) something had been damaged before or during
the power failures. A motherboard failure earlier on had turned off the
CPU fan. The fan itself worked, but the motherboard had stopped powering
it, so it wasn't running. The AMD processor stayed cool enough to operate
most of the time because the system is very lightly loaded processor-wise.
However, at some point, something got the system into a tight loop, and
the processor reached something above 120° C (around 300° F at one
point, I think--I could _smell_ the system when I got into the room).
Amazingly, it still ran most of the time, but I think some part of the
virtual memory logic was damaged, because most of the mystery faults
were segment violations. The problem very gradually got worse, with the
OS faulting more and more often, until it eventually got so bad that it
would fault before the boot process completed.

I finally replaced the entire machine--this time with _seven_ fans, and
with an Intel processor that will simply shut down if it gets too hot,
instead of cooking itself to death. I also upgraded to FreeBSD 5.3, and
I updated all the other system software as well. There have been no
problems since ... except for a panic in sysinstall during the first
installation, which I think was an honest-to-goodness OS bug (it
happened only once, and reminded me vaguely of a similar problem on my
first installation of 4.3, years earlier). The gigabit Ethernet on the
motherboard doesn't work reliably under FreeBSD, though, so I just
reinstalled the 100 Mbps card from the old server, which works perfectly.

In summary, this was a hardware problem, but so subtle in the beginning
that it wasn't at all clear that hardware was at fault--for a long time
I suspected traces of a virus infection or something.

Obviously, running Linux would not have made any difference.  I did see
filesystem corruption after the panics, which was to be expected, but as
far as I know I never lost any actual data; fsck corrected the structure
errors each time (sometimes from single-user mode, since it wouldn't
always succeed in automatic checks).  No OS can guarantee against data
corruption on unreliable hardware, not even all-knowing, all-seeing
Linux.

Maybe you need a new sysadmin.

--
Anthony




More information about the freebsd-questions mailing list