SuperMicro P3TDL3-O locking under load with 4.8-REL

J. Seth Henry jshamlet at comcast.net
Wed May 14 13:49:44 PDT 2003


I built a server around this board about 8 months ago.

It has two 1GHz P-III processors and 1GB of RAM in 4 identical Samsung
ECC registered DIMMs. There is an el-cheapo VGA card for console access,
two sound cards, and a digiboard (which is about to be removed).

Originally, both the server and all of the network equipment were
protected by a 750VA SmartUPS (which ensures the power is fairly clean).
The network gear is still on the 750VA UPS, and the server is now on a
650VA Back-UPS Pro (read on, the server was moved). I have a 450W power
supply in the chassis.

The machine was intended to be a combination file/media server, X app
server, home automation controller, and compile mule (being the fastest
FreeBSD box in the house). When I first built it, I ran 4.7-REL on the
system. It also ran folding at home when it was idle.

This was a very stable setup for months, even during (and after) a heat
problem (the A/C failed to come on). The onboard thermal alarm was going
off, but the system was still running - so I manually halted the OS, and
powered it down. It ran non-stop for 3 months after this without so much
as a hiccup. I mention this because the current symptoms are
frighteningly similar to thermal problems I've seen on other systems.

Anyway, the overheat scared me enough to start reshuffling things.
The closet it was stored in averages ~85degF, as long as the A/C is
working - it got above 95degF when the A/C failed. I didn't want to risk
the machine melting down should the air go out again, so I replaced the
home automation system portion with a dedicated ITX based system, and
moved the server to a cooler room, with much better ventilation (average
temperature was 10degF cooler). (as a bonus, the closet temperature
dropped as well - without the server, it runs about 80degF)

OK - so, while I'm shuffling everything around, I figure it would be a
good time to upgrade the OS. I did a binary upgrade to 4.8-REL, installed
KDE 3.1 (to replace the icewm/mozilla combo), and upgraded a few other
packages in the process.

And now it is locking up... No kernel panics, no beeps, nothing. It just
stops. I've actually been typing in a remote xterm, and it's stopped
in mid-word... I've checked the temperature in the room, and in the case
- and both are well within tolerance. I can't check the chip temps,
because the ServerWorks LM78 setup isn't supported in FreeBSD (yet?), but
they don't appear to be too warm to the touch. Heck, the environment is
actually better than it was before!

Since the machine was physically moved, I checked the obvious. I reseated
all of the DIMMs and PCI boards - even the CPUs. I checked all the fans to
make sure they were still functional (they are). The machine appears to be
fine physically. Although I can't check after boot, I used the BIOS to
verify that the power supply voltages were OK as well (they were, though
the 12V line had dipped .13V to 11.87V). The 5, 3.3, and 2.5V supplies
were spot on at 5.07, 3.34, and 2.51. Apparently, the -12 and -5 weren't
deemed important enough to monitor.
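
For a rough sense of how far those BIOS readings sit from nominal, here is
a minimal sketch that computes the percentage deviation of each rail. The
+/-5% limit is an assumption based on the usual ATX tolerance for the +12V,
+5V, and +3.3V rails; the 2.5V line is generated on the board rather than
by the PSU, so applying the same limit to it is purely illustrative. The
function names are made up for this example. Note that these are idle
readings taken in the BIOS, so they say nothing about behavior under load.

```python
# Nominal rail voltage -> value reported by the BIOS (figures from the post).
READINGS = {12.0: 11.87, 5.0: 5.07, 3.3: 3.34, 2.5: 2.51}

# Assumption: +/-5% tolerance, as commonly specified for the positive ATX
# rails. Using it for the board-derived 2.5V line is only illustrative.
TOLERANCE_PCT = 5.0


def deviation_pct(nominal, measured):
    """Return the signed deviation of measured from nominal, in percent."""
    return (measured - nominal) / nominal * 100.0


def check_rails(readings, tolerance_pct=TOLERANCE_PCT):
    """Map each nominal voltage to (deviation in percent, within tolerance?)."""
    results = {}
    for nominal, measured in readings.items():
        dev = deviation_pct(nominal, measured)
        results[nominal] = (dev, abs(dev) <= tolerance_pct)
    return results


if __name__ == "__main__":
    for nominal, (dev, ok) in check_rails(READINGS).items():
        status = "OK" if ok else "OUT OF RANGE"
        print(f"{nominal:>5.2f}V rail: {dev:+.2f}%  {status}")
```

By this measure the 12V dip works out to roughly 1% below nominal, which is
well inside the assumed limit, so the idle readings alone do not point at
the supply.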

So, in summary - The differences:
1) Room temp dropped from 85degF avg to 76degF avg. System spent 8 months
at 85degF (ambient air temp)

2) Went from 4.7-REL to 4.8-REL. System did not lock up in 4.7, despite
adverse conditions - does lock up regularly in 4.8 even with more ideal
conditions.

3) Started serving up KDE 3.1 instead of icewm/mozilla/xterms (fairly
significant increase in network IO)

Although I'm looking for help, I'm going to try "downgrading" the server
to 4.7-REL, and see if that improves the situation. I'm also considering
pulling the drives out, and loading Linux on it, so I can monitor the LM78
subsystem, and put it under some extreme load.

I'm also looking at reducing the load, by stopping folding at home. It ran on
the system the entire time it was in the closet, but I'm desperate to
stabilize this box.

My suspicions, in order:

Power supply - even though the air from the vent isn't unusually warm,
this smells (so to speak) like a power problem. I REALLY wish I could
access the voltage monitoring stuff. (This board can monitor damn near
everything, but FreeBSD doesn't support the monitoring hardware.) Oh well,
looks like it's voltmeter time.

CPU - One or more CPUs are overheating under load, and some internal
thermal protection circuit is kicking in (naturally causing the system to
halt). I would imagine that this, combined with a quirk in the kernel, is
causing the lockups. My guess would be that CPU0 is crapping out, and
since the kernel can only run on the first CPU...

RAM - I bought the best I could afford, but it has to be on the list
anyway. It seems the least likely, though: the board has reported exactly
2 ECC "problems" in nearly 8 months in the BIOS log. However, the RAM
hasn't been independently tested.

As an aside, I know Linux users have a tool to read the ServerWorks LM78
monitoring system. Is there anything in the works for FreeBSD support?
There are monitors for every voltage on the PSU, fan speeds, temperatures
etc - just sitting there waiting to be accessed.

As a SECOND aside, does anyone know of a reputable power supply vendor?
I'm willing to spend some cash for a high quality PSU - just as soon as I
find one. The current supply is an Antec 450W.

Thanks in advance,
Seth Henry


More information about the freebsd-hardware mailing list