Matthew Hagerty matthew at
Thu Apr 13 18:17:42 UTC 2006


I'm running 6.0-RELEASE-p5 on a Toshiba-built server: a dual Xeon Intel
motherboard with an LSILogic MegaRAID (amr0) controller.  This machine
has been running for about 2 years now, and was very stable until I
updated from 5.3 to 5.4, and now 6.0.  The crashes seem to be totally
random: the box has gone down after as little as 12 hours and after as
long as 143 days.

When the box goes down it does so in a strange way.  It usually still
responds to network probes like ping, but all console input is ignored.
Some network ports also still respond: a telnet to port 22 yields an
SSH banner, but an actual SSH connection just hangs.  Sometimes this is
also true of the SMTP server, but not always.  This also makes it
impossible for me to use CARP to fail over to the recently purchased
spare machine, since the network interface is generally still
responding, so CARP does not detect a problem.
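
Since ping and a port-22 banner both keep working while the box is
effectively dead, a useful health check has to exercise a complete
operation and treat a hang as a failure.  The following is only a
minimal sketch of that idea in Python, not something running on this
box: the probe() helper is hypothetical, and the ssh command in the
comment uses a placeholder account and host name.

```python
import subprocess


def probe(cmd, timeout=10.0):
    """Return True only if cmd exits with status 0 within timeout seconds.

    A command that hangs (the symptom described above) counts as a
    failure, as does one that cannot be started or exits non-zero.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising this.
        return False
    except OSError:
        # The command could not be started at all.
        return False
    return result.returncode == 0


# Hypothetical usage: a full SSH login that runs a trivial command.
# "probe@server" is a placeholder, not a real account or host.
#   probe(["ssh", "-o", "BatchMode=yes", "probe@server", "true"], timeout=15)
```

Run from a second machine, a check like this would fail as soon as
logins start hanging, even while the interface still answers ping.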

My biggest problem with this is that there are *never* any console
messages or log entries: no warnings about disk failure, buffer
exhaustion, system failures, etc.  The machine simply seems to stop
responding, and the only way to correct the problem is a hard reboot.

A strange thing did happen yesterday, though: I believe I caught the
box on the verge of failure.  I was SSH'd in and ran ps to check things
out.  There were about 100 entries like this one:

55050  ??  D      0:00.00 postmaster: ipa ipa ::1(63061) startup (postgres)

The box runs a web-based app that connects to a local Postgres DB, and
Postgres seemed unable to finish starting the new connections requested
by the PHP scripts.  At any rate, I stopped Apache and then tried to
stop Postgres, which resulted in (or just happened to coincide with) the
box locking up and no longer responding to my SSH commands or attempts
to reconnect with SSH.  I hardly think this is a Postgres problem, but
even if it were, a userland app should *not* be able to bring down a
box...
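
The stuck entries in the ps output above all show state D
(uninterruptible wait, usually on disk I/O), so a growing count of such
processes may be an early warning.  Here is a rough Python sketch for
counting them.  It assumes the five-column layout seen in the entry
above (PID, TT, STAT, TIME, COMMAND); stuck_processes() is a
hypothetical helper, and the sshd line in the sample is invented for
illustration.

```python
def stuck_processes(ps_lines):
    """Return (pid, command) for each ps line whose STAT starts with 'D'."""
    stuck = []
    for line in ps_lines:
        # Assumed columns: PID TT STAT TIME COMMAND (command may have spaces).
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skip the header and any malformed line
        pid, _tt, stat, _time, command = fields
        if stat.startswith("D"):
            stuck.append((int(pid), command))
    return stuck


sample = [
    "  PID  TT  STAT      TIME COMMAND",
    "55050  ??  D      0:00.00 postmaster: ipa ipa ::1(63061) startup (postgres)",
    "  612  ??  Ss     0:01.23 /usr/sbin/sshd",  # invented line, for contrast
]

for pid, command in stuck_processes(sample):
    print(pid, command)
```

Logging that count every minute or so would at least leave a trail
showing when the pile-up started.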

Can anyone shed some light on this, or give me some options to try?
What happened to kernel panics and such when there were serious errors
going on?  The only glimmer of information I have is that *one* time
there was an error on the console about there not being any RAID
controller available.  I did purchase a spare controller and I'm about
to swap it in and see if it helps, but for some reason I doubt it.  If
a controller like that were failing, I would certainly hope to see some
serious error messages or panics going on.

I have been running FreeBSD since version 1.01 and have never had a box
so unstable in the last 12 or so years, especially one that is supposed
to be "server" quality instead of the makeshift ones I put together
with desktop hardware.  And last, I'm getting sick of my Linux admin
friends telling me "told you so!  should have run Linux...", so please
give me something to stick in their pie holes!


