FreeBSD Crash without Errors, Warnings, or Panics
matthew at digitalstratum.com
Thu Apr 13 19:06:00 UTC 2006
John Baldwin wrote:
> On Thursday 13 April 2006 14:17, Matthew Hagerty wrote:
>> I'm running 6.0-RELEASE-p5 on a Toshiba-built server: dual Xeon Intel
>> motherboard with an LSILogic MegaRAID (amr0) controller. This machine
>> has been running for about 2 years now, and was very stable until I
>> updated from 5.3 to 5.4, and now 6.0. The crashing seems to be totally
>> random and I have had it crash in as little as 12 hours and as long as
>> 143 days.
>> When the box goes down it does so in a strange way. First, it still
>> responds to network probes like ping (usually), however, all console
>> access is ignored. Also, some network ports still respond, like a
>> telnet to port 22 to test SSH will yield an SSH banner, but trying to
>> connect with SSH just hangs. Sometimes this is also true of the SMTP
>> server, but not always. This also makes it impossible for me to use
>> CARP to swap to the recently purchased spare machine, since the network
>> interface is generally still responding so CARP does not detect a problem.
>> My biggest problem with this is that there are *never* any console
>> messages or log entries in any logs, no warnings about disk failure,
>> buffer exhaustion, system failures, etc. The machine simply seems to
>> stop responding and the only way to correct the problem is a hard reboot.
>> A strange thing did happen yesterday, though: I believe I caught the box
>> on the verge of failure. I was SSH'd in and did a ps to check things
>> out. There were about 100 of these entries:
>> 55050 ?? D 0:00.00 postmaster: ipa ipa ::1(63061) startup (postgres)
>> The box runs a web-based app and connects to a local Postgres DB which
>> seemed to be unable to start new connections being requested by the PHP
>> scripts. At any rate, I stopped Apache and then tried to stop Postgres
>> which resulted in (or just happened to coincide with) the box locking up
>> and no longer responding to my SSH commands or attempts to reconnect
>> with SSH. I hardly think this is a Postgres problem, but even if it
>> was, a userland app should *not* be able to bring down a box...
>> Can anyone shed some light on this, give me some options to try? What
>> happened to kernel panics and such when there were serious errors going
>> on? The only glimmer of information I have is that *one* time there was
>> an error on the console about there not being any RAID controller
>> available. I did purchase a spare controller and I'm about to swap it
>> out and see if it helps, but for some reason I doubt it. If a
>> controller like that was failing, I would certainly hope to see some
>> serious error messages or panics going on.
>> I have been running FreeBSD since version 1.01 and have never had a box
>> so unstable in the last 12 or so years, especially one that is supposed
>> to be "server" quality instead of the make-shift ones I put together
>> with desktop hardware. And last, I'm getting sick of my Linux admin
>> friends telling me "told you so! should have run Linux...", please give
>> me something to stick in their pie holes!
> It sounds like a livelock (or deadlock) more than a crash. Can you add
> 'DDB' in your kernel config and break into the debugger when it hangs
> and grab the output of 'ps'?
I can probably figure out how to compile in DDB (I've never done it
before though), but just two questions:
1. How do I break into DDB and grab the ps output?
2. How can I log in if the box is not responding to SSH or the console?
It was only by sheer luck that I caught it yesterday just before the
lockup; I have never been able to do that before.
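
[For reference, a minimal sketch of the kernel config change suggested
above, assuming a stock 6.x source tree. MYKERNEL is a hypothetical
config name, and whether the console key sequence still works during
this particular hang is untested.]

```
# Additions to a custom kernel config, e.g. /usr/src/sys/i386/conf/MYKERNEL
# (MYKERNEL is a placeholder name, not something from this thread)
options         KDB                     # kernel debugger framework
options         DDB                     # interactive in-kernel debugger
options         BREAK_TO_DEBUGGER       # a serial console BREAK enters the debugger

# Rebuild and install, then reboot:
#   cd /usr/src
#   make buildkernel KERNCONF=MYKERNEL
#   make installkernel KERNCONF=MYKERNEL
#
# Ways to enter DDB once the new kernel is running:
#   - Ctrl+Alt+Esc on the syscons (video) console
#   - BREAK on a serial console (needs BREAK_TO_DEBUGGER)
#   - sysctl debug.kdb.enter=1 from a shell, to test while the box is healthy
#
# At the db> prompt:
#   ps          list processes and what each is waiting on
#   continue    leave the debugger and resume the system
```

No login is needed to reach DDB from the console, since the debugger
runs inside the kernel. A serial console logged by another machine is
probably the easiest way to capture the 'ps' output.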