FreeBSD Crash without Errors, Warnings, or Panics
John Baldwin
jhb at freebsd.org
Thu Apr 13 19:34:14 UTC 2006
On Thursday 13 April 2006 15:15, Julian Elischer wrote:
> Matthew Hagerty wrote:
>
> > John Baldwin wrote:
> >
> >> On Thursday 13 April 2006 14:17, Matthew Hagerty wrote:
> >>
> >>
> >>> Greetings,
> >>>
> >>> I'm running 6.0-RELEASE-p5 on a Toshiba built server: dual Xeon
> >>> Intel motherboard with a LSILogic MegaRAID (amr0) controller. This
> >>> machine has been running for about 2 years now, and was very stable
> >>> until I updated from 5.3 to 5.4, and now 6.0. The crashing seems to
> >>> be totally random and I have had it crash in as little as 12 hours
> >>> and as long as 143 days.
> >>>
> >>> When the box goes down it does so in a strange way. First, it still
> >>> responds to network probes like ping (usually), however, all console
> >>> access is ignored. Also, some network ports still respond, like a
> >>> telnet to port 22 to test SSH will yield an SSH banner, but trying
> >>> to connect with SSH just hangs. Sometimes this is also true of the
> >>> SMTP server, but not always. This also makes it impossible for me
> >>> to use CARP to swap to the recently purchased spare machine, since
> >>> the network interface is generally still responding so CARP does not
> >>> detect a problem.
> >>>
> >>> My biggest problem with this is that there are *never* any console
> >>> messages or log entries in any logs, no warnings about disk failure,
> >>> buffer exhaustion, system failures, etc. The machine simply seems
> >>> to stop responding and the only way to correct the problem is a hard
> >>> reboot.
> >>>
> >>> A strange thing did happen yesterday though, I believe I caught the
> >>> box on the verge of failure. I was SSH'd in and did a ps to check
> >>> things out. There were about 100 of these entries:
> >>>
> >>> 55050 ?? D 0:00.00 postmaster: ipa ipa ::1(63061) startup (postgres)
> >>>
> >>> The box runs a web-based app and connects to a local Postgres DB
> >>> which seemed to be unable to start new connections being requested
> >>> by the PHP scripts. At any rate, I stopped Apache and then tried to
> >>> stop Postgres which resulted in (or just happened to coincide with)
> >>> the box locking up and no longer responding to my SSH commands or
> >>> attempts to reconnect with SSH. I hardly think this is a Postgres
> >>> problem, but even if it was, a userland app should *not* be able to
> >>> bring down a box...
> >>>
> >>> Can anyone shed some light on this, give me some options to try?
> >>> What happened to kernel panics and such when there were serious
> >>> errors going on? The only glimmer of information I have is that
> >>> *one* time there was an error on the console about there not being
> >>> any RAID controller available. I did purchase a spare controller
> >>> and I'm about to swap it out and see if it helps, but for some
> >>> reason I doubt it. If a controller like that was failing, I would
> >>> certainly hope to see some serious error messages or panics going on.
> >>>
> >>> I have been running FreeBSD since version 1.01 and have never had a
> >>> box so unstable in the last 12 or so years, especially one that is
> >>> supposed to be "server" quality instead of the make-shift ones I put
> >>> together with desktop hardware. And last, I'm getting sick of my
> >>> Linux admin friends telling me "told you so! should have run
> >>> Linux...", please give me something to stick in their pie holes!
> >>>
> >>
> >>
> >> It sounds like a livelock (or deadlock) more than a crash. Can you add
> >> 'DDB' in your kernel config and break into the debugger when it hangs
> >> and grab the output of 'ps'?
> >>
> >>
> >
> > I can probably figure out how to compile in DDB (I've never done it
> > before though), but just two questions:
>
>
> add
> options DDB
> to your kernel config file.
>
> >
> > 1. How do I break into DDB and grab the ps output?
>
> on the console, hit <CTRL><ALT><ESC> keys (at once)
>
> that should put you into the debugger..
>
> then 'ps' will give you some output.
>
> It's a lot to write down but I've found a camera phone makes good enough
> snapshots :-)
>
> Alternatively, you can use a serial console, but getting into the
> debugger is harder: you have to compile ALT_BREAK_TO_DEBUGGER
> into your kernel by adding
>
> # Solaris implements a new BREAK which is initiated by a character
> # sequence CR ~ ^b which is similar to a familiar pattern used on
> # Sun servers by the Remote Console.
> options ALT_BREAK_TO_DEBUGGER
>
> to the kernel config file you are using..

Or just use 'options BREAK_TO_DEBUGGER' and send a serial break signal
to break into the debugger.

Matthew,

There's also a chapter in the handbook that explains how to use ddb,
set up a serial console, etc.
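
As a rough sketch that pulls together the options mentioned in this
thread, the debugger part of a custom kernel config might look like the
lines below. This assumes a 6.x kernel, where DDB sits on top of the KDB
framework, so 'options KDB' is listed too even though nobody mentioned
it above; treat that line as an assumption and check NOTES for your
release. Only one of the two BREAK options is needed.

    # Kernel debugger framework plus the DDB backend
    # (KDB is an assumption here, not something quoted in this thread)
    options KDB
    options DDB

    # Enter the debugger when a serial BREAK is received
    options BREAK_TO_DEBUGGER

    # Or enter it on the character sequence CR ~ ^b
    options ALT_BREAK_TO_DEBUGGER

After rebuilding and booting that kernel, <CTRL><ALT><ESC> on the
console (or the serial break) should land at the debugger prompt, where
'ps' prints the process list.
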
--
John Baldwin <jhb at FreeBSD.org> <>< http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve" = http://www.FreeBSD.org