FreeBSD Crash without Errors, Warnings, or Panics

Paul Saab ps at freebsd.org
Fri Apr 14 00:01:22 UTC 2006


There are serious race conditions in the amr driver in 6.0 that can cause 
the system to hang.  I suggest you take the amr driver from RELENG_6 and 
try that.

Matthew Hagerty wrote:
> Greetings,
>
> I'm running 6.0-RELEASE-p5 on a Toshiba-built server: a dual Xeon Intel 
> motherboard with an LSI Logic MegaRAID (amr0) controller.  This machine 
> has been running for about 2 years now, and was very stable until I 
> updated from 5.3 to 5.4, and now 6.0.  The crashes seem to be totally 
> random: the box has gone down after as little as 12 hours and as long 
> as 143 days of uptime.
>
> When the box goes down it does so in a strange way.  It usually still 
> responds to network probes like ping, but all console input is 
> ignored.  Some network ports also still respond: a telnet to port 22 
> yields an SSH banner, but an actual SSH connection just hangs.  
> Sometimes this is also true of the SMTP server, but not always.  This 
> also makes it impossible for me to use CARP to swap to the recently 
> purchased spare machine, since the network interface is generally 
> still responding, so CARP does not detect a problem.
>
> My biggest problem with this is that there are *never* any console 
> messages or entries in any logs: no warnings about disk failure, 
> buffer exhaustion, system failures, etc.  The machine simply seems to 
> stop responding and the only way to correct the problem is a hard reboot.
>
> A strange thing did happen yesterday, though: I believe I caught the 
> box on the verge of failure.  I was SSH'd in and ran ps to check 
> things out.  There were about 100 of these entries:
>
> 55050  ??  D      0:00.00 postmaster: ipa ipa ::1(63061) startup (postgres)
>
> The box runs a web-based app that connects to a local Postgres DB, 
> which seemed to be unable to start the new connections being requested 
> by the PHP scripts.  At any rate, I stopped Apache and then tried to 
> stop Postgres, which resulted in (or just happened to coincide with) 
> the box locking up and no longer responding to my SSH commands or 
> attempts to reconnect with SSH.  I hardly think this is a Postgres 
> problem, but even if it were, a userland app should *not* be able to 
> bring down a box...
>
> Can anyone shed some light on this or give me some options to try?  
> What happened to kernel panics and such when there were serious errors 
> going on?  The only glimmer of information I have is that *one* time 
> there was an error on the console about there not being any RAID 
> controller available.  I did purchase a spare controller and I'm about 
> to swap it in and see if it helps, but for some reason I doubt it.  
> If a controller like that were failing, I would certainly hope to see 
> some serious error messages or panics.
>
> I have been running FreeBSD since version 1.01 and have never had a 
> box so unstable in the last 12 or so years, especially one that is 
> supposed to be "server" quality instead of the makeshift ones I put 
> together with desktop hardware.  Lastly, I'm getting sick of my 
> Linux admin friends telling me "told you so!  should have run 
> Linux...".  Please give me something to stick in their pie holes!
>
> Thanks,
> Matthew
>
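
On the CARP point in the quoted message: ping and even the SSH banner 
keep answering while processes sit in disk wait (state D in ps), so a 
health check would have to exercise the disk itself rather than the 
network.  Below is a minimal sketch in Python of such a probe, under 
stated assumptions.  The function name disk_write_ok, the 5 second 
timeout, and the choice of probe directory are illustrative and not 
taken from the amr driver, CARP, or the original report.  The write runs 
in a helper thread because a write stuck in the kernel cannot be 
interrupted; the caller only waits for it with a timeout.

```python
import os
import tempfile
import threading


def disk_write_ok(directory, timeout_s=5.0):
    """Return True if a small write+fsync in `directory` completes in time.

    Hypothetical helper: the name, timeout, and directory are assumptions
    made for illustration only.
    """
    result = {"ok": False}

    def probe():
        try:
            fd, path = tempfile.mkstemp(dir=directory)
            try:
                os.write(fd, b"probe")
                os.fsync(fd)  # push the data down to the storage layer
            finally:
                os.close(fd)
                os.unlink(path)
            result["ok"] = True
        except OSError:
            # Missing directory, full disk, I/O error, and so on.
            result["ok"] = False

    # Daemon thread: if the write blocks in the kernel, the caller still
    # gets an answer once the timeout expires.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return result["ok"] and not worker.is_alive()
```

A cron job or small daemon could call disk_write_ok() against a 
directory on the RAID volume and take some action when it returns False.  
How that action would be tied into CARP failover is left open here, 
since it depends on the local setup.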

