FreeBSD Crashes Intermittently !!
nightrecon at hotmail.com
Fri Mar 11 19:06:59 UTC 2016
shahzaib shahzaib wrote:
> I am new to this mailing list, so please pardon any mistakes. We've been
> using FreeBSD for the past 4-5 months and have faced an auto-reboot crash
> issue since the beginning. The server specs are as follows:
> Supermicro X5690 (12 cores, 24 threads - 2u)
> 96GB RAM
> 12x3TB mirror+striping (HBA-LSI9211)
> X8DT3 Board
> We have a total of 5 Supermicro servers built on the same hardware, and all
> of them intermittently go down. Sometimes they crash and boot up
> automatically (within 6 min), and sometimes they freeze and we have to
> manually boot them via the IPMI interface. Every time, we get 'MCA Internal
> Timer Error' in the crash logs. Here is the most recent one:
> When we reported this issue to our hardware vendor, he said that it is due
> to FreeBSD's incompatibility with the hardware and suggested that we try
> installing Linux on one of them. We proceeded with Debian on one of them,
> but all in vain: the server was still crashing. When we told him that his
> proposal had failed, he then said that it could be related to an
> application which is causing the crash.
He is just trying his best to point the finger somewhere else, anywhere else.
The bottom line is that he doesn't want to let you return the machines and
refund your money. If the hardware has the same problems under different
operating systems, something is wrong with the hardware.
> Now, if he really is right, then RAM should first be exhausted and swapped
> out in order to make the OS crash, but that never happened. We've never
> been out of memory, as 96GB of RAM is plenty. We've also taken some steps
> to debug this issue:
This "let's blame it on an application" will never produce positive results
if the problem is truly hardware related.
One long-standing and well-known situation is that poorly engineered hardware
usually gets "fixed" in the Wintel world by patching workarounds into driver
code. This just hides the problem from the user. So in a situation like this
you will find the machine magically doesn't crash when running Windows on it,
but since these magic-bullet "fixes" do not map directly into the Unix world,
it takes a lot more developer effort to achieve a similar repair. When you
see this effect, most of it is in (but not necessarily limited to) driver
code.
> Now I am confused whether an application really can crash a server without
> swapping it out. Could there be any PHP function which could cause a
> crash? :-| Is FreeBSD the cause of the crash? Things are pretty blurred
> right now :(.
If it also crashes with Debian why would you want to blame FreeBSD?
> Here are the kernel tuning values:
My own recommendation is to simplify things down. I usually start by choosing
the "default" BIOS CMOS settings. Usually there are two; one is a bare-bones
default and the other is an "optimized defaults". I almost always start with
the "optimized" choice as it is still very
I would remove all customizations and reduce to a pristine OS install with no
tunings, even to the point of disabling the LSI controller and running the
box on just a SATA drive or two. This gets the driver for the LSI controller
out of the way.
But let me back up the train first. The first thing I'd do after setting the
BIOS to defaults is disable the HPET timer. The second thing I'd do is
disable the NUMA-aware OS setting. The third thing I'd do is take the ipmi
load out of loader.conf. I'd also disable the entire USB subsystem at some
point in the experimentation to rule out FreeBSD's USB subsystem, etc.
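
The strip-down order can be tracked with something as small as the sketch
below. It is only a bookkeeping aid under stated assumptions: the step names
paraphrase this post, the function name is made up for illustration, and
nothing in it touches the BIOS or loader.conf. You fill in the results by hand
after letting the box soak at each step.

```python
# Hypothetical checklist for the strip-down approach described in this post.
# The step names are paraphrased; the order is one possible ordering.
STEPS = [
    "BIOS optimized defaults",
    "HPET disabled",
    "NUMA-aware setting disabled",
    "ipmi removed from loader.conf",
    "USB subsystem disabled",
    "LSI controller disabled, plain SATA boot disk",
]


def first_stable_step(crashed_after):
    """Return the first step after which the box stopped crashing, or None.

    crashed_after maps a step name to True (still crashed) or False (stable).
    It is filled in by hand after each soak period.
    """
    for step in STEPS:
        if step not in crashed_after:
            # Step not tested yet, so nothing can be concluded.
            return None
        if not crashed_after[step]:
            return step
    return None


if __name__ == "__main__":
    results = {
        "BIOS optimized defaults": True,
        "HPET disabled": True,
    }
    print(first_stable_step(results))
```

If the function ever returns a step name, the subsystem changed at that step
is the first suspect. If every step still crashes, that itself points back at
the hardware.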
Basically, remove and strip things down until the problem goes away. Then you
have a smaller pool of possible subsystems causing the problem. The HPET
timer is best used in a Windows environment for synchronizing multimedia. If
you disable it in the BIOS only to find that FreeBSD tries to utilize it
anyway, it can be disabled in loader.conf with:
I mean, in the *Nix world do you really want sub-millisecond time stamps on
all logging, just spinning CPU cycles and wasting performance? I suspect *Nix
systems run better without the HPET timer. That is my opinion; I don't have
benchmarks to prove it.
In my experience, software bugs can usually be narrowed down to a very
specific sequence of steps that reliably reproduces the problem. Hardware
problems, on the other hand, can show little or no pattern whatsoever
(totally erratic). And the intermittent hardware failure is the absolute
worst, because you can only really troubleshoot during the period the
intermittent is showing. If you get an intermittent that is essentially
instantaneous, it happens and is gone. This is very frustrating, as it
usually reboots the box with little or no info left behind to go on.
However, I'd also like to point out that 5 machines all doing the same thing
is not likely to be an intermittent. IMHO this points to a hardware
compatibility situation. My first thought there is almost always the memory
subsystem. If the RAM has not had a proper engineering validation, I'd look
at the list from Supermicro and try to obtain something that has been
validated. Should this magically make the problem disappear, then the vendor
you bought the hardware from is putting RAM into boxen that does NOT have a
guarantee that it will work. I've seen machines that would behave fine during
normal operation but that would reboot only when running a make buildworld,
as this pushes the RAM quite a bit harder than regular day-to-day stuff.
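
As a rough illustration of "pushing the RAM harder", here is a minimal
userland write/read-back pattern check. This is a sketch, not a real memory
tester: it only touches whatever pages the OS hands to the process, CPU
caches can mask faults, and the function names and buffer size are arbitrary
choices of mine. A proper standalone memory tester or a long buildworld run
remains the better test.

```python
import array


def pattern_pass(n_words, pattern):
    """Fill a buffer with one 64-bit pattern, read it back, count mismatches."""
    # 'Q' is an unsigned 64-bit item; repetition allocates n_words of them.
    buf = array.array("Q", [pattern]) * n_words
    return sum(1 for word in buf if word != pattern)


def run(n_words=1 << 20):
    """Run a few classic bit patterns and report mismatches per pattern."""
    patterns = [
        0x0000000000000000,
        0xFFFFFFFFFFFFFFFF,
        0xAAAAAAAAAAAAAAAA,  # alternating bits
        0x5555555555555555,  # the complement of the above
    ]
    return {hex(p): pattern_pass(n_words, p) for p in patterns}


if __name__ == "__main__":
    for name, bad in run().items():
        print(name, "mismatches:", bad)
```

Any non-zero mismatch count is a strong hint of bad or marginal memory. A
clean run proves very little, for the reasons given above.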
Just a few $0.02 food-for-thought type things. It's nice to have the time to
be able to drill down and discover a solution. It's rewarding. But in the
real world, as soon as I saw Debian produce the same situation I'd be on the
phone to RMA. If I had a little more time I might try Windows to see if it
somehow "Just Works". The data point there is that it would point to the
possibility that Wintel is releasing driver patch workarounds to cover up
poor hardware design. But really, I generally don't have this kind of time,
and in order to meet deadlines I sometimes have to go with a Plan B even if I
don't like it.