FreeBSD Crashes Intermittently !!

Valeri Galtsev galtsev at kicp.uchicago.edu
Wed Mar 9 17:03:09 UTC 2016


On Wed, March 9, 2016 6:24 am, shahzaib shahzaib wrote:
> Hi,
>
> I am new to this mailing list so please pardon me for any mistakes. We've
> been using FreeBSD for the past 4-5 months and have faced an auto-reboot
> crash issue since the beginning. The server specs are as follows:
>
> Supermicro X5690 (12 cores, 24 threads - 2u)
> 96GB RAM
> 12x3TB mirror+striping (HBA-LSI9211)
> X8DT3 Board
>
> We have a total of 5 Supermicro servers built on the same hardware, and
> all of them intermittently go down. Sometimes they crash and boot up
> automatically (within 6 min), and sometimes they freeze and we have to
> reboot them manually via the IPMI interface. Every time, we get 'MCA
> Internal Timer Error' in the crash logs. Here is the most recent one:
>
> http://pastebin.com/042SJ11c
>
> When we reported this issue to our hardware vendor, he said that it is due
> to FreeBSD's incompatibility with the hardware and suggested that we try
> installing Linux on one of them. We did so, with Debian, but all in vain:
> the server was still crashing. When we told him that his proposal had
> failed, he then said that the crash could be caused by the application.

Not correct. Normally, neither on FreeBSD nor on Linux can an application
crash the system. The worst that can happen is that the application's
process (or processes) will die or get killed. You quite likely have a
hardware problem. It doesn't seem your hardware vendor did a burn-in
test of your boxes.
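
A machine check is raised by the CPU itself, and its status word can be
read field by field. Here is a minimal sketch in Python, assuming the bit
layout of the IA32_MCi_STATUS register as described in the Intel Software
Developer's Manual. The helper name decode_mci_status and the sample value
in the note below are mine, made up for illustration; the value is not
taken from your crash log.

```python
# Sketch: decode the architectural fields of an Intel IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM; decode_mci_status is a
# hypothetical helper, not part of FreeBSD or any library.

FLAG_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC register is valid
    "ADDRV": 58,  # MCi_ADDR register is valid
    "PCC": 57,    # processor context corrupt
}

# Simple MCA error code for "internal timer error": 0000 0100 0000 0000
INTERNAL_TIMER_ERROR = 0x0400


def decode_mci_status(status):
    """Return the flag bits and error codes of a 64-bit status value."""
    decoded = {
        name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()
    }
    decoded["mca_code"] = status & 0xFFFF                     # bits 15:0
    decoded["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    decoded["internal_timer_error"] = (
        decoded["mca_code"] == INTERNAL_TIMER_ERROR
    )
    return decoded
```

For example, decode_mci_status(0xB200000000000400) (a made-up value)
reports VAL, UC, EN and PCC set with MCA code 0x0400. An error of this
kind is detected and reported by the processor hardware, which is why I
would look at the hardware before blaming an application.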

>
> Now, if he really is right, then RAM should first be swapped out in full
> in order to make the OS crash

Not correct. Normally, if you run the system out of memory (including
swap), one or a few processes can get killed, but the system will not
crash. It may appear to be locked up (unresponsive) for some time if you
have a large swap, because with every process switch it has to swap memory
pages in and swap something out to switch to the next process. That is why
I prefer not to have swap on huge-memory boxes, or at least never to have
a large swap.

> but that never happened; we've never been out of memory, as 96GB RAM is
> pretty high. We've also taken some precautions to debug this issue:
>
> - Replacing Power-Supply.
> - Reducing CMOS in BIOS.
> - Disabling Intel Powersaving features.
> - Upgrade Bios
>
>
> Now we do not know how and what to debug. If you need more details, please
> visit following thread which we created 2 months back :
>
> https://forums.freebsd.org/threads/54412/
>

To simplify your life, update to the latest (and yes, stay with RELEASE -
which I see you have in that forum thread).

> Now I am confused whether an application really can crash the server
> without swapping it out. Could there be any PHP function which could cause
> a crash :-| ? Is FreeBSD the cause of the crash? Things are pretty blurred
> right now :(. Here are the kernel tuning values:

Again, no: an "application" cannot crash the kernel. Apart from hardware,
only what runs in kernel context can, e.g. hardware drivers. With your
machine, I would first make sure your hardware is sane.

Here is what I would do if it were my box:

1. Go into the BIOS and make sure the temperature thresholds are not too
low (even though this doesn't seem to be your case), and disable BIOS
hardware memory hole remapping.

2. Inspect what is installed inside the box and how (e.g. some cards may
not be seated well, not fully engaged in their connectors - which doesn't
seem to be your case either).

3. Check that all memory is of the same brand and same type (which is
likely to be your problem).

4. Check that the RAM and CPUs are on the motherboard manufacturer's list
of supported components for this particular motherboard model.

5. If not all memory slots are filled, check the motherboard manual for
how to partially fill the memory slots. Basically, if the memory bus leads
are not terminated, one should first fill the slots farthest from the CPU
or memory controller (thus avoiding reflection from the end of an
unterminated transmission line).

6. Leave a minimal amount of hardware on the motherboard, and see whether
the box still crashes. This means: remove all add-on cards which the
machine can run without (for testing purposes), remove all CPUs except for
CPU #0, which the machine boots off, and put in a minimal amount of RAM
(these days memory controllers are on the CPU substrate, so make sure the
RAM is plugged into slots connected to CPU #0). Make sure you remove all
components first, then re-install the minimal set.

Run a memory test (memtest86; you can find bootable CDs which include
memtest86).
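
Items 3 and 4 of the list above can be partly checked without opening the
box. Here is a rough sketch in Python, assuming the dmidecode utility is
installed and that its "-t memory" output consists of blank-line-separated
"Memory Device" records with "Key: Value" lines, which is what I would
expect to see. The function names are mine, for illustration only.

```python
import re
import subprocess
from collections import defaultdict


def parse_memory_devices(text):
    """Parse `dmidecode -t memory` text into one dict per populated DIMM."""
    dimms = []
    for block in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in block.splitlines()]
        if "Memory Device" not in lines:
            continue  # skip the header and non-DIMM records
        fields = {}
        for line in lines:
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields.get("Size", "").startswith("No Module"):
            continue  # empty slot
        dimms.append(fields)
    return dimms


def group_by_model(dimms):
    """Group slot locators by (manufacturer, part number, speed)."""
    groups = defaultdict(list)
    for dimm in dimms:
        model = (
            dimm.get("Manufacturer"),
            dimm.get("Part Number"),
            dimm.get("Speed"),
        )
        groups[model].append(dimm.get("Locator"))
    return dict(groups)


def run_dmidecode():
    """Return the memory records of this machine (needs root privileges)."""
    result = subprocess.run(
        ["dmidecode", "-t", "memory"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On the box itself, group_by_model(parse_memory_devices(run_dmidecode()))
run as root should give one entry per distinct DIMM model. More than one
entry means mixed memory, and each part number can then be compared with
the motherboard's supported memory list.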


Observe anti-static precautions! I've seen memory chips that were slightly
fried by static discharge (other electronic components may be like that
too). They were working as if they were good, but at some point later they
started failing. They may be a bit out of spec after a static discharge,
which may cause random errors.


What hardware components can cause problems like yours (apart from using
CPUs or RAM that are not supported by the motherboard)? The motherboard
itself (e.g., micro cracks in some PCB leads), RAM (most likely), CPU, or
poorly installed PCI-X (or PCI, PCI-E) cards.

Good luck troubleshooting!

Incidentally, I do have a bunch of boxes with Supermicro system boards
running FreeBSD 9.3 and FreeBSD 10.2; none of them ever crashes.

Valeri

>
> http://pastebin.com/nEnxkV6y
>
> Please help us further !!
>
> Regards.
> Shahzaib


++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++

