why would I get a segmentation fault on one system but not the other?

Sun Feb 22 15:48:37 UTC 2015

On Sun, February 22, 2015 4:44 am, David Benfell wrote:
> On Sun, Feb 22, 2015 at 09:19:56AM +0100, Polytropon wrote:
>> On Sat, 21 Feb 2015 17:03:50 -0600, cpet wrote:
>> > As well as don't use stable on a production box as STABLE doesn't mean
>> > what it means.
>>
>> STABLE means that the API/ABI is stable. Unlike HEAD (CURRENT),
>> STABLE still is actually _stable_ in most cases, so it's a valid
>> solution for production systems (given that you're prepared well,
>> and you know what you're doing). I'm running STABLE on few
>> production machines myself (where this is needed), but I usually
>> prefer (and often recommend) using RELEASE and add the security
>> patches when they are available.
>>
> Thinking about this more, I'm inclined to think my problem is not with
> the base system. I haven't seen *any* crashes with stuff that can be
> clearly identified as being in the base system, let alone the kernel.
>
> My memory test has just completed a 4th pass with zero errors. It's
> now been running for 7.5 hours.
>

How long does the box run before it segfaults? Some memory errors occur
only with low probability, so a short memtest run may come back clean
simply because it did not run long enough to catch the rarer errors.
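
To put a rough number on that, here is a minimal sketch in Python. It
assumes each memtest pass has the same, independent chance of catching the
fault; that per-pass probability is a made-up illustrative figure, not
something measured on your machine, and the function name is my own.

```python
def miss_probability(p_detect_per_pass: float, passes: int) -> float:
    """Chance that every one of `passes` independent passes misses the fault.

    p_detect_per_pass is a hypothetical per-pass detection probability.
    """
    if not 0.0 <= p_detect_per_pass <= 1.0:
        raise ValueError("p_detect_per_pass must be between 0 and 1")
    if passes < 0:
        raise ValueError("passes must be non-negative")
    # Each pass misses with probability (1 - p); independent passes multiply.
    return (1.0 - p_detect_per_pass) ** passes


if __name__ == "__main__":
    # Illustrative values only: a fault that is easy, hard, or very hard to hit.
    for p in (0.5, 0.1, 0.01):
        print(f"p={p}: chance 4 clean passes hide the fault = "
              f"{miss_probability(p, 4):.4f}")
```

Under these assumptions, a fault that shows up in only 1 of 10 passes would
still slip through 4 clean passes about two times out of three, so four
error-free passes do not rule out marginal RAM.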

What is the load on the machine when the segfault happens? During memtest86
the load is "zero". Under actual server load, you may be heating the
interior of the box, and the memory controller in particular, to higher
temperatures, which increases the chance of malfunction.

Do you have ECC or non-ECC memory? If non-ECC, can you replace it with ECC?
(Some memory controllers accept both.) Is it possible that you have a
mixture of different types of RAM attached to the same memory controller?
(I've seen even different brands claiming the same specs cause occasional
malfunctions.) Also, which slots do you use for RAM? If not all slots are
populated, start filling the slots that are farthest from the memory
controller (which is on the CPU substrate these days, hence farthest from
the CPU). If you leave the farthest slots open, you will have an open
(unterminated) portion of transmission line causing reflections that
interfere with the signal, leading to trouble. Some fancy system boards do
have memory bus terminators, so what I said about slots doesn't matter for
them, but the majority of boards do not. If the hardware is a suspect, I
would begin with a minimal amount of known good RAM.

Swapping RAM between the good and bad machines is another thing to try. I,
however, would instead swap the hard drives and see which of the machines
starts failing after that. This way you will know for sure whether software
(+ hard drive) is to blame (if the other machine starts failing) or
hardware (if the same machine, now running the system from the good
machine, keeps failing).
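
The reasoning behind the drive swap can be written down as a small decision
function. This is only a sketch: the function and argument names are
hypothetical, and the two "inconclusive" outcomes are my own assumption for
cases the swap does not settle.

```python
def diagnose(suspect_box_still_fails: bool, good_box_now_fails: bool) -> str:
    """Interpret a drive swap between a failing box and a known-good box.

    suspect_box_still_fails: the originally failing machine, now booted from
        the good machine's drive, still segfaults.
    good_box_now_fails: the originally good machine, now booted from the
        suspect drive, starts segfaulting.
    """
    if suspect_box_still_fails and not good_box_now_fails:
        # The fault stayed with the chassis, not with the installed system.
        return "hardware"
    if good_box_now_fails and not suspect_box_still_fails:
        # The fault followed the drive and the system installed on it.
        return "software or hard drive"
    if suspect_box_still_fails and good_box_now_fails:
        # Assumption: both failing means more than one cause may be in play.
        return "inconclusive: both fail"
    # Assumption: no failures yet may just mean the test has not run long enough.
    return "inconclusive: neither fails yet"


if __name__ == "__main__":
    print(diagnose(suspect_box_still_fails=True, good_box_now_fails=False))
```

Give both machines at least as long as it normally takes for the segfault
to appear before reading anything into the result.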

Good luck!

Valeri

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++