Debugging bad memory problems

Valeri Galtsev galtsev at kicp.uchicago.edu
Sun Apr 26 22:15:49 UTC 2015


On Sun, April 26, 2015 4:05 pm, Fernando Apesteguía wrote:
> On Sun, Apr 26, 2015 at 10:05 PM, Valeri Galtsev
> <galtsev at kicp.uchicago.edu> wrote:
>>
>> On Sun, April 26, 2015 12:11 pm, Fernando Apesteguía wrote:
>>> Hi,
>>>
>>> I suspect my old and beloved AMD64 laptop is suffering from bad memory
>>> problems: I get random crashes of well tested programs like sh, which,
>>> etc even when I executed some of them from /rescue.
>>
>> If RAM is a suspect the first thing I would do is re-seat memory
>> modules. Open the box. (Observe static precautions!) Remove memory
>> modules. Install them again.
>>
>> Do memtest86 (by booting into memtest86, you can have that in your boot
>> options, or you can boot off external media as others suggested).
>>
>> If you still have problems: try to run with one memory module instead of
>> two. At some point, when they went to higher RAM speeds, the memory bus
>> amplifier became more fragile (some chips, some manufacturers; as it is
>> now part of the CPU, this may be true only of some CPU models). You
>> can sometimes slightly fry it if you merely leave the laptop running on
>> battery, letting the battery run down and the laptop power off as a
>> result. With some chips this may slightly fry the memory controller
>> portion of the chip, the address bus amplifier in particular. The bus
>> amplifier becomes slightly slower (lower frequency), which results in
>> poorer handling of capacitive load (which is larger if you have more
>> RAM), so it is only marginally OK and occasionally produces address
>> errors. Going to one module may resolve this. You will know this is
>> likely the case if memtest86 succeeds with each RAM module on its own,
>> but fails (in random places, often not reproducible) with both installed.
>>
>> Good luck!
>
> I booted from a memtest CD-ROM. It passed a couple of tests fine and
> then it rebooted while doing a "bit fade" test at around 93%. Removing
> the modules is tricky since this laptop has screws all around in dark
> corners (even removing the battery needs a screw driver). I will try
> to limit physical memory with hw.physmem and see if it makes any
> difference.

Limiting hw.physmem will not help against what I mentioned, as the
capacitive load on the memory address bus is defined by what is physically
attached to it, not by how much of that memory the kernel uses.

One usually runs memtest86 for 24 hours at least. A single loop will catch
"solid defects" like adjacent lines on the board being connected (when they
shouldn't be). Memory related failures, on the contrary, are often
intermittent. In the worst case I've seen, they only manifested under
intense load of the box (whereas memtest86 is equivalent to almost zero load).
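
To illustrate why many passes matter more than one, here is a minimal
sketch of a repeated pattern test in Python. It is only an illustration of
the fill-and-verify loop, not a replacement for memtest86: a user-space
program cannot choose physical addresses, and the names fill, verify and
soak_test are made up for this example.

```python
import random


def fill(buf, pattern):
    """Overwrite every byte of buf with the given byte pattern."""
    buf[:] = bytes([pattern]) * len(buf)


def verify(buf, pattern):
    """Return the offsets whose content no longer matches the pattern."""
    return [i for i, b in enumerate(buf) if b != pattern]


def soak_test(size_bytes, passes, seed=0):
    """Run fill/verify over one buffer for many passes.

    Returns a list of (pass_number, pattern, offset) mismatches.
    A solid defect would show up on the first pass; an intermittent
    one may only appear after many passes, if at all.
    """
    rng = random.Random(seed)
    buf = bytearray(size_bytes)
    errors = []
    for n in range(passes):
        # Fixed patterns plus one random byte per pass.
        for pattern in (0x00, 0xFF, 0x55, 0xAA, rng.randrange(256)):
            fill(buf, pattern)
            for offset in verify(buf, pattern):
                errors.append((n, pattern, offset))
    return errors


if __name__ == "__main__":
    # Small numbers for a quick run; a real soak would use a large
    # buffer and run for hours.
    print(soak_test(1024 * 1024, passes=3))
```

On healthy hardware the list should stay empty however long it runs; the
point is only that a fault which shows up once in thousands of passes is
invisible to a single loop.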

Good luck!

Valeri

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++


More information about the freebsd-questions mailing list