Debugging bad memory problems

jd1008 jd1008 at gmail.com
Mon Apr 27 00:35:35 UTC 2015



On 04/26/2015 06:02 PM, Mehmet Erol Sanliturk wrote:
> On Sun, Apr 26, 2015 at 3:15 PM, Valeri Galtsev <galtsev at kicp.uchicago.edu>
> wrote:
>
>> On Sun, April 26, 2015 4:05 pm, Fernando Apesteguía wrote:
>>> On Sun, Apr 26, 2015 at 10:05 PM, Valeri Galtsev
>>> <galtsev at kicp.uchicago.edu> wrote:
>>>> On Sun, April 26, 2015 12:11 pm, Fernando Apesteguía wrote:
>>>>> Hi,
>>>>>
>>>>> I suspect my old and beloved AMD64 laptop is suffering from bad memory
>>>>> problems: I get random crashes of well tested programs like sh, which,
>>>>> etc even when I executed some of them from /rescue.
>>>> If RAM is a suspect, the first thing I would do is re-seat the memory
>>>> modules. Open the box. (Observe static precautions!) Remove the memory
>>>> modules. Install them again.
>>>>
>>>> Do memtest86 (by booting into memtest86, you can have that in your boot
>>>> options, or you can boot off external media as others suggested).
>>>>
>>>> If you still have problems, try running with one memory module instead
>>>> of two. At some point, when they went to higher RAM speeds, the memory
>>>> bus amplifier became more fragile (some chips, some manufacturers; as
>>>> it is now part of the CPU, this may be true only of some CPU models).
>>>> You can sometimes slightly fry it merely by leaving the laptop running
>>>> on battery until the battery runs down and the laptop powers off. With
>>>> some chips this slightly fries the memory controller portion, the
>>>> address bus amplifier in particular. The bus amplifier's usable
>>>> frequency drops slightly, so it handles capacitive load (which is
>>>> larger if you have more RAM) more poorly; it remains marginally OK but
>>>> occasionally produces address errors. Going to one module may resolve
>>>> this. You will know this is likely the case if memtest86 succeeds with
>>>> each single RAM module but fails (in random places, often not
>>>> reproducibly) with both.
>>>>
>>>> Good luck!
>>> I booted from a memtest CD-ROM. It passed a couple of tests fine and
>>> then it rebooted while doing a "bit fade" test at around 93%. Removing
>>> the modules is tricky since this laptop has screws all around in dark
>>> corners (even removing the battery needs a screwdriver). I will try
>>> to limit physical memory with hw.physmem and see if it makes any
>>> difference.
>> The latter will not help against what I mentioned, as the capacitive load
>> on the memory address bus is defined by what is physically attached to it.
>>
>> One usually runs memtest86 for at least 24 hours. One loop will catch
>> "solid defects" like adjacent lines on the board being connected (when
>> they shouldn't be). Memory-related failures, on the contrary, are often
>> intermittent. In the worst case I've seen, they only manifested under
>> intense load of the box (whereas memtest86 is equivalent to almost zero
>> load).
>>
>> Good luck!
>>
>> Valeri
>>
>> ++++++++++++++++++++++++++++++++++++++++
>> Valeri Galtsev
>> Sr System Administrator
>> Department of Astronomy and Astrophysics
>> Kavli Institute for Cosmological Physics
>> University of Chicago
>> Phone: 773-702-4247
>> ++++++++++++++++++++++++++++++++++++++++
>>
>
>
> The failure may be in the memory management circuits instead of the memory
> chips. To test this, the existing memory may be replaced with memory chips
> that are known to work (if that can be done).
>
>
> Thank you very much.
>
>
> Mehmet Erol Sanliturk
One slight, and perhaps remote, possibility is that the memory
is a hair slower than what the memory controller expects,
especially, as Valeri mentioned, under heavy memory load.
On systems where the CPU clocking is unlocked, one might
be able to slow down the CPU clock just slightly to see if the
problem is mitigated.
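
Since memtest86 puts almost no load on the box, a rough user-space
pattern check run while the machine is otherwise busy is one way to
look for intermittent errors. What follows is only a minimal sketch in
Python, not a substitute for memtest86: the name pattern_test and the
choice of patterns are my own assumptions, it only touches whatever
pages the OS hands the process, and CPU caches may absorb part of the
traffic.

```python
def pattern_test(size_bytes, passes=1):
    """Fill a buffer with fixed byte patterns and read each one back.

    Returns a list of (pass_index, offset, expected, actual) tuples,
    one per mismatching byte. An empty list means no mismatch was seen.
    """
    # Illustrative patterns: all zeros, all ones, alternating bits.
    patterns = (0x00, 0xFF, 0xAA, 0x55)
    errors = []
    buf = bytearray(size_bytes)
    for pass_index in range(passes):
        for pattern in patterns:
            expected = bytes([pattern]) * size_bytes
            buf[:] = expected          # write the pattern
            if buf == expected:        # read it back and compare
                continue
            # Something differed: record every mismatching byte.
            for offset, (actual, want) in enumerate(zip(buf, expected)):
                if actual != want:
                    errors.append((pass_index, offset, want, actual))
    return errors


if __name__ == "__main__":
    # 16 MiB and 2 passes are arbitrary values for a quick trial run.
    found = pattern_test(16 * 1024 * 1024, passes=2)
    print("mismatches:", len(found))
```

A clean run proves very little, since the failures discussed here are
intermittent. Any mismatch at all, though, is a strong hint that the
hardware deserves a closer look.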



More information about the freebsd-questions mailing list