FreeBSD and ECC memory?

Erik Trulsson ertr1013 at student.uu.se
Fri Jul 25 14:04:20 UTC 2008


On Fri, Jul 25, 2008 at 09:28:11AM -0400, Michael Powell wrote:
> Erik Trulsson wrote:
> [snip]
> > 
> > No, non-ECC RAM cannot detect or correct any errors at all. (Old
> > parity-RAM could detect, but not correct, single-bit errors.)
> 
> Actually quite true. The old parity bit functionality that was removed from
> RAM and then called "non-ECC" actually migrated to the memory controller.
> So yes, it isn't the RAM that does it. Poor choice of wording on my part.

Not quite.
Old parity-RAM usually had an extra parity bit for every 8 data bits.
By computing the parity (odd or even number of 1s) in the data bits
and comparing it with the value of the parity bit (which got set when you
wrote to memory) you could see if any single bit had been flipped.
(ECC also uses these extra bits, but uses them in a smarter way.)
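
To make the parity check concrete, here is a minimal sketch in Python.
It models one parity bit per 8 data bits, as described above. The function
names (parity_bit, store, check) are made up for illustration and do not
correspond to any real hardware or API.

```python
def parity_bit(byte):
    """Even-parity bit for 8 data bits: 1 if the count of 1s is odd."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte):
    """Model a write: keep the data byte together with its parity bit."""
    return (byte & 0xFF, parity_bit(byte))


def check(stored):
    """Model a read: True if the data still agrees with the parity bit."""
    data, parity = stored
    return parity_bit(data) == parity


word = store(0b10110010)

# One bit flipped inside the RAM: the parity no longer matches.
one_flip = (word[0] ^ 0b00000100, word[1])

# Two bits flipped: the parity matches again, so the error goes unnoticed.
two_flips = (word[0] ^ 0b00010100, word[1])

print(check(word), check(one_flip), check(two_flips))
```

Note that parity only tells you that something in those 8 bits is wrong,
not which bit, so it cannot repair anything, and an even number of flips
slips through undetected.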

Non-ECC RAM (as well as older non-parity RAM) does not have these extra bits
and therefore you cannot detect any spontaneous bit-flips inside the RAM,
since you have nothing to compare the data read against.

(The reason non-ECC RAM is more common than ECC RAM is simply that these
extra bits require extra chips on the memory module and therefore cost more
money - money which most people are not prepared to pay.)
(If you count the chips on a non-ECC memory module you will usually find
a multiple of 8, while on ECC- or parity-RAM it is usually a multiple
of 9.)
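
The "multiple of 9" works out the same for parity and for ECC if one
assumes the common arrangement of a Hamming-style code with one extra
overall parity bit (SECDED) computed over 64 data bits. That assumption is
mine, not something every implementation has to follow. A rough sketch of
the arithmetic, with hypothetical function names:

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """One more bit on top of Hamming, to also detect double-bit errors."""
    return hamming_check_bits(data_bits) + 1


# Parity RAM: 1 extra bit per 8 data bits, so 64 data bits need 8 extra bits.
parity_extra_per_64 = 64 // 8

# ECC over a 64-bit word under the SECDED assumption.
ecc_extra_per_64 = secded_check_bits(64)

print(parity_extra_per_64, ecc_extra_per_64)
```

Both come to 8 extra bits per 64 data bits, i.e. 72 bits in total, which
is why the same 9-chips-instead-of-8 layout can serve either scheme.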


Many modern memory controllers do have parity checking (or even ECC) on the
buses between controller and RAM and between controller and CPU.  This lets
you detect (or even fix) any errors that may happen as data is transferred
from RAM to CPU.  It does not let you detect random errors inside the RAM
itself, which parity-RAM or ECC RAM does let you do.


> 
> > ECC is generally capable of detecting multi-bit errors and fixing
> > single-bit errors. (There are different ways of implementing ECC. Some of
> > them might well be able to fix multi-bit errors too.)
> 
> These cost lots of money. Common on "Big Iron". In fact, non-ECC as an
> option isn't even offered on "B.I.".
>  
> [snip] 
> >> The purpose of these schemes is to compensate for the fact that in every
> >> so many (some large number) of memory transactions there may be a bit
> >> that gets flipped. If this is happening more often than (some large
> >> number) then there is a defect present. ECC just buys you "uptime" in the
> >> event there are more errors than there should be.
> > 
> > Note that random, spontaneous bit flips can happen (infrequently) even in
> > perfectly good RAM. (Due to cosmic rays, radioactive decay in surrounding
> > material, and similar stuff. (No, I am not joking.))  ECC will handle
> > such errors just fine, and that is the main reason why I would want ECC.
> 
> Especially true in satellites. The RAM in a satellite, or other spacecraft,
> must be radiation hardened to be usable at all. And yes, it is no joke but
> the truth what you say.
> 
> For me the dividing line is when lots of people depend on a box 24/7 it must
> be ECC. A storage server in someone's basement doesn't necessarily fit into
> this category.

It also depends on what kind of data is stored on the server.  One of the
really nasty problems that can occur with random bit-flips in non-ECC RAM is
that important data can get silently corrupted.  You can get an error in
your database or spreadsheet or payroll data or whatever without noticing
until it is too late (by which time all your backups will probably have this
wrong data too).  Depending on the data this can be VERY bad, even if it is
a system that is only used occasionally by a few people.

Memory errors which cause the computer to crash can be quite disruptive, but
they are at least easily noticed, and can then be handled.

>  
> > You can also get defective memory modules, but such can usually be
> > detected by running memtest86 or similar.  ECC can usually handle memory
> > modules that have some bits more or less permanently wrong, but such
> > modules should be replaced as soon as possible.
> >
> 
> I agree - I was kind of harping on the "defective" idea. If it's defective
> the manufacturer owes me a replacement, as in yesterday. 

Yes, and in the (luckily fairly uncommon) case that one of the chips on a
memory module suddenly decides to stop working, then ECC can serve the same
purpose as RAID does for disks - it allows the system to keep going until
you have time to replace the broken part. (Which should be done ASAP since
if you get random bit-flips in addition to a broken chip, ECC will not be
able to correct those bits.)
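
To illustrate that last point, here is a toy sketch of a SECDED code
(single-error-correct, double-error-detect) on just 4 data bits: an
extended Hamming(8,4) code. Real ECC memory works on much wider words and
the details differ between implementations, so treat this purely as an
illustration; the names encode and decode are made up.

```python
def encode(nibble):
    """Encode 4 data bits into 8 bits: index 0 is overall parity,
    indices 1, 2, 4 are Hamming parity bits, 3, 5, 6, 7 hold the data."""
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    for p in (1, 2, 4):
        # Parity over every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); nibble is None if the error is uncorrectable."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i          # XOR of the positions of all set bits
    overall_odd = sum(code) % 2    # 1 means an odd number of bits flipped
    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        fixed[syndrome] ^= 1       # single flip; the syndrome is its position
        status = "corrected"
    else:
        return None, "uncorrectable"   # two flips: detected, cannot be fixed
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return nibble, status


word = encode(0b1011)

broken_chip = list(word)
broken_chip[5] ^= 1                # one bit permanently wrong

plus_cosmic_ray = list(broken_chip)
plus_cosmic_ray[6] ^= 1            # a random flip on top of the bad bit

print(decode(word), decode(broken_chip), decode(plus_cosmic_ray))
```

With only the one bad bit the original value is still recovered.  With a
second flip in the same word the code can tell that something is wrong but
can no longer say what the data should have been.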


-- 
<Insert your favourite quote here.>
Erik Trulsson
ertr1013 at student.uu.se

