Puzzle for Doug...

Doug Ledford dledford at dialnet.net
Tue Jul 28 15:38:45 PDT 1998


Robert G. Brown wrote:

> Well, I saw the NMI error pop up on ANOTHER of the five systems
> overnight, although this one recovered.  I have to say that I
> seriously doubt that 3/5 of Dell's delivered systems have bad memory,
> especially given that I've run these systems diskless for around 3
> weeks now "flawlessly" under heavy load of big-memory applications.  A
> memory problem with any significant probability of occurring (which
> clearly must be the case, given that it happens at boot time in low
> memory) would almost certainly have created havoc -- repeated kernel
> crashes, bad answers, segment violation errors as loop/jump addresses
> were corrupted -- none of which have been observed.  The phenomena
> thus far seems confined to the aic7xxx driver only and moreso to the
> 7890 device -- I ran the old aic7xxx driver in diskless kernels for a
> week or so (the one that found the 7860 but not the 7890) and observed
> none of this.

When you ran the older driver in these machines, was it doing anything or just
sitting there idle?  Secondly, what *speed* was it doing something at?

> It COULD be memory, and of course I'll (sigh) take down a box and see
> if I can improve things (or at least change things) by swapping memory
> out two banks at a time -- if I don't get a more promising response,
> since I really don't think that it IS memory.  You can like Dell or
> not as an "Intel/Microsoft lackey" (as a wag on the linux-smp list is
> fond of calling them) but I really think that they do sell excellent,
> if expensive, hardware.  I'd never expect a memory failure rate in the
> 10-20% range, which is what it would have to be to explain the
> phenomena, and I'd further not expect to see only MARGINAL failures
> instead of out and out won't boot the system period failures.

OK...first off, there are a few things that are true.  

1) You're getting NMI interrupts.  Unless the people at Dell hooked the
aic7xxx chipset into the NMI interrupt line by accident, the aic7xxx
hardware *can't* give this interrupt.  Only the actual PCI chipset is
typically hooked into the NMI.  Even if the aic7xxx driver is writing
garbage over your entire memory space, it would still be writing 32 bit (or
whatever) values to RAM and the system chipset would be generating the
parity/ECC code during the write.  When reading that value back from RAM, if
the parity/ECC code for a particular RAM location doesn't match the data in
that RAM location, you get an NMI.  The whole process is contained and
localized in the 440BX chipset in your case.  We could do anything we
wanted, scribble on kernel memory all day long, and do all sorts of other
nasty stuff and not be capable of causing an NMI interrupt.  Only an error
between that 440BX and the SDRAM should ever cause an NMI, or even be
capable of it, unless something reprograms the local APIC or the IO-APIC
under 2.1.x SMP.

2) Dell are Microsoft Lackeys, but that doesn't mean they are bad system
builders, just that their systems are tweaked for Microsoft products.  This
includes things like RAM timings.  The systems you bought (if I remember
correctly) are held up by Dell as super duper NT servers.  Most likely, the
machines are tweaked for NT usage.  It very well may be that the combination
of DMA loads and CPU memory loads under linux is too much for the NT
tweaked settings to sustain.  It may also be that you are getting hit by
another problem I've *heard* about under linux, but have no personal
experience with.  Namely, I've heard claims that the 440BX chipset systems
with their PC100 SDRAM(8ns) actually are not reliable under linux unless you
use PC100+ SDRAM(7ns).  It's not really called PC100+ SDRAM, but the point
is that the original PC100 SDRAM was 8ns, while for linux to be reliable I
have heard claims that 100MHz systems need 7ns SDRAM instead.  It's entirely
possible (even plausible) to think that Dell would go the cheaper route
of 8ns SDRAM when building for NT.

3) Not knowing the exact machine configurations, there could be virtual
address mapping problems due to VM size and kernel offset combinations.  You
might try going back to 256MB RAM and seeing if a problem machine all of a
sudden starts playing nice.  If so, then you'll need to tweak some kernel
headers and build a custom kernel for the 512MB RAM case (I doubt this one,
the stock defines should be good to just under 1GB RAM).

4) Yes, it's perfectly reasonable to expect that if the SDRAM is even close
to marginal with the aic7xxx driver not installed, then installing the
driver and using an Ultra2 disk is actually likely to break the RAM stuff. 
These errors can take all kinds of shapes and forms.  However, I would
suspect that the SDRAM in these machines is ECC SDRAM and that ECC error
checking is enabled in the BIOS.  If that's the case, do you even know if
the machine gets single bit errors without the driver?  It's possible that
the machine could have single bit, silently corrected errors in the diskless
mode, and start getting multi-bit errors when the driver is active. 
Personally, I doubt this, but I would also check to make sure the SDRAM is
ECC SDRAM and that the ECC code is enabled in the BIOS.  If you don't have
ECC SDRAM, then I would send it back and tell Dell to put the right SDRAM in
there.  I can't think of a valid reason not to use ECC SDRAM in PII
machines, especially after how the entire SDRAM stuff started out.  I simply
don't trust it without ECC.
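
The check-bit behaviour described in points 1 and 4 can be sketched with a toy
model.  This is only an illustration under stated assumptions: it uses a small
extended Hamming (8,4) code on 4-bit values, not whatever code the 440BX
actually applies to its memory words, and the names EccRam, encode, decode
and flip are hypothetical.  It shows three things: any write through the
"chipset" regenerates consistent check bits (so overwriting memory with
garbage cannot by itself trip the check), a single flipped stored bit is
corrected silently, and two flipped bits are detected but not correctable,
which is the case that would surface as an NMI.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (toy SECDED)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1         # makes the total number of set bits even
    return code


def decode(code):
    """Return (value, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid codeword
    overall = sum(code) & 1             # 0 when the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1             # single-bit error; syndrome 0 means bit 0 itself
        status = "corrected"
    else:
        return None, "uncorrectable"    # parity holds but syndrome is nonzero: two flips
    value = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return value, status


class EccRam:
    """Toy RAM: the 'chipset' generates check bits on write and verifies on read."""

    def __init__(self):
        self.cells = {}

    def write(self, addr, nibble):
        # Every write, whether from the CPU or a DMA master, stores fresh check bits.
        self.cells[addr] = encode(nibble)

    def read(self, addr):
        value, status = decode(self.cells[addr])
        if status == "uncorrectable":
            raise RuntimeError("NMI: uncorrectable memory error at %#x" % addr)
        return value, status

    def flip(self, addr, pos):
        # Simulates a hardware fault: a stored bit changes behind the chipset's back.
        self.cells[addr][pos] ^= 1


ram = EccRam()
ram.write(0x10, 0xA)
ram.write(0x10, 0x5)        # "scribbling" over the cell still leaves it consistent
print(ram.read(0x10))       # (5, 'ok')
ram.flip(0x10, 6)
print(ram.read(0x10))       # (5, 'corrected'): silent unless someone logs it
ram.flip(0x10, 3)           # second fault in the same cell
try:
    ram.read(0x10)
except RuntimeError as err:
    print(err)
```

In this model the corrected read returns good data with no visible failure,
which is why a machine with ECC enabled could be taking single bit errors in
diskless mode without anyone noticing.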

Anyway, there are a few things to consider when trying to track this down.
Good luck finding it; my personal bet would be first on the 7ns vs. 8ns
SDRAM and then second on ECC issues.
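
For a rough sense of what the 7ns vs. 8ns difference means on a 100MHz bus,
here is the arithmetic as a small sketch.  It assumes the ns figure is the
part's minimum rated clock cycle time, which the message above does not
state; the function names are hypothetical.

```python
def clock_period_ns(bus_mhz):
    """Length of one bus clock in nanoseconds."""
    return 1000.0 / bus_mhz


def timing_slack_ns(bus_mhz, part_rating_ns):
    """Margin between the bus clock period and the part's rated cycle time."""
    return clock_period_ns(bus_mhz) - part_rating_ns


def max_rated_mhz(part_rating_ns):
    """Highest clock the part is rated for, under the same assumption."""
    return 1000.0 / part_rating_ns


for rating in (8.0, 7.0):
    print("%.0fns part: %.1f ns slack at 100MHz, rated to about %.0f MHz"
          % (rating, timing_slack_ns(100.0, rating), max_rated_mhz(rating)))
```

Under that assumption the 8ns part has 2 ns of margin at 100MHz and the 7ns
part has 3 ns, so the faster part leaves half again as much headroom for
board and chipset timing.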

-- 

 Doug Ledford  <dledford at dialnet.net>
  Opinions expressed are my own, but
     they should be everybody's.
