Puzzle for Doug...

Mike Isely isely at enteract.com
Tue Jul 28 10:51:36 PDT 1998


On Tue, 28 Jul 1998, Robert G. Brown wrote:

> On Tue, 28 Jul 1998, Mike Isely wrote:
> 
> > On Mon, 27 Jul 1998, Chris Pirih wrote:
> > 
> > > At 06:08 PM 07/27/1998 -0400, Robert G. Brown wrote:
> > > >Two systems ... gave the NMI dazed and confused error
> > > >reported by a couple of folks to the list.  I saw a couple of passes
> > > >of trying harder, some mindless dumps of registers, and finally the
> > > >system hung.  It complains of "maybe power management is on in the
> > > >BIOS or bad RAM".  I don't have the former; the latter is checked
> > > >twice during boot.  Of course, it could still be bad...
> > > 
> > > This is almost certainly a memory failure.  Try swapping DIMMs
> > > between machines and see if the failure follows the memory or
> > > the motherboard.
> > 
> > Actually, you could extend this technique - here you have multiple
> > identical systems with differing behaviors.  Pick a working box and a
> > broken box and start trading pieces between them until the problem moves
> > across...
> > 
> > Also, have you (or is it possible on those boxes) tried to completely
> > erase the motherboard CMOS and start from a known state?  That's a bit of
> > state which might be different between the working & non-working machines. 
> 
> Guys, I'll go play the "swap the hardware" game (which I know well but
> hate in principle:-) if you can explain why:
> 
> a) The systems in question ran flawlessly for weeks under heavy load
> when booted diskless and STILL run flawlessly if booted diskless
> without the new aic7xxx driver (but with, for example, the old 5.0.19
> driver).

Well, since the aic7xxx hardware executes DMA on its own behalf, that sort
of memory access might look "different" enough to the hardware to expose a
latent race condition.  Certainly there's more memory contention going on
with the aic7xxx stuff in the picture. 

> 
> b) The systems never complain during boot when their memory is tested
> -- twice (and gawd, I hate waiting through both of the tests which
> take a fair amount of time with 1/2 GB of RAM).

Such memory tests never amount to more than a quickie existence check.
"Leaky" DRAM cells (if such a thing could happen) can't be picked up,
for example, because it would take many, many microseconds for the bit(s)
to go bad.  BIOS memory scans run way too fast for that.
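
To make the "too fast" point concrete, here is a minimal sketch of a
write / hold / read-back test, which is the kind of check a quick BIOS scan
does not perform.  The function names, the pattern set and the hold time are
all assumptions chosen for illustration.  It runs in user space, so it only
exercises whatever pages the OS hands it and goes through the CPU caches; it
shows the idea and is not a substitute for a real memory tester.

```python
import time

def retention_test(size_bytes, hold_seconds, pattern=0xAA):
    """Fill a buffer with `pattern`, wait, then count bytes that changed.

    Hypothetical helper: names and defaults are illustrative only.
    """
    buf = bytearray([pattern]) * size_bytes
    time.sleep(hold_seconds)  # give a weak cell time to drift
    expected = bytes([pattern]) * size_bytes
    if buf == expected:
        return 0
    return sum(1 for got in buf if got != pattern)

def run_patterns(size_bytes=1 << 20, hold_seconds=0.5):
    """Run the test with complementary patterns so each bit is held at 0 and 1."""
    patterns = (0x00, 0xFF, 0xAA, 0x55)
    return {p: retention_test(size_bytes, hold_seconds, p) for p in patterns}

if __name__ == "__main__":
    for pattern, bad in run_patterns().items():
        print(f"pattern 0x{pattern:02X}: {bad} mismatched bytes")
```

A quick scan is effectively the same loop with a hold time of zero, which is
why it would not notice a bit that only decays after sitting for a while.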

> 
> The only way that I could see the problem being bad memory is if the
> SDRAM they put in the systems is somehow marginal and occasionally
> fails but ONLY IN A WAY THE AIC7XXX DRIVER TWEAKS!  And only on the

Without any DMA devices active in the system, the memory activity is going
to be limited to whatever the CPU causes.  Is there any known DMA going on
without the aic7xxx running?  With multiple independent (fast) devices
initiating memory access, all sorts of contention issues can arise.  Of
course, this is supposed to work, but without the aic7xxx stuff active you
might not be beating on it hard enough to cause the trouble.  Remember the
RZ1000 IDE problem a few years back?

Just fishing for ideas for ya.  I think a game of musical hardware is
definitely the next step here.  But even that may not give conclusive
results if something in Dell's configuration is "right on the edge". 


                        |         Mike Isely          |     PGP fingerprint
    POSITIVELY NO       |                             | 03 54 43 4D 75 E5 CC 92
 UNSOLICITED JUNK MAIL! |   isely @ pobox (dot) com   | 71 16 01 E2 B5 F5 C1 E8
                        |   (spam-foiling  address)   |




