Puzzle for Doug...

Mike Isely isely at enteract.com
Tue Jul 28 17:17:43 PDT 1998


On Tue, 28 Jul 1998, Robert G. Brown wrote:

> On Tue, 28 Jul 1998, Mike Isely wrote:
> 
> > Well since the aic7xxx hardware executes DMA on its own behalf, that sort
> > of memory access might look "different" enough to the hardware to expose a
> > latent race condition.  Certainly there's more memory contention going on
> > with the aic7xxx stuff in the picture. 
> 
> Good point.  I also am wondering if the high speed of the CPUs, the
> memory and the U2 controller itself combine to reveal a race
> condition.  I just really believe that the race is in the driver.

...Except that *no* amount of bad software should EVER be able to cause an
NMI.

An NMI is something generated by the motherboard when something goes
seriously wrong in the hardware.  Back in the good old days, an NMI meant
a memory parity error.  Now, since nobody uses parity anymore, I have no
idea what raises one, but the kind of fault behind it should be similar.  If
that's true, it should be nearly impossible to cause in software, even
deliberately.  That's why people here are talking memory trouble.  I'm
thinking more of a hardware race, which would explain why it hits you
shortly after the kernel wakes up the aic7xxx hardware and starts doing
things to it.  The resulting DMAs & bus contention could be adding enough
bus noise, jitter, power spikes, whatever, to push something that is already
on the edge over it.  Arbitration issues are some of the stickiest problems
when designing digital logic.

You've got a unique situation in that you have apparently identical
systems behaving differently.  That's an EXCELLENT setup for chasing this
sort of problem.  I'd almost say it's better than having a logic analyzer
present because you can compare the two machines and look for differences.
There has to be a difference; you just need to uncover it.
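
If you can capture the same kind of text dump from both boxes (PCI
configuration, BIOS settings, board and chip revisions, whatever you can get
at), a mechanical diff beats eyeballing it.  Below is a minimal sketch in
Python, not a finished tool.  The function name, the labels, and the sample
lines are all made up for illustration; how you collect the dumps is up to
you.

```python
import difflib


def diff_dumps(good_text, bad_text, good_name="good-box", bad_name="bad-box"):
    """Return a unified diff (no context lines) between two text dumps.

    good_text / bad_text are whatever text you captured from each machine.
    The names are only labels for the diff header (hypothetical defaults).
    """
    good_lines = good_text.splitlines()
    bad_lines = bad_text.splitlines()
    # n=0 drops the unchanged context so only the differing lines remain.
    return list(
        difflib.unified_diff(
            good_lines, bad_lines,
            fromfile=good_name, tofile=bad_name,
            lineterm="", n=0,
        )
    )


if __name__ == "__main__":
    # Invented sample data, purely to show the shape of the output.
    good = "cpu: PII-400\nbios: A03\nram: 128MB\n"
    bad = "cpu: PII-400\nbios: A02\nram: 128MB\n"
    for line in diff_dumps(good, bad):
        print(line)
```

Lines starting with "-" come from the working machine and lines starting
with "+" from the failing one.  An empty result only means the dumps match,
not that the hardware does.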

Something could very well be on the edge here.  Perhaps you can vary the
surrounding temperature (hair dryer, or preferably a can of instant
cold) and see if that affects the problem.  Unfortunately, if that *is*
it, I don't see any obvious solution. 


> 
> > Such memory tests never amount to more than a quickie existence check.
> > "Leaky" DRAM cells (if such a thing could happen) can't be picked up
> > for example because it would take many many microseconds for the bit(s) to
> > go bad.  BIOS memory scans run way too fast for that.
> 
> Again, if it were "raw" bad DRAM, the system simply wouldn't work
> regardless of the presence/absence of the aic7xxx driver.  Something
> else would be using the critical memory during boot and fail.  I like
> your DMA/race/contention hypothesis below much better.
> 
> > 
> > > 
> > > The only way that I could see the problem being bad memory is if the
> > > SDRAM they put in the systems is somehow marginal and occasionally
> > > fails but ONLY IN A WAY THE AIC7XXX DRIVER TWEAKS!  And only on the
> > 
> > Without any DMA devices active in the system, the memory activity is going
> > to be limited to whatever the CPU causes.  Is there any known DMA going on
> > without the aic7xxx running?  With multiple independent (fast) devices
> > initiating memory access, all sorts of contention issues can arise.  Of
> > course, this is supposed to work, but without the aic7xxx stuff active you
> > might not be beating on it hard enough to cause the trouble.  Remember the
> > RZ1000 IDE problem a few years back?
> 
> Yeah, this occurred to me -- I have an eepro100 in the system and
> there is indeed network traffic, especially during diskless boots.
> It's harder to see this as a problem in NON-diskless boots, though.
> Also, the network device is formally probed and initialized only AFTER
> the scsi device.  Finally, I unplugged the cable during a boot or two
> so that it wasn't actually receiving packets during boot.  No effect.
> Still, a definite possibility.
> 
> > 
> > Just fishing for ideas for ya.  I think a game of musical hardware is
> > definitely the next step here.  But even that may not give conclusive
> > results if something in Dell's configuration is "right on the edge". 
> 
> And I appreciate it!  But *moan*...
> 
>     rgb
> 
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> 
> 
> 
> 

                        |         Mike Isely          |     PGP fingerprint
    POSITIVELY NO       |                             | 03 54 43 4D 75 E5 CC 92
 UNSOLICITED JUNK MAIL! |   isely @ pobox (dot) com   | 71 16 01 E2 B5 F5 C1 E8
                        |   (spam-foiling  address)   |



