Puzzle for Doug...

Robert G. Brown rgb at phy.duke.edu
Wed Jul 29 08:43:54 PDT 1998


On Tue, 28 Jul 1998, Mike Isely wrote:

> ...Except that *no* amount of bad software should EVER be able to cause an
> NMI.
> 
> An NMI is something generated by the motherboard when something goes
> seriously wrong in the hardware.  Back in the good old days, an NMI meant
> a memory parity error.  Now, since nobody uses parity anymore, I have no
> idea, but the type of source for such trouble should be similar.  If

Well, but what about remapping of memory?  If one remaps memory to
point to things in device space (e.g. reading/writing to nonexistent
remapped locations), might not device errors appear to the kernel to
be memory parity errors?

> that's true it should be nearly impossible to cause it in software, even
> deliberately.  That's why people here are talking memory trouble.  I'm
> thinking more like a hardware race going on, which would explain why it
> occurs to you just shortly after the kernel wakes up the aic7xxx hardware
> and starts doing things to it.  The resulting DMAs & bus contention could
> be adding enough bus noise, jitter, power spikes, whatever to push
> whatever is on the edge, over it.  Arbitration issues are some of the
> stickiest problems when designing digital logic.
> 
> You've got a really unique situation in that you have apparently identical
> systems behaving differently.  That's an EXCELLENT set up to chase this
> sort of problem.  I'd almost say it's better than having a logic analyzer
> present because you can compare the two machines and look for differences. 
> There has to be a difference, you just need to uncover it.
> 
> Something very well could be on the edge here.  Perhaps you can vary the
> surrounding temperature (hair dryer, more preferably a can of instant
> cold) and see if that affects the problem.  Unfortunately, if that *is*
> it, I don't see any obvious solution. 

I have the systems I'm working with on a three layer shelf in an air
conditioned computer room.  Ambient temperature is low, maybe 68F.
The systems fail when cold booted.  They fail when warm booted.  Two
systems next to one another can both work, or both fail.  On the top
shelf some systems work and some fail; currently the systems on the
middle shelf work and the systems on the bottom shelf fail.  All
systems are just exactly the way Dell delivered them, except that
I've disabled the SCSI BIOS.  They fail at exactly the same
instruction on all the systems.  Today I'm going to figure out what
that instruction is if I do nothing else.  Well, I'm gonna TRY to
figure out what that instruction is...

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu



