Puzzle for Doug...

Robert G. Brown rgb at phy.duke.edu
Tue Jul 28 08:09:57 PDT 1998


On Tue, 28 Jul 1998, Mike Isely wrote:

> On Mon, 27 Jul 1998, Chris Pirih wrote:
> 
> > At 06:08 PM 07/27/1998 -0400, Robert G. Brown wrote:
> > >Two systems ... gave the NMI dazed and confused error
> > >reported by a couple of folks to the list.  I saw a couple of passes
> > >of trying harder, some mindless dumps of registers, and finally the
> > >system hung.  It complains of "maybe power management is on in the
> > >BIOS or bad RAM".  I don't have the former; the latter is checked
> > >twice during boot.  Of course, it could still be bad...
> > 
> > This is almost certainly a memory failure.  Try swapping DIMMs
> > between machines and see if the failure follows the memory or
> > the motherboard.
> 
> Actually, you could extend this technique - here you have multiple
> identical systems with differing behaviors.  Pick a working box and a
> broken box and start trading pieces between them until the problem moves
> across...
> 
> Also, have you (or is it possible on those boxes) tried to completely
> erase the motherboard CMOS and start from a known state?  That's a bit of
> state which might be different between the working & non-working machines. 

Guys, I'll go play the "swap the hardware" game (which I know well but
hate in principle:-) if you can explain why:

a) The systems in question ran flawlessly for weeks under heavy load
when booted diskless and STILL run flawlessly if booted diskless
without the new aic7xxx driver (but with, for example, the old 5.0.19
driver).

b) The systems never complain during boot when their memory is tested
-- twice (and gawd, I hate waiting through both of the tests, which
take a fair amount of time with 1/2 GB of RAM).

The only way that I could see the problem being bad memory is if the
SDRAM they put in the systems is somehow marginal and occasionally
fails but ONLY IN A WAY THE AIC7XXX DRIVER TWEAKS!  And only on the
7890 U2W controller.  Given that the driver is brand spanking new and
barely works (in the sense that I have to put the controller in an
"unnatural state" to get it to work at all by disabling its BIOS), the
problem stinks of software.  Thinking about it, this is the kind of
thing one might expect to see if the driver tries to write to the
disabled BIOS area via a memory map, or the like.  I have no idea what
parts of the controller are shut down by disabling its BIOS, but they
are clearly parts that the driver is trying to access or it wouldn't
"fix" the failed Inquiry problem.

Before I spend a full day playing swap-the-chip on systems where the
failure is clearly intermittent and state dependent (so it might not
even show up with the chips swapped, or might go away long enough to
boot once but then come back), it seems sensible to explore the
question of whether or not it is a latency/timing problem or (more
likely) related to the very problem that is being "fixed" by turning
off the BIOS in the first place.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu



