Puzzle for Doug...

Robert G. Brown rgb at phy.duke.edu
Mon Jul 27 15:12:47 PDT 1998


Dear List Persons,

An interesting puzzle.

I'm installing a pile of Dell 2300/400's with onboard aic7890 and
aic7860 controllers.  They are all identical: they have the same
ethernet card and the same video card, each in the same slot, the
same amount of SDRAM (512M), and are all just the way Dell delivered
them.

I can create absolutely identical boot/root environments for them via
a diskless root boot floppy generated from a script.  The script
installs the same kernel for each one, currently 2.0.35 SMP with the
5.1.0pre5 aic7xxx.  This is the driver that, as was discovered last
week, "works" with the 7890 provided that the SCSI BIOS is disabled on
the 7890.

My puzzle is this.  So far I have three systems that came up
"perfectly" when booted from a floppy containing the aforementioned
kernel (bzImage.dual), after the 7890 BIOS was disabled.  I have
subsequently run fdisk and a full diskless install on them, and except
for requiring a bootdisk to boot they run flawlessly.

Two systems that I attempted to install today FAILED the boot.  They
failed in identical ways -- they made it just past the point where the
devices were identified and gave the NMI dazed and confused error
reported by a couple of folks to the list.  I saw a couple of passes
of trying harder, some mindless dumps of registers, and finally the
system hung.  It complains of "maybe power management is on in the
BIOS or bad RAM".  I don't have the former; the latter is checked
twice during boot.  Of course, it could still be bad...

The interesting thing is that this appears to be stable.  Across power
cycling, BIOS tweaks (not that Dell gives you much to tweak) and
repeated reboots, these systems consistently fail where their brethren
consistently succeed under IDENTICAL conditions with IDENTICAL hardware
IDENTICALLY configured.  Gives an acute headache to those of us who
want to believe in a deterministic universe...:-)

Anyway, this sounds to my untrained ear like some sort of critical
timing issue (or of course broken hardware in 2/5 boxes).  Just
thought I'd pass the word on -- just because you don't see a reported
failure for a given hardware configuration does not mean that it won't
occur even if all things really ARE equal.  If anyone has any ideas,
I'd be happy to try them out tomorrow.  If anybody knows where to find
a critical timing thing (or some other thing that might vary on
identical boxes when they are booted) in the code, you might check
that part.


   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu






