Puzzle for Doug...

Robert G. Brown rgb at phy.duke.edu
Tue Jul 28 12:13:47 PDT 1998


On Tue, 28 Jul 1998, Jess Johnson wrote:

> According to Robert G. Brown:
> > Well, I saw the NMI error pop up on ANOTHER of the five systems
> > overnight, although this one recovered.  I have to say that I
> > seriously doubt that 3/5 of Dell's delivered systems have bad memory,
> > especially given that I've run these systems diskless for around 3
> > weeks now "flawlessly" under heavy load of big-memory applications.  A
> > memory problem with any significant probability of occurring (which
> 
> I didn't know you had already been running them. Doesn't sound like memory to me in
> this case. My comment about the bios test was a general "all generic intel pc's" 
> inclusive statement. Some with custom bioses do better tests.

And I agree.  And I actually have had one box fail its memory test one
time -- never before and never again, though.  But read on....

> 
> 
> > clearly must be the case, given that it happens at boot time in low
> > memory) would almost certainly have created havoc -- repeated kernel
> > crashes, bad answers, segment violation errors as loop/jump addresses
> > were corrupted -- none of which have been observed.  The phenomena
> > thus far seems confined to the aic7xxx driver only and moreso to the
> > 7890 device -- I ran the old aic7xxx driver in diskless kernels for a
> > week or so (the one that found the 7860 but not the 7890) and observed
> > none of this.
> 
> Sounds to me like you have covered all the bases so far. No arguments from me here.
> I'm not a strong programmer so I can't do much beyond hardware errors.

A new and interesting data point.  I booted up one of the systems and
it worked perfectly.  I ran fdisk on the disk, made e2fs filesystems,
mounted them, and installed onto them.  I then powered down to change
where it was plugged in and -- on that reboot and every boot since --
it exhibits the error.

I'm suspicious of two things.  First, these systems are in a room with
LOTS of boxes plugged into a few circuits.  If those circuits are
browning out, the systems might have just enough juice to boot but not
quite enough to consistently handle the inductive load that occurs when
the disk/controllers reset.  Second, it consistently fails when
probing the disk on the U2W controller.  This makes me suspect a
timing problem on reset.

My tests of the power hypothesis have not been promising -- the problem
doesn't go away when I isolate a system on a presumably unloaded
outlet.

Also, I've tested a kernel made with the aic7xxx reset delay increased
to 15 seconds.  It failed "worse/differently" than one with a five
second reset delay -- instead of generating an Aieee message or two and
dying, it loops on:

   Data-Path Ram Parity Error
   PCI Error Detected
(scsi1) SEQADDR = 0x1
(scsi1) BRKADDRINT error 0x50
...  (repeat forever)

Since 15 seconds made things worse, maybe 4 seconds will be better.
Trying a four second reset delay I get:

   Data-Path Ram Parity Error
   PCI Error Detected
(scsi1) SEQADDR = 0x1
(scsi1:0:-1:-1) Yikes!! There is a loop in the free list!
(scsi1) BRKADDRINT error 0x50
...  (repeat forever)

Going back to 5 seconds I continue to get this loop.  Even the method
of failure seems to change between boots.  The original failure is:

(...gets CD-ROM sr0 on scsi0, the 7860 and then...)
(scsi1) BRKADDRINT error 0x10
   Data-Path Ram Parity Error
(scsi1) SEQADDR = 0x1ff
Uhhuh. NMI received.  Dazed and confused....
  (followed by a bunch of stuff up to an Aieee followed by several
idle task cannot sleep messages followed by a scsi aborting command
due to timeout....Inquiry... followed by (AT LONG LAST) the SUCCESSFUL
WD probe returning sda, followed by two scsi aborting Unit Ready
commands followed by a second GPF/Aieee followed by several attempts
to reset the bus and... system death).
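
Since the failure mode changes from boot to boot, it may help to reduce
each captured console log to a small "signature" so boots can be
compared at a glance.  What follows is only a rough sketch, assuming the
console output has been captured as text (e.g. over a serial console).
The function name signature() and the flag names are made up for
illustration; the message strings are the ones quoted above.

```python
import re

# Patterns for the aic7xxx console messages quoted in this post.
BRKADDR = re.compile(r"\((scsi\d+)\) BRKADDRINT error (0x[0-9a-fA-F]+)")
SEQADDR = re.compile(r"\((scsi\d+)\) SEQADDR = (0x[0-9a-fA-F]+)")

# Hypothetical short names for the other messages seen so far.
FLAGS = {
    "parity": "Data-Path Ram Parity Error",
    "pci": "PCI Error Detected",
    "nmi": "NMI received",
    "free_list_loop": "loop in the free list",
}

def signature(log_text):
    """Summarize one boot log as (BRKADDRINT codes, SEQADDR values, flags)."""
    brk = tuple(sorted(set(BRKADDR.findall(log_text))))
    seq = tuple(sorted(set(SEQADDR.findall(log_text))))
    flags = tuple(sorted(n for n, s in FLAGS.items() if s in log_text))
    return (brk, seq, flags)

# Two of the failures quoted above, reduced to their distinctive lines.
boot_15s = """   Data-Path Ram Parity Error
   PCI Error Detected
(scsi1) SEQADDR = 0x1
(scsi1) BRKADDRINT error 0x50
"""
boot_original = """(scsi1) BRKADDRINT error 0x10
   Data-Path Ram Parity Error
(scsi1) SEQADDR = 0x1ff
Uhhuh. NMI received.  Dazed and confused....
"""

sig_15s = signature(boot_15s)
sig_original = signature(boot_original)
```

Two boots with equal signatures failed "the same way" by this crude
measure; here sig_15s and sig_original differ in the BRKADDRINT code,
the SEQADDR value, and which of the PCI/NMI messages appear.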

BTW, my failure statistics have gone way up.  Currently I have five
systems running "perfectly".  I have one that ran once and now fails.
I have a pile of four that have failed or are failing, plus the one
that now fails.  If you like, roughly 50% of these identical systems
are failing at least intermittently.  I may try to build a kernel
(again) with the aic7xxx driver as a module (not in the kernel) to see
if the problem occurs if the driver is installed "post diskless boot".
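
To put a number on the tally above, here is a back-of-the-envelope
sketch.  The counts come from this message; the 5% per-box hardware
defect rate is purely a hypothetical figure picked for illustration
(not anything Dell has published), and the calculation assumes the
boxes fail independently.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Counts from the message.
working = 5          # running "perfectly"
failing = 4 + 1      # the pile of four, plus the one that ran once
total = working + failing
observed_rate = failing / total

# HYPOTHETICAL per-box hardware defect rate, for illustration only.
assumed_defect_rate = 0.05

# Chance of seeing at least this many bad boxes from hardware alone.
p_hardware_only = prob_at_least(failing, total, assumed_defect_rate)

print(f"observed failure rate: {observed_rate:.0%}")
print(f"P(>= {failing} of {total} bad at {assumed_defect_rate:.0%}): "
      f"{p_hardware_only:.1e}")
```

Under those assumptions the chance of half the boxes being bad purely
by hardware luck is well under one in a thousand, which is why a
common cause (driver, BIOS, or environment) looks like the better bet.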

> I really like dell machines, particularly their server machines. I have 2 poweredge
> 2100's and 4 2200's and love them. Anytime someone wants to buy a machine and 
> doesn't want to build it themselves I recommend that they purchase a dell. 

As I said, I'm pretty doubtful that Dell's hardware failure rate is
50%, especially on a project with high visibility.  I'm betting on a
significant but intermittent driver bug.  This one is going to be very
hard to debug, though; hopefully some of the stuff Doug was getting
ready to do to overcome the BIOS problem will help here too.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu



