Many SCSI errors

Petriz, Pablo ppetriz at siscat.com.ar
Thu Jul 8 10:00:00 PDT 2004


That 3rd disk was replaced and is now running fine in another box.
But that problem was ultimately solved with an entirely new Promise
tower, now running firmware 1.1.0.30.

Since your last mail I've been testing the tower in many ways.
I connected the tower to an old PC running WebPAM to log any
errors, but it doesn't detect this transmission error. On most
hangs the tower seems to keep working fine and we only have to
reboot the host; only once did we have to reboot both to
reconnect.

The tests:

1) We tried lowering the speed of the SCSI bus (from 320 down to
   20!) and then ran badblocks against the entire 6-disk RAID5,
   but it fails and generates the same "Transmission error detected".

2) We configured the tower as JBOD and then ran badblocks against
   every single HD (from 1 to 6). It runs fine on HDs 1, 2, 3 and 4,
   then generates the transmission error on HD5. We have to reboot
   both the host and the tower. Run it again and it's OK, and HD6 is
   OK too (though still hanging with the same messages). So the
   conclusion is that the HDs are fine; the problem is in the
   transmission path (cable, SCSI cards, terminators) or in software:
   the SCSI driver or the tower firmware.

3) We changed the SCSI cable and tested again with badblocks, but
   the same "transmission error detected" happens and it drops out.

4) Let's try something more radical. You know my hardware is an
   Intel SE7501BR2 (AIC-7901 on board) with Red Hat Linux 9 (kernel
   2.4.20), using aic79xx-2.0.10-rh90.i686.rpm for the SCSI. We
   disabled the onboard SCSI and added an old Adaptec AHA-2049U.
   Kudzu detects it and loads the aic7xxx module. We can see the
   tower, so we tested the 6 HDs again with badblocks and everything
   works fine!!! (but it's a little sloooooow: that card has a max
   10 MB/s rate). What is the conclusion now? Maybe it's the onboard
   SCSI, maybe it's that the aic7xxx module works differently than
   the aic79xx module, maybe it will fail in the next minute. I
   don't know, but I don't want this as "the solution".
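For reference, the per-disk pass in test 2 can be sketched roughly like
this. The device names /dev/sdb../dev/sdg are my assumption, not what was
actually on that box; check dmesg or /proc/scsi/scsi to see which names
the JBOD disks really got.

```shell
# Read-only badblocks pass over each JBOD disk in turn, logging results.
# ASSUMPTION: the six tower disks appear as /dev/sdb through /dev/sdg.
for dev in /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg; do
    echo "=== $dev ===" >> badblocks.log
    # -s shows progress, -v reports each bad block found; the default
    # mode is a non-destructive read-only test, so no data is touched
    badblocks -sv "$dev" >> badblocks.log 2>&1
done
```

Running it read-only means it can be repeated between reboots without
risking the data, which matters when the hang itself forces a reboot
mid-test.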

I've sent two messages to Promise support but no response yet.

I don't have many options left to test. The next things we will try are:

- Install all the firmware upgrades for the motherboard and test again.

- Try to run the onboard AIC-7901 SCSI with the aic7xxx module
  (is that somehow possible?).

- Install an old card that uses the aic7xxx module but is faster
  than the 2049. (I don't like this, but...)

- Finally: install some Windows OS on the host and test. (This
  is my laaaaast chance, and I hope I can find a Linux solution
  to this problem.)
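About the second item: as far as I know the AIC-7901 is an Ultra320 chip
that only the aic79xx driver claims, so forcing aic7xxx onto it probably
isn't possible. But it is easy to check which module actually bound the
card after each swap; a rough sketch (assuming the usual RH9-era tools
are installed):

```shell
# Show which Adaptec controllers the kernel sees and which driver
# claimed them, before re-running the badblocks tests.
lspci | grep -i adaptec            # Adaptec controllers on the PCI bus
lsmod | grep -E 'aic7(9xx|xxx)'    # is aic7xxx or aic79xx loaded?
cat /proc/scsi/scsi                # devices visible through the SCSI layer
```

Capturing this output before and after a card swap would at least prove
which driver was in play for each failure.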

I'm starting to think the problem is not the tower or the firmware
itself, but the way the Linux driver "talks" to the Promise firmware.
This is only a conclusion from my tests; I'm not a driver programmer
and I can't go and read the code... but on the other hand you are
having weird symptoms like the one you told me about with the HD
removal... I feel lost in the fog.

Hope this helps you (and me).


PABLO



> -----Original Message-----
> From: Todd Denniston [mailto:Todd.Denniston at ssa.crane.navy.mil]
> Sent: Thursday, July 8, 2004 11:57
> To: Petriz, Pablo
> Subject: Re: Many SCSI errors
> 
> 
> "Petriz, Pablo" wrote:
> > 
> > Hello Todd
> > 
> > > From: Todd Denniston
> > >
> <SNIP>
> > You have 2 RM8000s and, if I've understood correctly, one works
> > fine and the other doesn't. What's the difference between the two?
> > 
> > This tower is driving me nuts. We bought it in December and it
> > worked fine for 1 or 2 weeks until we turned it off; then it began
> > to rebuild the 3rd disk. We changed it, rebuilt the new one,
> > everything seemed to be OK, but turn it off, turn it on, and it
> > rebuilds again.
> <SNIP>
> Do you still have the disk (that was 3rd at the time)?
> Is that disk still sitting physically in the Promise array?
> 
> The reason I ask is, 12 days ago I removed from the array a drive
> which I know to be bad [1].  I know it should not have made any
> difference though, because the drive was only physically in the
> array; it was not locked in, so there should not have been power or
> communications to it.  Since I removed it, I put the system in a
> configuration where before it would last ~16 hours max before lock
> up, and yet it has been running for 12 days.  The only change is the
> physical removal of the bad drive!
> 
> It would both thrill me and make me mad to find out that a drive
> just sitting in the array with no power could cause these problems!
> 
> 
> [1] at least from the perspective of the badblocks program.
> 
> -- 
> Todd Denniston
> Crane Division, Naval Surface Warfare Center (NSWC Crane) 
> Harnessing the Power of Technology for the Warfighter
> 

