2940UW problems.

Mon Jun 21 05:26:52 PDT 1999

James O'Kane wrote:
> 
> I hope this list is for linux and freebsd and that I'm not asking the
> wrong people. CC me on any related reply as I'm not currently subscribed.
> 
> It seems that we have been having trouble with a 2940UW card with linux,
> and it is our best guess that it is being sent too many requests and one
> is being dropped by someone, the drivers get confused and reset the bus
> and start over again. The errors that we get are similar to the following,
> but we would have these errors on 3 different 2940 cards, 3 different
> machines, 3 different classes of machines, PPro, PII, P200.
> 
> ---error----
> Jun 16 18:07:08 server9 kernel: (scsi1:0:2:-1) Unexpected busfree, LASTPHASE = 0xa0, SEQADDR = 0x155

This is usually because the device saw too many parity errors and decided to
disconnect by simply dropping the bus.

> Jun 16 18:07:08 server9 kernel: (scsi1:0:2:0) No active SCB for reconnecting target - Issuing BUS DEVICE RESET.
> Jun 16 18:07:08 server9 kernel: (scsi1:0:2:0)       SAVED_TCL=0x20, ARG_1=0xff, SEQADDR=0xfc

This is because we saw the busfree and blew the command away in the driver.
So, when the device tries to reconnect and start the command up again, we
can't find it, and we reset the device.  Currently my driver doesn't
distinguish the conditions under which a busfree is supposed to mean a simple
disconnect without state save from those where a busfree is supposed to mean
"blow this command away", so at the moment I blow them all away.  That is a
safe thing to do: if the device didn't blow the command away itself, we wait
for the device to attempt to reconnect and then do the bus device reset, just
like you see here.
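
As a rough illustration of that policy, here is a toy model in Python.  It is
only a sketch of the behavior described above, not aic7xxx code: the class
ToyDriver, its method names, and the event strings are all made up for the
example.

```python
class ToyDriver:
    """Toy model of the busfree policy described above (hypothetical names)."""

    def __init__(self):
        self.active = {}   # (target, tag) -> command description
        self.events = []   # decisions the toy driver made, in order

    def queue(self, target, tag, command):
        # Remember a command that has been sent to a target.
        self.active[(target, tag)] = command

    def unexpected_busfree(self, target):
        # Conservative policy: forget every command outstanding on the target.
        for key in [k for k in self.active if k[0] == target]:
            del self.active[key]
        self.events.append(("busfree", target))

    def reconnect(self, target, tag):
        if (target, tag) in self.active:
            self.events.append(("resume", target))
        else:
            # Nothing on record for this reconnecting target: reset it.
            self.events.append(("bus_device_reset", target))


driver = ToyDriver()
driver.queue(2, 0, "read")
driver.unexpected_busfree(2)   # the command is discarded here
driver.reconnect(2, 0)         # the device still thinks it is live
```

After this sequence the toy driver has recorded a busfree followed by a bus
device reset for target 2, which is the same order of events as in the log.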

> Jun 16 18:07:12 server9 kernel: (scsi1:0:2:0) Synchronous at 40.0 Mbyte/sec, offset 8.
> Jun 16 18:07:12 server9 kernel: (scsi1:0:4:0) Synchronous at 40.0 Mbyte/sec, offset 8.
> Jun 16 18:07:12 server9 kernel: (scsi1:0:3:0) Synchronous at 40.0 Mbyte/sec, offset 8.
> Jun 16 18:07:13 server9 kernel: (scsi1:0:5:0) Synchronous at 40.0 Mbyte/sec, offset 8.

So the device ends up renegotiating.  It actually looks like the bus reset pin
got pulled since all the devices are renegotiating, but that wasn't in the
logs, so I'm inclined to think there are some logs missing...

> ---end error---
> 
> There are also often SCSI aborts and timeouts while resetting the
> bus, but those didn't make it into the error logs. In doing some searches
> on www.deja.com I noticed that some people suggested putting no_reset in
> the lilo config, but that doesn't seem to be a solution the way it's
> presented, it only seems to hide the symptoms.
> 
> While looking at /proc/scsi/aic7xxx/1 we noticed that it contained this
> line:                    SCBs: Active 0, Max Active 4,
> 
> I don't claim to totally understand SCB's, but our theory is that our
> software (raid 5) is producing more SCB's than the drivers are setup for
> and one is getting lost by someone. Is there a reason that the max active
> is 4? This chain has 5 drives on it. Is it safe to increase the SCB max
> value or are we guessing wrong at the problem?

The Max Active SCBs value is not a limit, it's an actual count: the most SCBs
that the system has had active at one time.  This would imply that you don't
have tagged command queueing enabled on your drives, or else this number would
be much higher.  Anyway, that's not the problem.  In your case it sounds like
the bus is underterminated (i.e., you have termination enabled at one end of
the bus or the other, but not at both ends like it should be).  That will
commonly cause the parity problems that result in the busfree conditions you
are seeing.
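
To make the "count, not a limit" point concrete, here is a small Python
sketch of a high-water-mark counter.  It is an illustration only: ScbStats
and its methods are hypothetical names, not anything from the driver.

```python
class ScbStats:
    """Tracks commands in flight and the highest count seen (hypothetical)."""

    def __init__(self):
        self.active = 0       # commands in flight right now
        self.max_active = 0   # highest value 'active' has ever reached

    def start(self):
        self.active += 1
        # Record a new high-water mark; nothing is refused or capped here.
        self.max_active = max(self.max_active, self.active)

    def finish(self):
        self.active -= 1


stats = ScbStats()
for _ in range(4):
    stats.start()    # four commands outstanding at once
for _ in range(4):
    stats.finish()   # all of them complete
```

At the end, active is 0 and max_active is 4, the same shape as the
"Active 0, Max Active 4" line: the 4 only records what has been seen so far.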

-- 
  Doug Ledford   <dledford at redhat.com>
   Opinions expressed are my own, but
      they should be everybody's.
