HELP

Fri Feb 13 13:27:04 PST 1998

Doug Ledford wrote in a message to Mike Bilow:

 DL> Whether or not it should negotiate at 5 or 10 or whatever
 DL> MHz isn't what concerns me (I don't have one of these drives
 DL> to know what it is supposed to negotiate at).  However,
 DL> there were no errors generated and it negotiated properly
 DL> (assuming that either the tape drive or the controller
 DL> limited the rate to 6.67, if the tape drive shouldn't
 DL> negotiate at this rate, then the SCSI BIOS device settings
 DL> need checked).
* * *
 DL> I didn't say they were normal behavior, and I would like to
 DL> track them down, but they normally aren't anything to worry
 DL> about as we will immediately retry the command after this
 DL> error and the sequencer won't let us corrupt kernel memory
 DL> during the overrun.  I have to assume that an overrun during
 DL> a data out phase also isn't corrupting the hard drive, or
 DL> else it would be the drive's firmware at fault (we can't
 DL> force a drive to take more than it wants, if it thinks that
 DL> it has completed the transfer and we don't, then you get an
 DL> underrun instead of an overrun, you should only get an
 DL> overrun when the drive thinks it isn't done while we think
 DL> it is, in those cases bogus data may get written to a
 DL> portion of the hard drive, and then because of the retry, it
 DL> immediately gets re-written with the correct data).  

I won't quote back too much of your message, but I think the above is enough to
keep us on track.

My basic thesis is that the tape drive motors are pulling down the +12 VDC line
and causing the drive's fail-safe sensors to cut in.  There are likely
inductive spikes floating all over the place as a result, and some of them
could be reaching the bus.  If this happens, it will appear as if the SCSI bus
just goes away intermittently, affecting all of the connected peripherals
including the hard drive.  In a well designed system, the hard drive will
behave as if it is repeatedly recovering from a "loose cable" problem, and you
will see data overruns, underruns, aborted commands, and even bus resets.

 DL> If the tape drive wedged itself, it wouldn't matter.  The
 DL> above error messages indicate the hard disk.  There is no
 DL> indication of a full bus reset ever being issued, so I can
 DL> only assume that the condition was corrected via the use of
 DL> the abort call.  In that case, the worst that would happen
 DL> would be a bus device reset of the hard disk and nothing to
 DL> the tape drive.  
* * *
 DL> That error message isn't about the tape drive, it's about
 DL> the hard disk.  The condition of the tape drive is
 DL> irrelevant in this situation since the particular error is a
 DL> reconnection from the hard drive, which couldn't happen if
 DL> the bus was wedged, and then a failure to find the command
 DL> associated with that reconnection.

That's my error: I misread the PUN.  In any case, the point is that the bus is
going away for brief intervals and then coming back, and this is causing
aborted commands after timeouts.  The fact that the condition never reached a
bus reset, as you point out, is an indication that burst noise is causing
interference on the bus, and I strongly suspect that the source of that burst
noise is motor spikes from the tape drive.

 DL> Try every Seagate hard drive in existence uses 12V power. So do 
 DL> HP drives, and Quantum hard drives.  To my knowledge, I have 
 DL> yet to find a single SCSI hard drive that doesn't use +12V for 
 DL> the spindle motor.

Many of the lower RPM drives use +5 VDC for even the spindle these days.  The
reason is that it results in much lower power dissipation at the motor speed
controller.  You still need +12 VDC for the higher RPM drives, and the more
expensive drives still do this because it is more reliable.  The trade-off is
that you get greater range of control with higher voltage, but the power
dissipation varies as the square of the voltage drop through the regulator, and
the resulting thermal penalty is very severe.

 DL> If the Archive drives are more sensitive to the +12V level
 DL> than most hard drives, then this indeed could be the case. 
 DL> Most hard drives will spin down their spindles at around
 DL> 11.1 to 11.2 volts.  The rated tolerance for voltage on
 DL> those drives is typically between 11.4 - 12.6V.  So, they
 DL> have a little better tolerance than the rated +-5%.  If the
 DL> archive is just as tolerant of voltage specs as the hard
 DL> drives, then I would seriously doubt this as a possibility. 
 DL> I still would recommend the first course of action being to
 DL> try a new tape (I said tape drive in my original email,
 DL> which was an ooops, I meant to just try a new tape in the
 DL> same tape drive, possibly even one that you have taken a
 DL> bulk eraser to recently just to make sure it's clean).

Well, yes, there is certainly the possibility that the tape jam sensors are
being set off by a real tape jam!  If we failed to make that clear, we should
have.  I just assumed that this happened in more than an isolated case.

The voltage regulation is not a matter of pure tolerance.  The power supply has
some given amount of current capacity.  If this is exceeded, then either the
power supply will completely shut down (in the case of a more expensive power
supply) or the voltage will go below tolerance.

Generally, hard drives start turning when powered up and keep turning until
powered down.  Although it can take as much as 4 A on the +12 VDC line to start
one of the 5.25-inch FH hard drives, even they will pull at most a third of
that once started.  Tape drives, on the other hand, have a whole bunch of
motors used for positioning which start and stop frequently, and these will
demand high current in order to start.
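
As a rough illustration of the current budget described above, here is a
minimal sketch.  The 4 A start figure and the one-third running figure come
from the paragraph above; the tape motor start current and the supply's +12 VDC
capacity are hypothetical numbers chosen only to show the arithmetic, not
measurements of any real drive or supply.

```python
# Sketch of a +12 VDC current budget.  Not a model of any specific hardware.

def rail_load(steady_amps, surge_amps):
    """Total +12 V draw when the steady loads run and the surge loads start together."""
    return sum(steady_amps) + sum(surge_amps)

def within_capacity(load_amps, capacity_amps):
    """True if the supply can deliver the requested current."""
    return load_amps <= capacity_amps

HDD_START_A = 4.0              # from the text: up to 4 A to start a 5.25-inch FH drive
HDD_RUN_A = HDD_START_A / 3    # from the text: at most a third of that once started
TAPE_MOTOR_START_A = 2.5       # ASSUMPTION: illustrative tape motor start surge
SUPPLY_12V_CAPACITY_A = 5.0    # ASSUMPTION: illustrative supply rating on +12 VDC

# Two hard drives already spinning, tape idle.
running = rail_load([HDD_RUN_A, HDD_RUN_A], [])

# Same two drives spinning while a tape positioning motor starts.
tape_start = rail_load([HDD_RUN_A, HDD_RUN_A], [TAPE_MOTOR_START_A])

for label, load in (("drives running", running), ("tape motor starting", tape_start)):
    status = "ok" if within_capacity(load, SUPPLY_12V_CAPACITY_A) else "OVER CAPACITY"
    print(f"{label}: {load:.2f} A of {SUPPLY_12V_CAPACITY_A:.2f} A ({status})")
```

With these assumed numbers the supply is comfortable while the drives merely
spin, and goes over budget only during the brief moments a tape motor starts,
which is the intermittent pattern I am describing.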

To give an example, I had an HP Netserver -- certainly no cheap machine --
configured with two 5.25-inch HH Seagate hard drives and an Archive Python, and
the machine would consistently shut down during the power-up sequence.  With
only two hard drives and one tape drive, the tape drive had to be reconfigured
(by disabling its power-on self-check) in order to allow the machine to boot at
all.  With a higher quality power supply, as in the HP Netserver, it was
possible to be confident that the voltage would be stable once started as long
as the machine did not actually shut down.  With a lower quality power supply,
the voltage would have just swung out of tolerance.

When the Archive Python detects excessive current flow in its positioning
motors, it will instantly snap off motor power.  This instantaneous snap will
cause an inductive spike that can backfeed to almost anywhere, including the
SCSI bus.  There is no safe alternative, since at this point the goal is to
make sure the motor does not get hot enough to melt something.
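
To put a rough scale on the spike, the standard inductor relation V = L * di/dt
can be sketched as below.  The inductance, current, and cutoff time are
hypothetical values picked for illustration; I do not have the actual figures
for the Python's motors, and a real circuit with clamp diodes would limit the
peak well below this unclamped estimate.

```python
# Sketch of the unclamped voltage across an inductor when its current is cut off.
# Uses V = L * di/dt with the current assumed to fall linearly to zero.

def inductive_spike_volts(inductance_h, current_a, cutoff_s):
    """Magnitude of the voltage induced when current_a collapses in cutoff_s."""
    if cutoff_s <= 0:
        raise ValueError("cutoff time must be positive")
    return inductance_h * current_a / cutoff_s

MOTOR_INDUCTANCE_H = 1e-3   # ASSUMPTION: 1 mH motor winding
MOTOR_CURRENT_A = 2.0       # ASSUMPTION: 2 A flowing when power is snapped off
CUTOFF_TIME_S = 10e-6       # ASSUMPTION: current collapses in 10 microseconds

spike = inductive_spike_volts(MOTOR_INDUCTANCE_H, MOTOR_CURRENT_A, CUTOFF_TIME_S)
print(f"unclamped spike estimate: {spike:.0f} V")

# A slower cutoff gives a proportionally smaller spike.
slow_spike = inductive_spike_volts(MOTOR_INDUCTANCE_H, MOTOR_CURRENT_A, 10 * CUTOFF_TIME_S)
```

The point of the sketch is only that the faster the power is snapped off, the
larger the spike, so an instantaneous protective cutoff is the worst case for
noise even though it is the best case for the motor.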

 DL>   Opinions expressed are my own, but
 DL>      they should be everybody's.

:-)

-- Mike
