What's an appropriate response to "data overrun detected in Data-in phase"?

Wed Oct 1 22:20:51 PDT 2003

On Wed, Oct 01, 2003 at 15:40:39 -0700, David Wolfskill wrote:
> I am trying to set up a box to act as a "backup" server (i.e., a
> server to perform backups -- not a server to fill the breach left
> when an active server falls over, for example).
> 
> I note in passing that I have done this sort of thing previously
> (with some success); but my attempts to date in the current sequence
> have met with decidedly unsatisfactory results.  :-(
> 
> Some things that are common to the various attempts:
> 
> * There is a "combination" AIT-3 tape drive & autoloader as the only
>   devices on the SCSI bus.  (The word "combination" in quotes because
>   although it looks like a single box, it contains 2 different SCSI
>   targets -- the drive is target 6; the autoloader/robot/changer is
>   target 0.)  Well, the SCSI host adapter is also on the SCSI bus.
> 
> * The tape drive/autoloader has no options for internal termination;
>   rather, it has 2 SCSI connectors; one of these is connected to the
>   cable; the other, to a SCSI terminator.
> 
>   (Thus, bearing in mind that each 8-bit data channel on each SCSI bus
>   should be terminated precisely once at each of its 2 ends, termination
>   ought not be too complicated.  Or so I thought.)
> 
> * The tape drive/autoloader claims to be "autosensing SE/LVD".
> 
> 
> Now some variables, with the most recently-used option last:
> 
> * I have tried an UltraSPARC 5 (sparc64) running -CURRENT as of a
>   few days ago, and a similar UltraSPARC 5 running Solaris 9 (but
>   I had no user-level programs to drive the hardware; unless OpenBoot's
>   "probe-scsi-all" reported the devices, then, I had no assurance that
>   the devices were seen).  Most recently, I installed FreeBSD 4.9-RC1
>   on a dual-CPU PII-400 box (then built a slightly-customized kernel,
>   to take advantage of the other CPU and to support the use of the
>   SCSI changer device).  It is this last that I will be discussing
>   in the text below, though it is my recollection that I got similar
>   behavior from the UltraSPARC 5 running -CURRENT.
> 
> * SCSI host adaptors... I have been trying a few of these.  There was an
>   Antares P-0060 (which, though differential, appears to not be LVD).
>   The others were Adaptec:  an AHA-2940UW; a couple of SE AAA-131Bs; a
>   39160 (though the only PCI slots I have available are 32-bit ones) and
>   finally, an SE/LVD AAA-131B.
> 
> In the course of working with various SCSI cards, I've become rather
> skeptical of the "automatic termination".  I expect it's probably better
> than it was several years ago, but I'm a little more comfortable
> specifying it explicitly.  Thus, given the topology of the bus in
> question, I set the termination on the card to "on" (or, in one case
> where the options were "off" or "auto," I left them at "auto").  Note
> that the Antares card has resistor packs.
> 
> Now, I'll (finally) get to my symptoms....
> 
> frecnocpc6# chio params
> /dev/ch0: 8 slots, 0 drive, 1 picker
> /dev/ch0: current picker: 0
> frecnocpc6#

That looks wrong.  What sort of changer has no drives?

Are you sure this changer is in changer mode, and not in autoloader mode?
Sometimes there are jumpers to choose between them.  If you're in
autoloader mode, you likely won't be able to use the changer device.  (They
probably shouldn't even make the changer target selectable in that case.)

If it truly has no drives, how are you going to tell it to put the tape in
the drive?

> That works OK.  Because of the errors I get below, I recompiled the
> kernel, specifying the CAM debugging options and increasing the kernel
> buffer from 10 pages to 40; here's what pops up on the (serial) console
> when I did that:
> 
> (ch0:ahc0:0:6:0): entering cdgetccb
> (ch0:ahc0:0:6:0): xpt_schedule
> (ch0:ahc0:0:6:0): xpt_setup_ccb
> (ch0:ahc0:0:6:0): xpt_action
> (ch0:ahc0:0:6:0): . CDB: 1a 8 1d 0 20 0 
> (ch0:ahc0:0:6:0): ahc_action
> (ch0:ahc0:0:6:0): ahc_done - scb 9
> (ch0:ahc0:0:6:0): xpt_done
> (ch0:ahc0:0:6:0): camisr
> (ch0:ahc0:0:6:0): xpt_action
> (ch0:ahc0:0:6:0): . CDB: 1a 8 1f 0 20 0 
> (ch0:ahc0:0:6:0): ahc_action
> (ch0:ahc0:0:6:0): ahc_done - scb 2
> (ch0:ahc0:0:6:0): xpt_done
> (ch0:ahc0:0:6:0): camisr
> (ch0:ahc0:0:6:0): entering chioctl
> (ch0:ahc0:0:6:0): trying to do ioctl 0x40086306
> (ch0:ahc0:0:6:0): entering chioctl
> (ch0:ahc0:0:6:0): trying to do ioctl 0x40046304

That looks like a normal MODE SENSE.  One note: it looks like you may have
SCSI_NO_OP_STRINGS turned on.  You generally don't want to do that, since
it makes error messages more difficult to decode.  (The same goes for
SCSI_NO_SENSE_STRINGS.)
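
For reference, the two 6-byte CDBs in the trace above can be picked apart
by hand.  The Python sketch below is only an illustration and is not taken
from the ch(4) driver; the function name decode_mode_sense6 and the
PAGE_NAMES table are made up for this example, and the field layout assumed
is the standard MODE SENSE(6) layout (opcode 0x1a, DBD bit in byte 1, page
code in byte 2, allocation length in byte 4).

```python
def decode_mode_sense6(cdb_text):
    """Decode a MODE SENSE(6) CDB given as space-separated hex bytes."""
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    if len(cdb) != 6 or cdb[0] != 0x1A:
        raise ValueError("not a MODE SENSE(6) CDB")
    return {
        "dbd": bool(cdb[1] & 0x08),     # disable block descriptors
        "page_control": cdb[2] >> 6,    # 0 = current values
        "page_code": cdb[2] & 0x3F,
        "allocation_length": cdb[4],
    }


# Mode page names as used for medium changer devices (illustrative table).
PAGE_NAMES = {
    0x1D: "element address assignment",
    0x1E: "transport geometry parameters",
    0x1F: "device capabilities",
}

# The two CDBs that appear in the console trace.
for text in ("1a 8 1d 0 20 0", "1a 8 1f 0 20 0"):
    fields = decode_mode_sense6(text)
    name = PAGE_NAMES.get(fields["page_code"], "unknown page")
    print(text, "->", name, fields)
```

Under those assumptions, both requests ask for a changer mode page with a
32-byte buffer, which fits an ordinary parameter query.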

> Now, for an error condition:
> 
> frecnocpc6# chio status
> chio: /dev/ch0: CHIOGSTATUS: Input/output error
> frecnocpc6# 
> 
> and the corresponding console messages:
> 
> (ch0:ahc0:0:6:0): entering cdgetccb
> (ch0:ahc0:0:6:0): xpt_schedule
> (ch0:ahc0:0:6:0): xpt_setup_ccb
> (ch0:ahc0:0:6:0): xpt_action
> (ch0:ahc0:0:6:0): . CDB: 1a 8 1d 0 20 0 
> (ch0:ahc0:0:6:0): ahc_action
> (ch0:ahc0:0:6:0): ahc_done - scb 9
> (ch0:ahc0:0:6:0): xpt_done
> (ch0:ahc0:0:6:0): camisr
> (ch0:ahc0:0:6:0): xpt_action
> (ch0:ahc0:0:6:0): . CDB: 1a 8 1f 0 20 0 
> (ch0:ahc0:0:6:0): ahc_action
> (ch0:ahc0:0:6:0): ahc_done - scb 2
> (ch0:ahc0:0:6:0): xpt_done
> (ch0:ahc0:0:6:0): camisr
> (ch0:ahc0:0:6:0): entering chioctl
> (ch0:ahc0:0:6:0): trying to do ioctl 0x40086306
> (ch0:ahc0:0:6:0): entering chioctl
> (ch0:ahc0:0:6:0): trying to do ioctl 0x800c6308
> (ch0:ahc0:0:6:0): entering cdgetccb
> (ch0:ahc0:0:6:0): xpt_schedule
> (ch0:ahc0:0:6:0): xpt_setup_ccb
> (ch0:ahc0:0:6:0): xpt_action
> (ch0:ahc0:0:6:0): . CDB: b8 0 0 84 0 1 0 0 4 0 0 0 
> (ch0:ahc0:0:6:0): ahc_action
> (ch0:ahc0:0:6:0): ahc_done - scb 9
> (ch0:ahc0:0:6:0): xpt_done
> (ch0:ahc0:0:6:0): camisr
> (ch0:ahc0:0:6:0): xpt_action

It successfully reads the element status when it allocates 1K for the
buffer.

> (ch0:ahc0:0:6:0): . CDB: b8 0 0 84 0 1 0 0 0 20 0 0 
> (ch0:ahc0:0:6:0): ahc_action
> (ch0:ahc0:0:6:0): data overrun detected in Data-in phase.  Tag == 0x2.
> (ch0:ahc0:0:6:0): Have seen Data Phase.  Length = 32.  NumSGs = 1.
> sg[0] - Addr 0x01367c180 : Length 32
> (ch0:ahc0:0:6:0): ahc_done - scb 2
> (ch0:ahc0:0:6:0): xpt_done
> (ch0:ahc0:0:6:0): camisr
> (ch0:ahc0:0:6:0): xpt_action

But it fails when the driver allocates only 32 bytes for the buffer.  My
guess is that the changer is trying to send back more data than we've
asked for.
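
The 1K and 32-byte figures come from the allocation length field of the two
READ ELEMENT STATUS CDBs in the trace.  A minimal Python sketch of that
decoding follows; decode_read_element_status is a hypothetical helper
written for this example, and it assumes the usual 12-byte layout (element
type in the low nibble of byte 1, starting address in bytes 2-3, element
count in bytes 4-5, allocation length in bytes 7-9).

```python
def decode_read_element_status(cdb_text):
    """Decode a READ ELEMENT STATUS CDB given as space-separated hex bytes."""
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    if len(cdb) != 12 or cdb[0] != 0xB8:
        raise ValueError("not a READ ELEMENT STATUS CDB")
    return {
        "voltag": bool(cdb[1] & 0x10),
        "element_type": cdb[1] & 0x0F,              # 0 = all element types
        "start_address": (cdb[2] << 8) | cdb[3],
        "element_count": (cdb[4] << 8) | cdb[5],
        "allocation_length": (cdb[7] << 16) | (cdb[8] << 8) | cdb[9],
    }


# The two CDBs from the console trace: the first succeeds, the second
# triggers the data overrun message.
for text in ("b8 0 0 84 0 1 0 0 4 0 0 0", "b8 0 0 84 0 1 0 0 0 20 0 0"):
    print(text, "->", decode_read_element_status(text))
```

Both requests ask for one element starting at address 0x84; only the
allocation length differs between them.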

Can you try doing the following, and send me the result:

camcontrol cmd ch0 -v -c "b8 0 0 84 0 1 0 0 4 0 0 0" -i 1024 - > tmpfile
hd tmpfile

The ch(4) driver does calculations based on the data the changer sends back
in response to the first READ ELEMENT STATUS (the one with the 1K buffer)
and then attempts to just read the amount of data that the changer claims
it needs in the second try.

If the changer really is sending back more data than we've asked for, the
dump of what it returns in response to the 1K request should help me
figure out what's going on.
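
As a rough way to read such a dump, the Python sketch below compares the
size a driver would compute from the descriptor length against the size the
device itself claims.  It is an illustration, not the ch(4) code: the names
summarize_element_status and make_sample are hypothetical, the sample bytes
are synthetic (not from any real changer), and two things are assumed: that
the data starts with the standard 8-byte element status header followed by
an 8-byte page header, and that the second request is sized as both headers
plus one descriptor per element.

```python
import struct

HEADER_LEN = 8        # element status data header
PAGE_HEADER_LEN = 8   # element status page header


def summarize_element_status(data, requested_elements=1):
    """Summarize the headers at the start of READ ELEMENT STATUS data."""
    if len(data) < HEADER_LEN + PAGE_HEADER_LEN:
        raise ValueError("need at least 16 bytes of element status data")
    first_addr, num_elements = struct.unpack(">HH", data[0:4])
    report_bytes = int.from_bytes(data[5:8], "big")
    element_type = data[8]
    desc_len = struct.unpack(">H", data[10:12])[0]
    # Assumed driver-style sizing: both headers plus the descriptors.
    driver_size = HEADER_LEN + PAGE_HEADER_LEN + desc_len * requested_elements
    # Assumption: the reported byte count excludes the 8-byte data header.
    device_size = HEADER_LEN + report_bytes
    return {
        "first_address": first_addr,
        "elements_reported": num_elements,
        "element_type": element_type,
        "descriptor_length": desc_len,
        "driver_size": driver_size,
        "device_size": device_size,
        "overrun_likely": device_size > driver_size,
    }


def make_sample(desc_len, report_bytes):
    """Build synthetic header bytes for demonstration only."""
    header = struct.pack(">HHB", 0x84, 1, 0) + report_bytes.to_bytes(3, "big")
    page = struct.pack(">BBHB", 1, 0, desc_len, 0)
    page += (report_bytes - PAGE_HEADER_LEN).to_bytes(3, "big")
    return header + page + bytes(desc_len)


# One self-consistent sample and one where the device claims extra bytes.
for sample in (make_sample(16, 24), make_sample(16, 28)):
    print(summarize_element_status(sample))
```

To try it on the real data, pass the contents of tmpfile, for example
summarize_element_status(open("tmpfile", "rb").read()).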

> (ch0:ahc0:0:6:0): . CDB: b8 0 0 84 0 1 0 0 0 20 0 0 
> (ch0:ahc0:0:6:0): xpt_setup_ccb
> (ch0:ahc0:0:6:0): xpt_action
> (ch0:ahc0:0:6:0): ahc_action
> (ch0:ahc0:0:6:0): data overrun detected in Data-in phase.  Tag == 0x9.
> (ch0:ahc0:0:6:0): Have seen Data Phase.  Length = 32.  NumSGs = 1.
> sg[0] - Addr 0x01367c180 : Length 32
> (ch0:ahc0:0:6:0): ahc_done - scb 9
> (ch0:ahc0:0:6:0): xpt_done
> (ch0:ahc0:0:6:0): camisr
> (ch0:ahc0:0:6:0): xpt_setup_ccb
> (ch0:ahc0:0:6:0): xpt_action
> 
> 
> All of which I find rather perplexifying -- what can I do about it?

First off, see if the changer happens to be in autoloader mode (if it has
one).  The fact that it reports no drives is a potential sign of trouble.

Then send me the dump of the read element status data above (i.e., the
output of that camcontrol command).  If you're feeling ambitious, you can
try to figure out whether the ch(4) driver is getting things wrong in
chgetelemstatus() or whether the changer is just sending back more data
than it should.

Also, just for kicks, you might try doing 'chio ielem'.  Some changers
need that before they will go figure out what they have.

> (The most recent previous experience I had was with an ADIC 7-slot DLT
> autoloader & drive; no problems anything like this.  This one is made
> by "Bason", and is an 8-slot AIT-3 autoloader & drive.)
> 
> The dmesg weighs in at about 56 KB, so I'm a little reluctant to post
> it here, but I'll be happy to provide it privately (or put it up on
> my Web server someplace), if that's called for.

Feel free to email it to me directly.

> Finally, please include me in replies; I'm not subscribed to -scsi at .

Done.

Ken
-- 
Kenneth Merry
ken at kdm.org