Strange CAM errors

Willem Jan Withagen wjw at digiware.nl
Mon Dec 17 23:52:10 UTC 2012


On 17-12-2012 23:43, Jim Harris wrote:
> On Mon, Dec 17, 2012 at 3:21 PM, Willem Jan Withagen <wjw at digiware.nl> wrote:
> 
>> On 17-12-2012 23:10, Jim Harris wrote:
>>>
>>>
>>> On Mon, Dec 17, 2012 at 2:45 PM, Willem Jan Withagen <wjw at digiware.nl
>>> <mailto:wjw at digiware.nl>> wrote:
>>>
>>>     On 17-12-2012 20:16, Jim Harris wrote:
>>>     > The timeouts are occurring on inquiry commands to non-zero LUNs.
>>>     > arcmsr(4) is returning CAM_SEL_TIMEOUT instead of CAM_DEV_NOT_THERE
>>>     > for inquiry commands to this device and LUN > 0.  CAM_DEV_NOT_THERE
>>>     > is preferred to remove these types of warnings, and similar patches
>>>     > have gone in for other SCSI drivers recently.
>>>     >
>>>     > Can you try this patch?
>>>     >
>>>     > Index: sys/dev/arcmsr/arcmsr.c
>>>     > ===================================================================
>>>     > --- sys/dev/arcmsr/arcmsr.c     (revision 244190)
>>>     > +++ sys/dev/arcmsr/arcmsr.c     (working copy)
>>>     > @@ -2439,7 +2439,7 @@
>>>     >                 char *buffer=pccb->csio.data_ptr;
>>>     >
>>>     >                 if (pccb->ccb_h.target_lun) {
>>>     > -                       pccb->ccb_h.status |= CAM_SEL_TIMEOUT;
>>>     > +                       pccb->ccb_h.status |= CAM_DEV_NOT_THERE;
>>>     >                         xpt_done(pccb);
>>>     >                         return;
>>>     >                 }
>>>     >
>>>
>>>     Hi Jim,
>>>
>>>     The noise has gone down by a factor of 5, now I get:
>>>
>>>     (probe6:arcmsr0:0:16:1): INQUIRY. CDB: 12 20 0 0 24 0
>>>     (probe6:arcmsr0:0:16:1): CAM status: Unable to terminate I/O CCB request
>>>     (probe6:arcmsr0:0:16:1): Error 5, Unretryable error
>>>     (probe6:arcmsr0:0:16:2): INQUIRY. CDB: 12 40 0 0 24 0
>>>
>>>     That message is defined in sys/cam/cam.c as CAM_UA_TERMIO, but that
>>>     status is not set anywhere in the arcmsr code.
>>>
>>>
>>> There is something out of sync on your system.  I just noticed this, but
>>> your original error messages were showing "Command timeout"
>>> (CAM_CMD_TIMEOUT) even though the driver was returning CAM_SEL_TIMEOUT.
>>> Now in this case, driver is returning CAM_DEV_NOT_THERE, but CAM is
>>> printing error message for CAM_UA_TERMIO.  In both cases, driver is
>>> returning value X, but cam is interpreting it as X+1.  So CAM and
>>> arcmsr(4) seem to have a different idea of the values of the cam_status
>>> enumeration.
>>>
>>> Can you provide details on your build environment?  Are you building
>>> arcmsr as a loadable module or do you specify "device arcmsr" in your
>>> kernel config to link it statically?  I'm suspecting loadable module,
>>> although I have no idea how these values would get out of sync since
>>> this enumeration hasn't changed in probably 10+ years.
>>
>> arcmsr is built into the kernel
>>
>> [/usr/src] wjw at zfs.digiware.nl> kldstat
>> Id Refs Address            Size     Name
>>  1   28 0xffffffff80200000 b55be0   kernel
>>  2    1 0xffffffff80d56000 6138     nullfs.ko
>>  3    1 0xffffffff80d5d000 2153b0   zfs.ko
>>  4    2 0xffffffff80f73000 5e38     opensolaris.ko
>>  5    1 0xffffffff80f79000 f510     aio.ko
>>  6    1 0xffffffff80f89000 2a20     coretemp.ko
>>  7    1 0xffffffff81012000 316d4    nfscl.ko
>>  8    2 0xffffffff81044000 10827    nfscommon.ko
>>
>> And I just refetched 9.1-PRERELEASE this afternoon over svn....
>>
>> Could this have something to do with Clang <> gcc ????
>> Not that I did anything to change this.
>>
>> Note that I have changed nothing other than the kernel config file.
>>
>> Both kernel and world were built at the same time this afternoon.
>> With your patch I only rebuilt the kernel and modules.
>>
>>
> Never mind my earlier comment on out-of-sync.  It's another bug in
> arcmsr(4) - CAM_REQ_CMP == 0x1, and in the LUN > 0 case here it OR's the
> status values together, causing the off-by-one issue we were seeing.
> 
> Please try the following patch instead (reverting earlier patch):
> 
> Index: sys/dev/arcmsr/arcmsr.c
> ===================================================================
> --- sys/dev/arcmsr/arcmsr.c     (revision 244190)
> +++ sys/dev/arcmsr/arcmsr.c     (working copy)
> @@ -2432,14 +2432,13 @@
>  static void arcmsr_handle_virtual_command(struct AdapterControlBlock *acb,
>                 union ccb * pccb)
>  {
> -       pccb->ccb_h.status |= CAM_REQ_CMP;
>         switch (pccb->csio.cdb_io.cdb_bytes[0]) {
>         case INQUIRY: {
>                 unsigned char inqdata[36];
>                 char *buffer=pccb->csio.data_ptr;
> 
>                 if (pccb->ccb_h.target_lun) {
> -                       pccb->ccb_h.status |= CAM_SEL_TIMEOUT;
> +                       pccb->ccb_h.status |= CAM_DEV_NOT_THERE;
>                         xpt_done(pccb);
>                         return;
>                 }
> @@ -2455,6 +2454,7 @@
>                 strncpy(&inqdata[16], "RAID controller ", 16);  /* Product Identification */
>                 strncpy(&inqdata[32], "R001", 4); /* Product Revision */
>                 memcpy(buffer, inqdata, sizeof(inqdata));
> +               pccb->ccb_h.status |= CAM_REQ_CMP;
>                 xpt_done(pccb);
>         }
>         break;
> @@ -2464,10 +2464,12 @@
>                         pccb->ccb_h.status |= CAM_SCSI_STATUS_ERROR;
>                         pccb->csio.scsi_status = SCSI_STATUS_CHECK_COND;
>                 }
> +               pccb->ccb_h.status |= CAM_REQ_CMP;
>                 xpt_done(pccb);
>         }
>         break;
>         default:
> +               pccb->ccb_h.status |= CAM_REQ_CMP;
>                 xpt_done(pccb);
>         }
>  }

Right,

That did the trick.
Thanks for the code.

--WjW