terminated ioc 804b scsi 0 state c xfer 0

Tue May 31 16:17:48 UTC 2016


> -----Original Message-----
> From: owner-freebsd-scsi at freebsd.org [mailto:owner-freebsd-
> scsi at freebsd.org] On Behalf Of Dan Langille
> Sent: Monday, May 30, 2016 12:28 PM
> To: freebsd-scsi at freebsd.org
> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
> 
> > On Apr 25, 2016, at 12:38 PM, Stephen McConnell
> <stephen.mcconnell at broadcom.com> wrote:
> >
> >
> >
> >> -----Original Message-----
> >> From: owner-freebsd-scsi at freebsd.org [mailto:owner-freebsd-
> >> scsi at freebsd.org] On Behalf Of Dan Langille
> >> Sent: Monday, April 25, 2016 9:40 AM
> >> To: freebsd-scsi at freebsd.org
> >> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
> >>
> >>> On Apr 25, 2016, at 8:17 AM, Dan Langille <dan at langille.org> wrote:
> >>>
> >>>>
> >>>> On Apr 24, 2016, at 9:35 AM, Dan Langille <dan at langille.org> wrote:
> >>>>
> >>>> More of the pasted output is also at
> >> https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae
> >> <https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae>
> >> and added smartctl output.
> >>>>
> >>>> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI
> >>>> SAS2008
> >> card.  It's running a zfs root system.
> >>>>
> >>>> This morning the system was unresponsive via ssh. Attempts to log
> >>>> in at
> >> the console did not yield a password prompt.
> >>>>
> >>>> A power cycle brought the system online.  Inspecting
> >>>> /var/log/messages,
> > I
> >> found about 63,000 entries similar to those which appear below.
> >>>>
> >>>> zpool status of all are OK. A scrub is in progress for one pool
> >>>> (since
> > before
> >> this issue arose). da7 is in that pool.
> >>>>
> >>>>
> >>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28
> >>>> 00 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b
> >>>> scsi
> >>>> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0):
> >>>> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614
> >>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
> >>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00
> >>>> 20
> >>>> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0
> >>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28
> >>>> 00 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc
> >>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel:
(da7:mps1:0:17:0):
> >>>> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674
> >>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
> >>>> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00
> >>>> 18
> >>>> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer
> >>>> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB:
> >>>> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc
> >>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel:
> >>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00
> >>>> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0 Apr
> >>>> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b
> >>>> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi
> >>>> 0 state c xfer 0
> >>>>
> >>>> Is this a cabling issue?  The drive is a SATA device (smartctl
> >>>> output
> > in the
> >> URL above).  Anyone familiar with these errors?
> >>>
> >>> This morning:
> >>>
> >>> 13410079654596185797  REMOVED      0     0     0  was /dev/da7p3
> >>>
> >>> At least I know i'm looking for Serial Number: 13Q8PNBYS
> >>>
> >>> From the logs:
> >>>
> >>> Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun
> >>> 0 Apr 25 05:34:50 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> s/n
> >> 13Q8PNBYS detached
> 
> Just for the record, this happened again this morning. Fixed by power
cycle.
> 
> May 30 03:22:08 knew kernel: mps1: mpssas_prepare_remove: Sending reset
> for target ID 17 May 30 03:22:10 knew kernel: da7 at mps1 bus 0 scbus1
target
> 17 lun 0
> May 30 03:22:10 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> s/n
> 13Q8PNBYS detached
> May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8c 5c
> 91 c0 00 00 08 00 length 4096 SMID 179 terminated ioc 804b scsi 0 state c
xfer
> 0 May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 6b
> bf db a0 00 00 f0 00 length 122880 SMID 938 terminated ioc 804b scsi 0
state c
> xf(da7:mps1:0:17:0): READ(10). CDB: 28 00 8c 5c 91 c0 00 00 08 00 May 30
> 03:22:10 knew kernel: er 122880
> 
I just realized that you're using mps, not mpr.  The fix went into the mpr
driver, but not mps yet.  It'll have to be ported over to mps.

Steve

> >>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
> >>> d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b scsi 0
> >>> state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0):
> >>> READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204
> >>> terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10).
CDB:
> > 28
> >> 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25
05:34:51
> >> knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue
> >> Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB:
> 28
> >> 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b
scsi
> > 0
> >> state c xfer(da7: 0
> >>> Apr 25 05:34:51 knew kernel: mps1:0:    (da7:mps1:0:17:0): READ(10).
> > CDB:
> >> 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc 804b
> > scsi 0
> >> state c xfer 17:0
> >>> Apr 25 05:34:51 knew kernel: 0):        (da7:mps1:0:17:0): WRITE(10).
> > CDB: 2a
> >> 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b
scsi
> > 0
> >> state c xfeError 5, Periph was invalidated
> >>> Apr 25 05:34:51 knew kernel: r 0
> >>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
> >>> d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel:
> >>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 length
> >>> 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer
> >>> (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request Apr 25
> >>> 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1:
> >>> mps1:0:IOCStatus = 0x4b while resetting device 0xa Apr 25 05:34:51
> >>> knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25
> >>> 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25 05:34:51
> >>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00
> >>> 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
> >>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
> >>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
> >>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00
> >>> 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
> >>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
> >>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
> >>> knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00
> >>> 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
> >> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
> >> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 knew
> >> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00
> > Apr
> >> 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally
> > Re-
> >> queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5,
> >> Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR:
> Device
> >> swap: provider da7p2 disconnected.
> >>> Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS
'vdev
> > is
> >> removed, pool_guid=15378250086669402288
> >> vdev_guid=13410079654596185797''
> >>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed Apr
> >>> 25 05:34:51 knew ZFS: vdev is removed,
> pool_guid=15378250086669402288
> >>> vdev_guid=13410079654596185797
> >>
> >> Current status: after powering off the box, reseating the cables for
that
> > drive, I
> >> powered up the system and a resilver commenced which completed in 30
> >> minutes.
> >>
> >> Seems OK now.  I am not sure if the two events are related.
> >
> > Recently, a bug was uncovered where a device is gets 'lost'.
> >
> > Here's what happens:
> > The first message in your "failure on Monday"  log is for
> > 'mpssas_prepare_remove'.  This message is likely logged because the FW
> sends
> > an event to the driver that the device is no longer responsive (pulled,
> > cable issue, or something else).  When the driver gets this event, it
sends
> > a reset to the device to clear out any pending I/O.  This is where the
> > 'terminated ioc' messages come from.  When the reset completes, the
driver
> > is supposed to send a SAS_IO_UNIT message to FW so that the DevHandle
> for
> > that disk is removed from FW's list.  Then, when the device comes back
> > on-line, everything is fine. But, with this bug, before that SAS_IO_UNIT
> > message is sent to FW, the driver exits the function where that happens
> > (mpssas_remove_device).  This happens where you see the log message,
> > "IOCStatus - 0x4b while resetting device 0x0a".  The driver logs that
> > message and then exits.  What the driver should do is log that message
and
> > continue on to send the SAS_IO_UNIT message to FW.  The fix is to remove
> the
> > two lines in the driver shown here with '>>':
> >
> > 	if (le16toh(reply->IOCStatus) != MPI2_IOCSTATUS_SUCCESS) {
> > 		mps_printf(sc, "IOCStatus = 0x%x while resetting device
> > 0x%x\n",
> > 		   le16toh(reply->IOCStatus), handle);
> >>> 		mpssas_free_tm(sc, tm);
> >>> 		return;
> > 	}
> >
> > A reboot will solve the problem, as you saw, but the real fix is to
remove
> > the DevHandle as described above.  This fix will go into the driver
before
> > the next scheduled release and then MFC'd to 10.x.
> >
> > Steve
> >
> >>
> >> --
> >> Dan Langille - BSDCan / PGCon
> >> dan at langille.org
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> freebsd-scsi at freebsd.org mailing list
> >> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> >> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"
> >
> 
> _______________________________________________
> freebsd-scsi at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"