Stable/9 from today mpssas_scsiio timeouts

Tue Jul 9 15:56:56 UTC 2013

On Tue, Jul 09, 2013 at 11:46:24AM -0400, Outback Dingo wrote:
> On Tue, Jul 9, 2013 at 11:30 AM, Jeremy Chadwick <jdc at koitsu.org> wrote:
> 
> > On Tue, Jul 09, 2013 at 11:20:45AM -0400, Outback Dingo wrote:
> > > On Tue, Jul 9, 2013 at 10:46 AM, Jeremy Chadwick <jdc at koitsu.org> wrote:
> > >
> > > > On Tue, Jul 09, 2013 at 09:47:01AM -0400, Outback Dingo wrote:
> > > > > > On Tue, Jul 9, 2013 at 9:44 AM, Outback Dingo <outbackdingo at gmail.com> wrote:
> > > > > > > On Tue, Jul 9, 2013 at 8:39 AM, Jeremy Chadwick <jdc at koitsu.org> wrote:
> > > > > >
> > > > > >> On Tue, Jul 09, 2013 at 05:32:39AM -0400, Outback Dingo wrote:
> > > > > >> > As of stable today I'm seeing a lot of new mps timeouts
> > > > > >> >
> > > > > >> > 9.1-STABLE FreeBSD 9.1-STABLE #0 r253035M: Mon Jul  8 16:34:28 UTC 2013
> > > > > >> > root@:/usr/obj/nas/usr/src/sys/
> > > > > >> >
> > > > > >> > mps1 at pci0:130:0:0:      class=0x010700 card=0x30201000 chip=0x00721000
> > > > > >> > rev=0x03 hdr=0x00
> > > > > >> >     vendor     = 'LSI Logic / Symbios Logic'
> > > > > >> >     device     = 'SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]'
> > > > > >> >     class      = mass storage
> > > > > >> >     subclass   = SAS
> > > > > >> >
> > > > > >> >
> > > > > >> > mps0: mpssas_scsiio_timeout checking sc 0xffffff8002145000 cm
> > > > > >> > 0xffffff80021a6b78
> > > > > >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 983
> > > > > >> > command timeout cm 0xffffff80021a6b78 ccb 0xfffffe002bb5f800
> > > > > >> > mps0: mpssas_alloc_tm freezing simq
> > > > > >> > mps0: timedout cm 0xffffff80021a6b78 allocated tm 0xffffff80021587b0
> > > > > >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 983
> > > > > >> > completed timedout cm 0xffffff80021a6b78 ccb 0xfffffe002bb5f800 during
> > > > > >> > recovery ioc 8048 scsi 0 state c xfer 0
> > > > > >> > (noperiph:mps0:0:40:0): SMID 6 abort TaskMID 983 status 0x4a code 0x0 count
> > > > > >> > 1
> > > > > >> > (noperiph:mps0:0:40:0): SMID 6 finished recovery after aborting TaskMID 983
> > > > > >> > mps0: mpssas_free_tm releasing simq
> > > > > >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00
> > > > > >> > (probe40:mps0:0:40:0): CAM status: Command timeout
> > > > > >> > (probe40:mps0:0:40:0): Retrying command
> > > > > >> > mps1: mpssas_scsiio_timeout checking sc 0xffffff8002384000 cm
> > > > > >> > 0xffffff80023e5b78
> > > > > >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 983
> > > > > >> > command timeout cm 0xffffff80023e5b78 ccb 0xfffffe002be14800
> > > > > >> > mps1: mpssas_alloc_tm freezing simq
> > > > > >> > mps1: timedout cm 0xffffff80023e5b78 allocated tm 0xffffff80023977b0
> > > > > >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 983
> > > > > >> > completed timedout cm 0xffffff80023e5b78 ccb 0xfffffe002be14800 during
> > > > > >> > recovery ioc 8048 scsi 0 state c xfer 0
> > > > > >> > (noperiph:mps1:0:37:0): SMID 6 abort TaskMID 983 status 0x4a code 0x0 count
> > > > > >> > 1
> > > > > >> > (noperiph:mps1:0:37:0): SMID 6 finished recovery after aborting TaskMID 983
> > > > > >> > mps1: mpssas_free_tm releasing simq
> > > > > >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00
> > > > > >> > (probe292:mps1:0:37:0): CAM status: Command timeout
> > > > > >> > (probe292:mps1:0:37:0): Retrying command
> > > > > >>
> > > > > >> 1. What revision were you running before (i.e. what were you on prior to
> > > > > >> the upgrade)?
> > > > > >>
> > > > > >
> > > > > >
> > > > > > Sorry I was on 252595 from July 3
> > > >
> > > > And does rolling back to r252595 resolve the problem for you?
> > > >
> > > > I ask because the only commit I see between r252595 and r253035 that might
> > > > account for some kind of behavioural change, unless I missed one while
> > > > skimming the commit history, is the following:
> > > >
> > > > r252730 -- http://www.freshbsd.org/commit/freebsd/r252730
> > > >
> > > > If at all possible, please try updating to r253037 or newer to see
> > > > if that has some effect/improvement.  Why I mention that commit:
> > > >
> > > > r253037 -- http://www.freshbsd.org/commit/freebsd/r253037
> > > >
> > > > The only mps(4) changes made in recent days are:
> > > >
> > > > http://svnweb.freebsd.org/base/stable/9/sys/dev/mps/mps_sas.c?view=log
> > > >
> > > > r253037
> > > > r251899
> > > > r251874
> > > >
> > >
> > > I can say this: it's between July 4 and 253048. I'm rolling back to 252723 to
> > > validate a known-good working state.
> >
> > Looking at your dmesg, it looks like the "errors" might be for SAS ports
> > which don't have any actual devices (disks) attached to them, yet parts
> > of the kernel (not sure which layer) are still trying to submit INQUIRY
> > commands to those ports as if they did have disks attached.
> >
> > It looks like you see this behaviour on boot up, and then later during
> > normal operation at some point (a LUN scan or rescan or "bus taste"
> > might cause this to happen; for example I know that "zpool import" in
> > effect can sometimes cause this behaviour -- on one of my systems "zpool
> > import" would cause the server's floppy drive to spin up/chunk briefly).
> >
> > I'm hoping Steven or mav@ might be able to confirm/deny my theory here.
> >
> 
> I see it even when trying to write to the pool via NFS or FTP, which now even
> times out on large files. It was all working, and there are 2 controllers set
> up in an HA configuration, but they did work fine before, so I'll roll back and
> try an earlier kernel, then walk forward till I hit the problem. My only issue
> was that I moved forward to get the newer ixgbe driver and others just committed
> to stable, only to find that SAS was now quirky. Welcome to stable. Either way,
> the overall performance of this box has been in question; I just haven't been
> able to confirm whether it's the enclosure, the NIC, or the zpool, which is
> degraded, but 40MB/s via NFS on a 10GbE NIC isn't good. So tweaking and testing
> seems to be moot until the box is at least stable again. I do appreciate the
> insight, and will do whatever's needed to hammer down the issue so it can be
> resolved.
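
If you do end up walking forward from a known-good kernel, a binary search
over the revision range needs far fewer builds than stepping one commit at a
time. Below is a rough sketch only, assuming the breakage comes from a single
commit and every later revision stays broken. `is_bad` is a hypothetical
placeholder for "build and boot that revision, then check for the timeouts";
nothing here talks to svn.

```python
from typing import Callable, Sequence


def first_bad_revision(revisions: Sequence[int],
                       is_bad: Callable[[int], bool]) -> int:
    """Return the earliest revision for which is_bad() is true.

    Assumes revisions is sorted ascending, revisions[0] is known good,
    revisions[-1] is known bad, and everything after the first bad
    revision is also bad.
    """
    lo, hi = 0, len(revisions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # breakage is at mid or earlier
        else:
            lo = mid  # breakage is after mid
    return revisions[hi]


# Range taken from this thread: r252595 reported good, r253035 reported bad.
candidates = range(252595, 253036)
```

In practice `is_bad` is the slow part (a kernel build plus a reboot per call),
so narrowing `candidates` to revisions that actually touched stable/9 first
would cut the number of builds further.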

Again, I would strongly suggest trying r253037 or newer first.
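
To check the theory quoted above (that the timeouts hit ports with nothing
attached), it may help to tally the "Command timeout" lines per controller and
target and compare the result against what "camcontrol devlist" reports. The
following is a minimal sketch, assuming the log lines look like the ones you
pasted, i.e. (periph:sim:bus:target:lun); the function name and regex are mine,
not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches e.g. "(probe40:mps0:0:40:0): CAM status: Command timeout"
# and captures the controller ("mps0") and the target number ("40").
TIMEOUT_RE = re.compile(
    r"\(\w+:(mps\d+):\d+:(\d+):[0-9a-f]+\): CAM status: Command timeout"
)


def count_timeouts(log_text: str) -> Counter:
    """Count CAM command timeouts per (controller, target) pair."""
    counts = Counter()
    for match in TIMEOUT_RE.finditer(log_text):
        counts[(match.group(1), int(match.group(2)))] += 1
    return counts


sample = """\
(probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe40:mps0:0:40:0): CAM status: Command timeout
(probe40:mps0:0:40:0): Retrying command
(probe292:mps1:0:37:0): CAM status: Command timeout
"""
timeouts = count_timeouts(sample)
```

If every (controller, target) pair that shows up in the tally is absent from
the device list, that would support the empty-port explanation; if real disks
show up there too, it would not.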

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |