Stable/9 from today mpssas_scsiio timeouts

Tue Jul 9 15:20:46 UTC 2013

On Tue, Jul 9, 2013 at 10:46 AM, Jeremy Chadwick <jdc at koitsu.org> wrote:

> On Tue, Jul 09, 2013 at 09:47:01AM -0400, Outback Dingo wrote:
> > On Tue, Jul 9, 2013 at 9:44 AM, Outback Dingo <outbackdingo at gmail.com
> >wrote:
> > > On Tue, Jul 9, 2013 at 8:39 AM, Jeremy Chadwick <jdc at koitsu.org>
> wrote:
> > >
> > >> On Tue, Jul 09, 2013 at 05:32:39AM -0400, Outback Dingo wrote:
> > >> > as of stable today im seeing alot of new mps time outs
> > >> >
> > >> > 9.1-STABLE FreeBSD 9.1-STABLE #0 r253035M: Mon Jul  8 16:34:28 UTC
> 2013
> > >> > root@:/usr/obj/nas/usr/src/sys/
> > >> >
> > >> > mps1 at pci0:130:0:0:      class=0x010700 card=0x30201000
> chip=0x00721000
> > >> > rev=0x03 hdr=0x00
> > >> >     vendor     = 'LSI Logic / Symbios Logic'
> > >> >     device     = 'SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]'
> > >> >     class      = mass storage
> > >> >     subclass   = SAS
> > >> >
> > >> >
> > >> > mps0: mpssas_scsiio_timeout checking sc 0xffffff8002145000 cm
> > >> > 0xffffff80021a6b78
> > >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36
> SMID
> > >> 983
> > >> > command timeout cm 0xffffff80021a6b78 ccb 0xfffffe002bb5f800
> > >> > mps0: mpssas_alloc_tm freezing simq
> > >> > mps0: timedout cm 0xffffff80021a6b78 allocated tm 0xffffff80021587b0
> > >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36
> SMID
> > >> 983
> > >> > completed timedout cm 0xffffff80021a6b78 ccb 0xfffffe002bb5f800
> during
> > >> > recovery ioc 8048 scsi 0 state c xfer 0
> > >> > (noperiph:mps0:0:40:0): SMID 6 abort TaskMID 983 status 0x4a code
> 0x0
> > >> count
> > >> > 1
> > >> > (noperiph:mps0:0:40:0): SMID 6 finished recovery after aborting
> TaskMID
> > >> 983
> > >> > mps0: mpssas_free_tm releasing simq
> > >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00
> > >> > (probe40:mps0:0:40:0): CAM status: Command timeout
> > >> > (probe40:mps0:0:40:0): Retrying command
> > >> > mps1: mpssas_scsiio_timeout checking sc 0xffffff8002384000 cm
> > >> > 0xffffff80023e5b78
> > >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36
> SMID
> > >> 983
> > >> > command timeout cm 0xffffff80023e5b78 ccb 0xfffffe002be14800
> > >> > mps1: mpssas_alloc_tm freezing simq
> > >> > mps1: timedout cm 0xffffff80023e5b78 allocated tm 0xffffff80023977b0
> > >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36
> SMID
> > >> 983
> > >> > completed timedout cm 0xffffff80023e5b78 ccb 0xfffffe002be14800
> during
> > >> > recovery ioc 8048 scsi 0 state c xfer 0
> > >> > (noperiph:mps1:0:37:0): SMID 6 abort TaskMID 983 status 0x4a code
> 0x0
> > >> count
> > >> > 1
> > >> > (noperiph:mps1:0:37:0): SMID 6 finished recovery after aborting
> TaskMID
> > >> 983
> > >> > mps1: mpssas_free_tm releasing simq
> > >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00
> > >> > (probe292:mps1:0:37:0): CAM status: Command timeout
> > >> > (probe292:mps1:0:37:0): Retrying command
> > >>
> > >> 1. What revision were you running before (i.e. what were you on prior
> to
> > >> the upgrade)?
> > >>
> > >
> > >
> > > Sorry I was on 252595 from July 3
>
> And does rolling back to r252595 resolve the problem for you?
>
> Because the only commit I see between r253035 and r252595 that might
> account for some kind of behavioural change, unless I missed one while
> skimming the commit history, is the following:
>
> r252730 -- http://www.freshbsd.org/commit/freebsd/r252730
>
> If at all possible, please try updating to r253037 or newer to see
> if that has some effect/improvement.  Why I mention that commit:
>
> r253037 -- http://www.freshbsd.org/commit/freebsd/r253037
>
> Because the only mps(4) changes done in recent days are:
>
> http://svnweb.freebsd.org/base/stable/9/sys/dev/mps/mps_sas.c?view=log
>
> r253037
> r251899
> r251874
>

i can say this its between July 4, and 253048, im rolling back to 252723 to
validate a good known working state

>
> Else I'd say what you're experiencing is legitimate/unrelated to kernel
> changes.  I can only speculate.
>
> The messages to me indicate that some part of the kernel is submitting a
> SCSI INQUIRY request to the underlying device(s) which results in a CAM
> timeout, i.e. the disk attached to the controller did not respond
> promptly (while the controller seemed to be alive/well).
>
> If these disks (which we do not know the type of -- no dmesg provided,
> etc.) are SSDs then TRIM behaviour is possibly causing the drive to take
> too long to perform its TRIM cleanup, or, the drives themselves are
> doing some kind of garbage collection which is taking quite a long time.
>
> Steven et all may have a different (and almost certainly more accurate)
> analysis.
>
> It would really help if you could provide "dmesg" from the machine, as
> well as any details about your setup (if ZFS, "zpool status", etc.), in
> addition to (if SSDs) "sysctl -a | grep -i trim".  All this matters.
>
> --
> | Jeremy Chadwick                                   jdc at koitsu.org |
> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
> | Making life hard for others since 1977.             PGP 4BD6C0CB |
>
>