Stable/9 from today mpssas_scsiio timeouts

Tue Jul 9 14:46:32 UTC 2013

On Tue, Jul 09, 2013 at 09:47:01AM -0400, Outback Dingo wrote:
> On Tue, Jul 9, 2013 at 9:44 AM, Outback Dingo <outbackdingo at gmail.com>wrote:
> > On Tue, Jul 9, 2013 at 8:39 AM, Jeremy Chadwick <jdc at koitsu.org> wrote:
> >
> >> On Tue, Jul 09, 2013 at 05:32:39AM -0400, Outback Dingo wrote:
> >> > as of stable today im seeing alot of new mps time outs
> >> >
> >> > 9.1-STABLE FreeBSD 9.1-STABLE #0 r253035M: Mon Jul  8 16:34:28 UTC 2013
> >> > root@:/usr/obj/nas/usr/src/sys/
> >> >
> >> > mps1 at pci0:130:0:0:      class=0x010700 card=0x30201000 chip=0x00721000
> >> > rev=0x03 hdr=0x00
> >> >     vendor     = 'LSI Logic / Symbios Logic'
> >> >     device     = 'SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]'
> >> >     class      = mass storage
> >> >     subclass   = SAS
> >> >
> >> >
> >> > mps0: mpssas_scsiio_timeout checking sc 0xffffff8002145000 cm
> >> > 0xffffff80021a6b78
> >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID
> >> 983
> >> > command timeout cm 0xffffff80021a6b78 ccb 0xfffffe002bb5f800
> >> > mps0: mpssas_alloc_tm freezing simq
> >> > mps0: timedout cm 0xffffff80021a6b78 allocated tm 0xffffff80021587b0
> >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID
> >> 983
> >> > completed timedout cm 0xffffff80021a6b78 ccb 0xfffffe002bb5f800 during
> >> > recovery ioc 8048 scsi 0 state c xfer 0
> >> > (noperiph:mps0:0:40:0): SMID 6 abort TaskMID 983 status 0x4a code 0x0
> >> count
> >> > 1
> >> > (noperiph:mps0:0:40:0): SMID 6 finished recovery after aborting TaskMID
> >> 983
> >> > mps0: mpssas_free_tm releasing simq
> >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00
> >> > (probe40:mps0:0:40:0): CAM status: Command timeout
> >> > (probe40:mps0:0:40:0): Retrying command
> >> > mps1: mpssas_scsiio_timeout checking sc 0xffffff8002384000 cm
> >> > 0xffffff80023e5b78
> >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID
> >> 983
> >> > command timeout cm 0xffffff80023e5b78 ccb 0xfffffe002be14800
> >> > mps1: mpssas_alloc_tm freezing simq
> >> > mps1: timedout cm 0xffffff80023e5b78 allocated tm 0xffffff80023977b0
> >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID
> >> 983
> >> > completed timedout cm 0xffffff80023e5b78 ccb 0xfffffe002be14800 during
> >> > recovery ioc 8048 scsi 0 state c xfer 0
> >> > (noperiph:mps1:0:37:0): SMID 6 abort TaskMID 983 status 0x4a code 0x0
> >> count
> >> > 1
> >> > (noperiph:mps1:0:37:0): SMID 6 finished recovery after aborting TaskMID
> >> 983
> >> > mps1: mpssas_free_tm releasing simq
> >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00
> >> > (probe292:mps1:0:37:0): CAM status: Command timeout
> >> > (probe292:mps1:0:37:0): Retrying command
> >>
> >> 1. What revision were you running before (i.e. what were you on prior to
> >> the upgrade)?
> >>
> >
> >
> > Sorry I was on 252595 from July 3

And does rolling back to r252595 resolve the problem for you?

Because the only commit I see between r253035 and r252595 that might
account for some kind of behavioural change, unless I missed one while
skimming the commit history, is the following:

r252730 -- http://www.freshbsd.org/commit/freebsd/r252730

If at all possible, please try updating to r253037 or newer to see
if that has some effect/improvement.  Why I mention that commit:

r253037 -- http://www.freshbsd.org/commit/freebsd/r253037

Because the only mps(4) changes done in recent days are:

http://svnweb.freebsd.org/base/stable/9/sys/dev/mps/mps_sas.c?view=log

r253037
r251899
r251874

Else I'd say what you're experiencing is legitimate/unrelated to kernel
changes.  I can only speculate.

The messages to me indicate that some part of the kernel is submitting a
SCSI INQUIRY request to the underlying device(s) which results in a CAM
timeout, i.e. the disk attached to the controller did not respond
promptly (while the controller seemed to be alive/well).

If these disks (which we do not know the type of -- no dmesg provided,
etc.) are SSDs then TRIM behaviour is possibly causing the drive to take
too long to perform its TRIM cleanup, or, the drives themselves are
doing some kind of garbage collection which is taking quite a long time.

Steven et all may have a different (and almost certainly more accurate)
analysis.

It would really help if you could provide "dmesg" from the machine, as
well as any details about your setup (if ZFS, "zpool status", etc.), in
addition to (if SSDs) "sysctl -a | grep -i trim".  All this matters.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |