Avago LSI SAS 3008 & Intel SSD Timeouts

David Gwynne david at gwynne.id.au
Tue Jun 7 23:39:45 UTC 2016


> On 8 Jun 2016, at 09:30, Steven Hartland <killing at multiplay.co.uk> wrote:
> 
> Oh another thing to test is iirc 3008 is supported by mrsas so you might want to try adding the following into loader.conf to switch drivers:
> hw.mfi.mrsas_enable="1"

i believe the 3008s can run two different firmwares, one that provides the mpt2 interface and the other than provides the megaraid sas fusion interface. you have to flash them to switch though, you cant just point a driver at it and hope for the best.

each fw presents different pci ids. eg, in http://pciids.sourceforge.net/v2.2/pci.ids you can see:

	005f  MegaRAID SAS-3 3008 [Fury]
	0097  SAS3008 PCI-Express Fusion-MPT SAS-3

dlg

> 
> On 07/06/2016 23:43, list-news wrote:
>> No, it threw errors on both da6 and da7 and then I stopped it.
>> 
>> Your last e-mail gave me thoughts though.  I have a server with 2008 controllers (entirely different backplane design, cpu, memory, etc).  I've moved the 4 drives to that and I'm running the test now.
>> 
>> # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215
>> # sysctl dev.mps.0
>> dev.mps.0.spinup_wait_time: 3
>> dev.mps.0.chain_alloc_fail: 0
>> dev.mps.0.enable_ssu: 1
>> dev.mps.0.max_chains: 2048
>> dev.mps.0.chain_free_lowwater: 1176
>> dev.mps.0.chain_free: 2048
>> dev.mps.0.io_cmds_highwater: 510
>> dev.mps.0.io_cmds_active: 0
>> dev.mps.0.driver_version: 20.00.00.00-fbsd
>> dev.mps.0.firmware_version: 17.00.01.00
>> dev.mps.0.disable_msi: 0
>> dev.mps.0.disable_msix: 0
>> dev.mps.0.debug_level: 3
>> dev.mps.0.%parent: pci5
>> dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 subdevice=0x3020 class=0x010700
>> dev.mps.0.%location: slot=0 function=0
>> dev.mps.0.%driver: mps
>> dev.mps.0.%desc: Avago Technologies (LSI) SAS2008
>> 
>> About 1.5 hours has passed at full load, no errors.
>> 
>> gstat drive busy% seems to hang out around 30-40 instead of ~60-70.  Overall throughput seems to be 20-30% less with my rough benchmarks.
>> 
>> I'm not sure if this gets us closer to the answer, if this doesn't time-out on the 2008 controller, it looks like one of these:
>> 1) The Intel drive firmware is being overloaded somehow when connected to the 3008.
>> or
>> 2) The 3008 firmware or driver has an issue reading drive responses, sporadically thinking the command timed-out (when maybe it really didn't).
>> 
>> Puzzle pieces:
>> A) Why does setting tags of 1 on drives connected to the 3008 fix the problem?
>> B) With tags of 255.  Why does postgres (and assuming a large fsync count), seem to cause the problem within minutes?  While running other heavy i/o commands (zpool scrub, bonnie++, fio), all of which show similarly high or higher iops take hours to cause the problem (if ever).
>> 
>> I'll let this continue to run to further test.
>> 
>> Thanks again for all the help.
>> 
>> -Kyle
>> 
>> On 6/7/16 4:22 PM, Steven Hartland wrote:
>>> Always da6?
>>> 
>>> On 07/06/2016 21:19, list-news wrote:
>>>> Sure Steve:
>>>> 
>>>> # cat /boot/loader.conf | grep trim
>>>> vfs.zfs.trim.enabled=0
>>>> 
>>>> # sysctl vfs.zfs.trim.enabled
>>>> vfs.zfs.trim.enabled: 0
>>>> 
>>>> # uptime
>>>> 3:14PM  up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07
>>>> 
>>>> # tail -f /var/log/messages:
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010)
>>>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name (    )
>>>> Jun  7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 allocated tm 0xfffffe0001322150
>>>> Jun  7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0001375580
>>>> Jun  7 15:13:50 s18 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 16
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, handle(0x0010)
>>>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name (    )
>>>> Jun  7 15:13:50 s18 kernel: mpr0: queued timedout cm 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150
>>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>>> Jun  7 15:13:50 s18 kernel: Flags: 1<InProgress>
>>>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Started
>>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b scsi 0 state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 0 state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b scsi 0 state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 0 state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b scsi 0 state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 0 state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 scsi 0 state c    (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00
>>>> Jun  7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 terminated ioc 804b scsi 0 sta(da6:te c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort TaskMID 1016 status 0x0 code 0x0 count 5
>>>> Jun  7 15:13:50 s18 kernel: 16:    (xpt0:mpr0:0:16:0): SMID 1 finished recovery after aborting TaskMID 1016
>>>> Jun  7 15:13:50 s18 kernel: 0): mpr0: Retrying command
>>>> Jun  7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
>>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 4
>>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>>> Jun  7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
>>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>>> Jun  7 15:13:50 s18 kernel: EnclosureHandle: 0x2
>>>> Jun  7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
>>>> Jun  7 15:13:50 s18 kernel: NumPhys: 31
>>>> Jun  7 15:13:50 s18 kernel: NumEntries: 1
>>>> Jun  7 15:13:50 s18 kernel: StartPhyNum: 8
>>>> Jun  7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
>>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>>> Jun  7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
>>>> Jun  7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
>>>> Jun  7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on  Event: [16]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: [16]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on  Event: [1c]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: [1c]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>>> Jun  7 15:13:50 s18 kernel: Flags: 0
>>>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
>>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on  Event: [16]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: [16]
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per sense data)
>>>> 
>>>> -Kyle
>>>> 
>>>> On 6/7/16 2:53 PM, Steven Hartland wrote:
>>>>> CDB: 85 is a TRIM command IIRC, I know you tried it before using BIO delete but assuming your running ZFS can you set the following in loader.conf and see how you get on.
>>>>> vfs.zfs.trim.enabled=0
>>>>> 
>>>>>    Regards
>>>>>    Steve
>>>> 
>>>> 
>>>> _______________________________________________
>>>> freebsd-scsi at freebsd.org mailing list
>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"
>>> 
>>> _______________________________________________
>>> freebsd-scsi at freebsd.org mailing list
>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"
>> 
>> 
>> _______________________________________________
>> freebsd-scsi at freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"
> 
> _______________________________________________
> freebsd-scsi at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"



More information about the freebsd-scsi mailing list