Avago LSI SAS 3008 & Intel SSD Timeouts

Steven Hartland killing at multiplay.co.uk
Tue Jun 7 23:30:18 UTC 2016


Oh another thing to test is iirc 3008 is supported by mrsas so you might 
want to try adding the following into loader.conf to switch drivers:
hw.mfi.mrsas_enable="1"

On 07/06/2016 23:43, list-news wrote:
> No, it threw errors on both da6 and da7 and then I stopped it.
>
> Your last e-mail gave me thoughts though.  I have a server with 2008 
> controllers (entirely different backplane design, cpu, memory, etc).  
> I've moved the 4 drives to that and I'm running the test now.
>
> # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215
> # sysctl dev.mps.0
> dev.mps.0.spinup_wait_time: 3
> dev.mps.0.chain_alloc_fail: 0
> dev.mps.0.enable_ssu: 1
> dev.mps.0.max_chains: 2048
> dev.mps.0.chain_free_lowwater: 1176
> dev.mps.0.chain_free: 2048
> dev.mps.0.io_cmds_highwater: 510
> dev.mps.0.io_cmds_active: 0
> dev.mps.0.driver_version: 20.00.00.00-fbsd
> dev.mps.0.firmware_version: 17.00.01.00
> dev.mps.0.disable_msi: 0
> dev.mps.0.disable_msix: 0
> dev.mps.0.debug_level: 3
> dev.mps.0.%parent: pci5
> dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 
> subdevice=0x3020 class=0x010700
> dev.mps.0.%location: slot=0 function=0
> dev.mps.0.%driver: mps
> dev.mps.0.%desc: Avago Technologies (LSI) SAS2008
>
> About 1.5 hours has passed at full load, no errors.
>
> gstat drive busy% seems to hang out around 30-40 instead of ~60-70.  
> Overall throughput seems to be 20-30% less with my rough benchmarks.
>
> I'm not sure if this gets us closer to the answer, if this doesn't 
> time-out on the 2008 controller, it looks like one of these:
> 1) The Intel drive firmware is being overloaded somehow when connected 
> to the 3008.
> or
> 2) The 3008 firmware or driver has an issue reading drive responses, 
> sporadically thinking the command timed-out (when maybe it really 
> didn't).
>
> Puzzle pieces:
> A) Why does setting tags of 1 on drives connected to the 3008 fix the 
> problem?
> B) With tags of 255.  Why does postgres (and assuming a large fsync 
> count), seem to cause the problem within minutes?  While running other 
> heavy i/o commands (zpool scrub, bonnie++, fio), all of which show 
> similarly high or higher iops take hours to cause the problem (if ever).
>
> I'll let this continue to run to further test.
>
> Thanks again for all the help.
>
> -Kyle
>
> On 6/7/16 4:22 PM, Steven Hartland wrote:
>> Always da6?
>>
>> On 07/06/2016 21:19, list-news wrote:
>>> Sure Steve:
>>>
>>> # cat /boot/loader.conf | grep trim
>>> vfs.zfs.trim.enabled=0
>>>
>>> # sysctl vfs.zfs.trim.enabled
>>> vfs.zfs.trim.enabled: 0
>>>
>>> # uptime
>>> 3:14PM  up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07
>>>
>>> # tail -f /var/log/messages:
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 
>>> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm 
>>> 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010)
>>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
>>> connector name (    )
>>> Jun  7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 
>>> allocated tm 0xfffffe0001322150
>>> Jun  7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 
>>> Aborting command 0xfffffe0001375580
>>> Jun  7 15:13:50 s18 kernel: mpr0: Sending reset from 
>>> mprsas_send_abort for target ID 16
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE 
>>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 
>>> command timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 
>>> 16, handle(0x0010)
>>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
>>> connector name (    )
>>> Jun  7 15:13:50 s18 kernel: mpr0: queued timedout cm 
>>> 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150
>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>> Jun  7 15:13:50 s18 kernel: Flags: 1<InProgress>
>>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Started
>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>>> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm 
>>> 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b 
>>> scsi 0 state c xfer 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>>> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b 
>>> scsi 0 state c xfer 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>>> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm 
>>> 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b 
>>> scsi 0 state c xfer 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>>> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b 
>>> scsi 0 state c xfer 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>>> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm 
>>> 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b 
>>> scsi 0 state c xfer 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>>> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b 
>>> scsi 0 state c xfer 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 
>>> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm 
>>> 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 
>>> scsi 0 state c    (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 
>>> 00 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm 
>>> 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 
>>> b0 00
>>> Jun  7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during 
>>> recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE 
>>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 
>>> terminated ioc 804b scsi 0 sta(da6:te c xfer 0
>>> Jun  7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort 
>>> TaskMID 1016 status 0x0 code 0x0 count 5
>>> Jun  7 15:13:50 s18 kernel: 16:    (xpt0:mpr0:0:16:0): SMID 1 
>>> finished recovery after aborting TaskMID 1016
>>> Jun  7 15:13:50 s18 kernel: 0): mpr0: Retrying command
>>> Jun  7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 4
>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>> Jun  7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>> Jun  7 15:13:50 s18 kernel: EnclosureHandle: 0x2
>>> Jun  7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
>>> Jun  7 15:13:50 s18 kernel: NumPhys: 31
>>> Jun  7 15:13:50 s18 kernel: NumEntries: 1
>>> Jun  7 15:13:50 s18 kernel: StartPhyNum: 8
>>> Jun  7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>> Jun  7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
>>> Jun  7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
>>> Jun  7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
>>> Jun  7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on  
>>> Event: [16]
>>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: 
>>> [16]
>>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on  
>>> Event: [1c]
>>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: 
>>> [1c]
>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>> Jun  7 15:13:50 s18 kernel: Flags: 0
>>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on  
>>> Event: [16]
>>> Jun  7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: 
>>> [16]
>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE 
>>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI 
>>> Status Error
>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check 
>>> Condition
>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT 
>>> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per 
>>> sense data)
>>>
>>> -Kyle
>>>
>>> On 6/7/16 2:53 PM, Steven Hartland wrote:
>>>> CDB: 85 is a TRIM command IIRC, I know you tried it before using 
>>>> BIO delete but assuming your running ZFS can you set the following 
>>>> in loader.conf and see how you get on.
>>>> vfs.zfs.trim.enabled=0
>>>>
>>>>     Regards
>>>>     Steve
>>>
>>>
>>> _______________________________________________
>>> freebsd-scsi at freebsd.org mailing list
>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"
>>
>> _______________________________________________
>> freebsd-scsi at freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"
>
>
> _______________________________________________
> freebsd-scsi at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"



More information about the freebsd-scsi mailing list