Avago LSI SAS 3008 & Intel SSD Timeouts

Steven Hartland killing at multiplay.co.uk
Tue Jun 7 23:28:37 UTC 2016


If that works I'd switch the 3008 into the machine that currently has 
the 2008 and retest.  That will help confirm whether the 3008 card and 
driver are a potential issue.

On 07/06/2016 23:43, list-news wrote:
> No, it threw errors on both da6 and da7 and then I stopped it.
>
> Your last e-mail gave me an idea, though.  I have a server with 2008 
> controllers (entirely different backplane design, cpu, memory, etc.).  
> I've moved the 4 drives to that and I'm running the test now.
>
> # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215
> # sysctl dev.mps.0
> dev.mps.0.spinup_wait_time: 3
> dev.mps.0.chain_alloc_fail: 0
> dev.mps.0.enable_ssu: 1
> dev.mps.0.max_chains: 2048
> dev.mps.0.chain_free_lowwater: 1176
> dev.mps.0.chain_free: 2048
> dev.mps.0.io_cmds_highwater: 510
> dev.mps.0.io_cmds_active: 0
> dev.mps.0.driver_version: 20.00.00.00-fbsd
> dev.mps.0.firmware_version: 17.00.01.00
> dev.mps.0.disable_msi: 0
> dev.mps.0.disable_msix: 0
> dev.mps.0.debug_level: 3
> dev.mps.0.%parent: pci5
> dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 
> subdevice=0x3020 class=0x010700
> dev.mps.0.%location: slot=0 function=0
> dev.mps.0.%driver: mps
> dev.mps.0.%desc: Avago Technologies (LSI) SAS2008
>
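> For comparison, the equivalent information on the 3008 box should sit 
> under the mpr(4) sysctl tree rather than mps(4), since mpr is the SAS3 
> driver.  A quick check, assuming that controller is also unit 0, would 
> be something like:
>
> # sysctl dev.mpr.0.driver_version dev.mpr.0.firmware_version
> # sysctl dev.mpr.0.debug_level
>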
> About 1.5 hours has passed at full load, no errors.
>
> gstat drive busy% seems to hover around 30-40% instead of ~60-70%.  
> Overall throughput seems to be 20-30% lower in my rough benchmarks.
>
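> (For reference, the busy% figures above come from watching gstat; a 
> minimal invocation, assuming the four drives show up as da4-da7 on this 
> box, would be something like the line below, where -f filters providers 
> by regex and -I sets the sampling interval.)
>
> # gstat -I 1s -f '^da[4-7]$'
>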
> I'm not sure if this gets us closer to the answer, but if it doesn't 
> time out on the 2008 controller, it looks like one of these:
> 1) The Intel drive firmware is being overloaded somehow when connected 
> to the 3008.
> or
> 2) The 3008 firmware or driver has an issue reading drive responses, 
> sporadically deciding that a command timed out (when maybe it really 
> didn't).
>
> Puzzle pieces:
> A) Why does setting tags to 1 on drives connected to the 3008 fix the 
> problem?  (See the camcontrol sketch after this list.)
> B) With tags at 255, why does postgres (and presumably a large fsync 
> count) seem to cause the problem within minutes, while other heavy i/o 
> commands (zpool scrub, bonnie++, fio), all of which show similarly high 
> or higher iops, take hours to cause it (if ever)?
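>
> (Re A: on FreeBSD, per-drive tag counts are typically set with 
> camcontrol; a minimal sketch, assuming the affected drives are da6 and 
> da7, is below.  "camcontrol tags <dev> -N <count>" sets the number of 
> tagged openings, and -v shows the current values.)
>
> # camcontrol tags da6 -v
> # camcontrol tags da6 -N 1
> # camcontrol tags da7 -N 1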
>
> I'll let this continue to run to further test.
>
> Thanks again for all the help.


