[Bug 224496] mpr and mps drivers seems to have issues with large seagate drives
bugzilla-noreply at freebsd.org
Wed May 6 20:21:59 UTC 2020
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224496
--- Comment #41 from Sharad Ahlawat <freebsd at diyit.org> ---
(In reply to Sharad Ahlawat from comment #28)
Symptom:
CAM retry and timeout messages leading to controller aborts and resets
Cause:
slow disks, or the use of SMR disks
Workaround:
Increase the CAM timeout defaults
❯ sysctl kern.cam.da.default_timeout=90
kern.cam.da.default_timeout: 60 -> 90
❯ sysctl kern.cam.ada.default_timeout=60
kern.cam.ada.default_timeout: 30 -> 60
And disable NCQ on SMR Seagates:
❯ cat cam-tags.sh
#!/usr/local/bin/bash
# shrink the Native Command Queue (NCQ) depth down to 1, effectively disabling queuing
for Disk in `camcontrol devlist | grep "ST8000DM" | cut -d"," -f2 | cut -d")" -f1`;
do
    camcontrol tags $Disk -N 1 ;
    # camcontrol tags $Disk -v
done
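Tag depths set with camcontrol do not persist across reboots, so the script needs to be re-run at boot. A minimal sketch, assuming the script was saved as /root/cam-tags.sh and made executable:

# /etc/rc.local - re-apply the NCQ workaround for the SMR Seagates at boot
# (sketch; the /root path is an assumption about where the script lives)
/root/cam-tags.sh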
If you only have SMR disks in your setup and use a UPS, you could also:
❯ sysctl vfs.zfs.cache_flush_disable=1
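None of these sysctl changes survive a reboot on their own; a minimal sketch of persisting them in /etc/sysctl.conf (include the last line only if the SMR-only/UPS caveat above applies to you):

# /etc/sysctl.conf additions
kern.cam.da.default_timeout=90
kern.cam.ada.default_timeout=60
vfs.zfs.cache_flush_disable=1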
Solution:
don't use slow disks or SMR disks with ZFS
The long version:
I feel obliged to post this update, since the driver downgrade workaround I previously posted on this thread predates my getting to the root cause of these entries in my messages file after upgrading to 12.x:
Jan 18 17:29:28 nas kernel: ahcich6: Timeout on slot 8 port 0
Jan 18 17:29:28 nas kernel: ahcich6: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000 cmd 0000c817
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeout
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 0 more tries remain
Jan 18 17:30:00 nas kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jan 18 17:30:30 nas kernel: ahcich6: Timeout on slot 9 port 0
Jan 18 17:30:30 nas kernel: ahcich6: is 00000000 cs 00000200 ss 00000000 rs 00000200 tfd 80 serr 00000000 cmd 0000c917
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): Retrying command, 0 more tries remain
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1039 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1357 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1933 loginfo 31080000
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 37 c8 00 00 00 b0 00 00
Apr 25 22:28:12 nas kernel: mps0: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 38 78 00 00 00 58 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a1 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: Controller reported scsi ioc terminated tgt 11 SMID 621 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 476 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 321 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1873 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1852 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1742 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 387 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 2104 loginfo 31080000
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a2 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a3 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a4 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a5 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a6 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a7 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a8 d8 00 00 00 c0 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 37 c0 00 00 00 08 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 36 c0 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: SCSI Status Error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command (per sense data)
Apr 29 14:07:38 nas kernel: ahcich6: Timeout on slot 22 port 0
Apr 29 14:07:38 nas kernel: ahcich6: is 00000000 cs 00000000 ss 00400000 rs 00400000 tfd 40 serr 00000000 cmd 0000d617
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 80 44 e8 40 c9 01 00 00 00 00
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeout
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 3 more tries remain
Apr 29 14:08:09 nas kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
❯ smartctl -a /dev/da4
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   178   000    Old_age   Always       -       419
Note: every ABORTED COMMAND from the controller increments the ID# 199 raw count.
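A quick way to watch that counter across the affected disks, as a sketch (the device list here matches my setup; adjust for yours):

#!/usr/local/bin/bash
# print the raw UDMA_CRC_Error_Count (ID# 199) for each pool disk
for d in da1 da2 da3 da4; do
    printf '%s: ' "$d"
    smartctl -A /dev/$d | awk '$1 == 199 { print $NF }'
done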
The messages correlate with sustained disk I/O events:
scrubs
large file transfers
cron jobs kicking off in multiple jails that access the same pool, etc.
I ruled out any issues with:
loose cables
cable quality
controller hardware, BIOS, and firmware
the power supply
❯ mpsutil show all
Adapter:
mps0 Adapter:
Board Name: SAS9207-4i4e
Board Assembly: H3-25434-00H
Chip Name: LSISAS2308
Chip Revision: ALL
BIOS Revision: 7.39.02.00
Firmware Revision: 20.00.07.00
Integrated RAID: no
PhyNum CtlrHandle DevHandle Disabled Speed Min Max Device
0 0001 0009 N 6.0 1.5 6.0 SAS Initiator
1 0002 000a N 6.0 1.5 6.0 SAS Initiator
2 0003 000b N 6.0 1.5 6.0 SAS Initiator
3 0004 000c N 6.0 1.5 6.0 SAS Initiator
4 0005 000d N 6.0 1.5 6.0 SAS Initiator
5 0006 000e N 6.0 1.5 6.0 SAS Initiator
6 0007 000f N 6.0 1.5 6.0 SAS Initiator
7 0008 0010 N 6.0 1.5 6.0 SAS Initiator
Devices:
B____T SAS Address Handle Parent Device Speed Enc Slot Wdt
00 10 4433221100000000 0009 0001 SATA Target 6.0 0001 00 1
00 09 4433221101000000 000a 0002 SATA Target 6.0 0001 01 1
00 11 4433221102000000 000b 0003 SATA Target 6.0 0001 02 1
00 07 4433221103000000 000c 0004 SATA Target 6.0 0001 03 1
00 13 4433221104000000 000d 0005 SATA Target 6.0 0001 07 1
00 04 4433221105000000 000e 0006 SATA Target 6.0 0001 06 1
00 14 4433221106000000 000f 0007 SATA Target 6.0 0001 05 1
00 12 4433221107000000 0010 0008 SATA Target 6.0 0001 04 1
Enclosures:
Slots Logical ID SEPHandle EncHandle Type
08 500605b0073400a0 0001 Direct Attached SGPIO
Expanders:
NumPhys SAS Address DevHandle Parent EncHandle SAS Level
❯ camcontrol devlist
<Samsung SSD 850 PRO 256GB EXM02B6Q> at scbus0 target 0 lun 0 (ada0,pass0)
<Samsung SSD 850 PRO 256GB EXM02B6Q> at scbus1 target 0 lun 0 (ada1,pass1)
<Samsung SSD 860 EVO 500GB RVT03B6Q> at scbus2 target 0 lun 0 (ada2,pass2)
<Samsung SSD 860 EVO 500GB RVT03B6Q> at scbus3 target 0 lun 0 (ada3,pass3)
<ST8000DM004-2CX188 0001> at scbus4 target 0 lun 0 (ada4,pass4)
<ST8000DM004-2CX188 0001> at scbus5 target 0 lun 0 (ada5,pass5)
<ST8000DM004-2CX188 0001> at scbus6 target 0 lun 0 (ada6,pass6)
<ST8000DM004-2CX188 0001> at scbus7 target 0 lun 0 (ada7,pass7)
<WDC WD20EFRX-68EUZN0 80.00A80> at scbus8 target 0 lun 0 (ada8,pass8)
<WDC WD20EFRX-68EUZN0 80.00A80> at scbus9 target 0 lun 0 (ada9,pass9)
<ATA ST4000DX001-1CE1 CC45> at scbus10 target 4 lun 0 (pass10,da0)
<ATA ST4000NM0024-1HT SN05> at scbus10 target 7 lun 0 (pass11,da1)
<ATA ST4000NM0024-1HT SN05> at scbus10 target 9 lun 0 (pass12,da2)
<ATA ST4000NM0024-1HT SN02> at scbus10 target 10 lun 0 (pass13,da3)
<ATA ST4000NM0024-1HT SN06> at scbus10 target 11 lun 0 (pass14,da4)
<ATA ST4000DX001-1CE1 CC44> at scbus10 target 12 lun 0 (pass15,da5)
<ATA ST4000DX001-1CE1 CC45> at scbus10 target 13 lun 0 (pass16,da6)
<ATA ST4000DX001-1CE1 CC45> at scbus10 target 14 lun 0 (pass17,da7)
Notice: da4 has a different firmware revision, which is most probably why I see relatively fewer timeout errors on da1, da2, and da3, the other disks in the same ZFS pool.
To understand these error messages:
https://en.wikipedia.org/wiki/SCSI_command
https://en.wikipedia.org/wiki/SCSI_CDB
❯ cat /var/log/messages | grep "SCSI status"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition
https://en.wikipedia.org/wiki/SCSI_Status_Code#02h_Check_Condition
https://en.wikipedia.org/wiki/SCSI_contingent_allegiance_condition
❯ cat /var/log/messages | grep "SCSI sense"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
https://en.wikipedia.org/wiki/Key_Code_Qualifier
B 47 00 Aborted Command - SCSI parity error
https://www.t10.org/lists/asc-num.htm#ASC_47
47h/03h DZTPROMAEBKVF INFORMATION UNIT iuCRC ERROR DETECTED
CDB layout for READ(16)/WRITE(16) (see the decode sketch below):
- Byte 0-1 : other
- Byte 2-9 : 8 bytes of logical block address
- Byte 10-13 : 4 bytes of transfer length
- Byte 14-15 : other
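A minimal sketch of pulling those fields out of a logged CDB; the CDB string is the first WRITE(16) from my logs above:

#!/usr/local/bin/bash
# decode the LBA and transfer length from a READ(16)/WRITE(16) CDB as logged by CAM
CDB="8a 00 00 00 00 01 b4 c0 a1 d8 00 00 01 00 00 00"
set -- $CDB
echo "opcode: 0x$1"                 # 0x88 = READ(16), 0x8a = WRITE(16)
lba=$((16#$3$4$5$6$7$8$9${10}))     # bytes 2-9: logical block address
shift 10
len=$((16#$1$2$3$4))                # bytes 10-13: transfer length in blocks
echo "LBA: $lba  length: $len blocks"   # for this CDB: 0x100 = 256 blocks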
So in my logs we are writing 0x100 = 256 blocks at a time to consecutive LBAs and throwing errors (the transfer length in the CDB is counted in logical blocks, not bytes). The retries generally succeed and subsequent scrubs don't find any errors, though you may see CKSUM errors during a scrub, which I associate with the abort and controller reset events.
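After one of these abort/reset episodes you can confirm the retries were harmless and reset the counters; a sketch, where the pool name "tank" is hypothetical:

❯ zpool status -v tank
❯ zpool clear tank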
Summary:
The FreeBSD 12.x mps driver code is more efficient than 11.x, pushing more data to and from the disks, and that is what triggered these timeout events.
I also found through experimentation that reducing the memory clock speed in the BIOS is another workaround that stops these timeout errors on 12.x. That is essentially what my earlier workaround of using the 11.x mps driver with 12.x was doing: slowing down the data transfer rates so the disks could keep up.
Disclaimer:
These observations, analyses, and workarounds are specific to my setup; please adapt and incorporate accordingly.