[Bug 224496] mpr and mps drivers seems to have issues with large seagate drives
bugzilla-noreply at freebsd.org
Wed May 6 20:21:59 UTC 2020
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224496
--- Comment #41 from Sharad Ahlawat <freebsd at diyit.org> ---
(In reply to Sharad Ahlawat from comment #28)
Symptom:
CAM retry and timeout messages leading to controller aborts and resets
Cause:
slow disks, or the use of SMR disks
Workaround:
Increase the CAM timeout defaults
❯ sysctl kern.cam.da.default_timeout=90
kern.cam.da.default_timeout: 60 -> 90
❯ sysctl kern.cam.ada.default_timeout=60
kern.cam.ada.default_timeout: 30 -> 60
And disable NCQ on SMR Seagates:
❯ cat cam-tags.sh
#!/usr/local/bin/bash
# shrink the Native Command Queue (NCQ) depth down to 1, effectively disabling queuing
for Disk in `camcontrol devlist | grep "ST8000DM" | cut -d"," -f2 | cut -d")" -f1`;
do
    camcontrol tags $Disk -N 1 ;
    # camcontrol tags $Disk -v
done
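Tag depths set with camcontrol do not persist across reboots, so the script needs to be re-run at boot. A minimal sketch, assuming the script was saved as /root/cam-tags.sh and made executable:

# /etc/rc.local - re-apply the NCQ workaround for the SMR Seagates at boot
# (sketch; the /root path is an assumption about where the script lives)
/root/cam-tags.sh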
If you only have SMR disks in your setup and use a UPS, you could also:
❯ sysctl vfs.zfs.cache_flush_disable=1
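None of these sysctl changes survive a reboot on their own; a minimal sketch of persisting them in /etc/sysctl.conf (include the last line only if the SMR-only/UPS caveat above applies to you):

# /etc/sysctl.conf additions
kern.cam.da.default_timeout=90
kern.cam.ada.default_timeout=60
vfs.zfs.cache_flush_disable=1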
Solution:
don't use slow disks or SMR disks with ZFS
The long version:
I feel obliged to post this update, since the driver downgrade workaround I previously posted on this thread predates my getting to the root cause of these entries in my messages file after upgrading to 12.x:
Jan 18 17:29:28 nas kernel: ahcich6: Timeout on slot 8 port 0
Jan 18 17:29:28 nas kernel: ahcich6: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000 cmd 0000c817
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeout
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 0 more tries remain
Jan 18 17:30:00 nas kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jan 18 17:30:30 nas kernel: ahcich6: Timeout on slot 9 port 0
Jan 18 17:30:30 nas kernel: ahcich6: is 00000000 cs 00000200 ss 00000000 rs 00000200 tfd 80 serr 00000000 cmd 0000c917
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): Retrying command, 0 more tries remain
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1039 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1357 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1933 loginfo 31080000
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 37 c8 00 00 00 b0 00 00
Apr 25 22:28:12 nas kernel: mps0: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 38 78 00 00 00 58 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a1 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: Controller reported scsi ioc terminated tgt 11 SMID 621 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 476 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 321 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1873 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1852 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1742 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 387 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 2104 loginfo 31080000
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a2 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a3 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a4 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a5 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a6 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a7 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a8 d8 00 00 00 c0 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 37 c0 00 00 00 08 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 36 c0 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: SCSI Status Error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command (per sense data)
Apr 29 14:07:38 nas kernel: ahcich6: Timeout on slot 22 port 0
Apr 29 14:07:38 nas kernel: ahcich6: is 00000000 cs 00000000 ss 00400000 rs 00400000 tfd 40 serr 00000000 cmd 0000d617
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 80 44 e8 40 c9 01 00 00 00 00
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeout
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 3 more tries remain
Apr 29 14:08:09 nas kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
❯ smartctl -a /dev/da4
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   178   000    Old_age   Always       -       419
Note: every ABORTED COMMAND from the controller increments the ID# 199 raw count.
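A quick way to watch that counter across the affected disks, as a sketch (the device list here matches my setup; adjust for yours):

#!/usr/local/bin/bash
# print the raw UDMA_CRC_Error_Count (ID# 199) for each pool disk
for d in da1 da2 da3 da4; do
    printf '%s: ' "$d"
    smartctl -A /dev/$d | awk '$1 == 199 { print $NF }'
done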
The messages correlate with sustained disk I/O events:
scrubs
large file transfers
cron jobs kicking off in multiple jails that access the same pool, etc.
I ruled out any issues with:
loose cables
cable quality
controller hardware, BIOS, and firmware
the power supply
❯ mpsutil show all
Adapter:
mps0 Adapter:
Board Name: SAS9207-4i4e
Board Assembly: H3-25434-00H
Chip Name: LSISAS2308
Chip Revision: ALL
BIOS Revision: 7.39.02.00
Firmware Revision: 20.00.07.00
Integrated RAID: no
PhyNum CtlrHandle DevHandle Disabled Speed Min Max Device
0 0001 0009 N 6.0 1.5 6.0 SAS Initiator
1 0002 000a N 6.0 1.5 6.0 SAS Initiator
2 0003 000b N 6.0 1.5 6.0 SAS Initiator
3 0004 000c N 6.0 1.5 6.0 SAS Initiator
4 0005 000d N 6.0 1.5 6.0 SAS Initiator
5 0006 000e N 6.0 1.5 6.0 SAS Initiator
6 0007 000f N 6.0 1.5 6.0 SAS Initiator
7 0008 0010 N 6.0 1.5 6.0 SAS Initiator
Devices:
B____T SAS Address Handle Parent Device Speed Enc Slot Wdt
00 10 4433221100000000 0009 0001 SATA Target 6.0 0001 00 1
00 09 4433221101000000 000a 0002 SATA Target 6.0 0001 01 1
00 11 4433221102000000 000b 0003 SATA Target 6.0 0001 02 1
00 07 4433221103000000 000c 0004 SATA Target 6.0 0001 03 1
00 13 4433221104000000 000d 0005 SATA Target 6.0 0001 07 1
00 04 4433221105000000 000e 0006 SATA Target 6.0 0001 06 1
00 14 4433221106000000 000f 0007 SATA Target 6.0 0001 05 1
00 12 4433221107000000 0010 0008 SATA Target 6.0 0001 04 1
Enclosures:
Slots Logical ID SEPHandle EncHandle Type
08 500605b0073400a0 0001 Direct Attached SGPIO
Expanders:
NumPhys SAS Address DevHandle Parent EncHandle SAS Level
❯ camcontrol devlist
<Samsung SSD 850 PRO 256GB EXM02B6Q> at scbus0 target 0 lun 0 (ada0,pass0)
<Samsung SSD 850 PRO 256GB EXM02B6Q> at scbus1 target 0 lun 0 (ada1,pass1)
<Samsung SSD 860 EVO 500GB RVT03B6Q> at scbus2 target 0 lun 0 (ada2,pass2)
<Samsung SSD 860 EVO 500GB RVT03B6Q> at scbus3 target 0 lun 0 (ada3,pass3)
<ST8000DM004-2CX188 0001> at scbus4 target 0 lun 0 (ada4,pass4)
<ST8000DM004-2CX188 0001> at scbus5 target 0 lun 0 (ada5,pass5)
<ST8000DM004-2CX188 0001> at scbus6 target 0 lun 0 (ada6,pass6)
<ST8000DM004-2CX188 0001> at scbus7 target 0 lun 0 (ada7,pass7)
<WDC WD20EFRX-68EUZN0 80.00A80> at scbus8 target 0 lun 0 (ada8,pass8)
<WDC WD20EFRX-68EUZN0 80.00A80> at scbus9 target 0 lun 0 (ada9,pass9)
<ATA ST4000DX001-1CE1 CC45> at scbus10 target 4 lun 0 (pass10,da0)
<ATA ST4000NM0024-1HT SN05> at scbus10 target 7 lun 0 (pass11,da1)
<ATA ST4000NM0024-1HT SN05> at scbus10 target 9 lun 0 (pass12,da2)
<ATA ST4000NM0024-1HT SN02> at scbus10 target 10 lun 0 (pass13,da3)
<ATA ST4000NM0024-1HT SN06> at scbus10 target 11 lun 0 (pass14,da4)
<ATA ST4000DX001-1CE1 CC44> at scbus10 target 12 lun 0 (pass15,da5)
<ATA ST4000DX001-1CE1 CC45> at scbus10 target 13 lun 0 (pass16,da6)
<ATA ST4000DX001-1CE1 CC45> at scbus10 target 14 lun 0 (pass17,da7)
Notice: da4 has a different firmware revision, which is most probably why I see relatively fewer timeout errors on da1, da2, and da3, the other disks in the same ZFS pool.
To understand these error messages:
https://en.wikipedia.org/wiki/SCSI_command
https://en.wikipedia.org/wiki/SCSI_CDB
❯ cat /var/log/messages | grep "SCSI status"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition
https://en.wikipedia.org/wiki/SCSI_Status_Code#02h_Check_Condition
https://en.wikipedia.org/wiki/SCSI_contingent_allegiance_condition
❯ cat /var/log/messages | grep "SCSI sense"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
https://en.wikipedia.org/wiki/Key_Code_Qualifier
B 47 00 Aborted Command - SCSI parity error
https://www.t10.org/lists/asc-num.htm#ASC_47
47h/03h DZTPROMAEBKVF INFORMATION UNIT iuCRC ERROR DETECTED
CDB layout for READ(16)/WRITE(16) (see the decode sketch below):
- Byte 0-1 : other
- Byte 2-9 : 8 bytes of logical block address
- Byte 10-13 : 4 bytes of transfer length
- Byte 14-15 : other
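A minimal sketch of pulling those fields out of a logged CDB; the CDB string is the first WRITE(16) from my logs above:

#!/usr/local/bin/bash
# decode the LBA and transfer length from a READ(16)/WRITE(16) CDB as logged by CAM
CDB="8a 00 00 00 00 01 b4 c0 a1 d8 00 00 01 00 00 00"
set -- $CDB
echo "opcode: 0x$1"                 # 0x88 = READ(16), 0x8a = WRITE(16)
lba=$((16#$3$4$5$6$7$8$9${10}))     # bytes 2-9: logical block address
shift 10
len=$((16#$1$2$3$4))                # bytes 10-13: transfer length in blocks
echo "LBA: $lba  length: $len blocks"   # for this CDB: 0x100 = 256 blocks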
So in my logs we are writing 0x100 = 256 blocks at a time to consecutive LBAs and throwing errors (the transfer length in the CDB is counted in logical blocks, not bytes). The retries generally succeed and subsequent scrubs don't find any errors, though you may see CKSUM errors during a scrub, which I associate with the abort and controller reset events.
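After one of these abort/reset episodes you can confirm the retries were harmless and reset the counters; a sketch, where the pool name "tank" is hypothetical:

❯ zpool status -v tank
❯ zpool clear tank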
Summary:
The FreeBSD 12.x mps driver code is more efficient than 11.x, pushing more data to and from the disks, and that is what triggered these timeout events.
I also found through experimentation that reducing the memory clock speed in the BIOS is another workaround that stops these timeout errors on 12.x. That is essentially what my earlier workaround of using the 11.x mps driver with 12.x was doing: slowing down the data transfer rates so the disks could keep up.
Disclaimer:
These observations, analyses, and workarounds are specific to my setup; please adapt and incorporate accordingly.