9211 (LSI/SAS) issues on 11.2-STABLE

Tue Feb 5 22:50:21 UTC 2019

BTW under 12.0-STABLE (built this afternoon after the advisories came
out, with the patches) it's MUCH worse.  I get the same device resets
BUT it's followed by an immediate panic which I cannot dump as it
generates a page-fault (supervisor read data, page not present) in the
mps *driver* at mpssas_send_abort+0x21.

This precludes a dump of course since attempting to do so gives you a
double-panic (I was wondering why I didn't get a crash dump!); I'll
re-jigger the box to stick a dump device on an internal SATA device so I
can successfully get the dump when it happens and see if I can obtain a
proper crash dump on this.

I think it's fair to assume that 12.0-STABLE should not panic on a disk
problem (unless of course the problem is trying to page something back
in -- it's not, the drive that aborts and resets is on a data pack doing
a scrub)

On 2/5/2019 10:26, Karl Denninger wrote:
> On 2/5/2019 09:22, Karl Denninger wrote:
>> On 2/2/2019 12:02, Karl Denninger wrote:
>>> I recently started having some really oddball things  happening under
>>> stress.  This coincided with the machine being updated to 11.2-STABLE
>>> (FreeBSD 11.2-STABLE #1 r342918:) from 11.1.
>>>
>>> Specifically, I get "errors" like this:
>>>
>>>         (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>>> length 131072 SMID 269 Aborting command 0xfffffe0001179110
>>> mps0: Sending reset from mpssas_send_abort for target ID 37
>>>         (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>>> length 131072 SMID 924 terminated ioc 804b loginfo 31140000 scsi 0 state
>>> c xfer 0
>>>         (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>>> length 131072 SMID 161 terminated ioc 804b loginfo 31140000 scsi 0 state
>>> c xfer 0
>>> mps0: Unfreezing devq for target ID 37
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>>> (da12:mps0:0:37:0): Retrying command
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: Command timeout
>>> (da12:mps0:0:37:0): Retrying command
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: CCB request completed with an error
>>> (da12:mps0:0:37:0): Retrying command
>>> (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
>>> (da12:mps0:0:37:0): CAM status: SCSI Status Error
>>> (da12:mps0:0:37:0): SCSI status: Check Condition
>>> (da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
>>> reset, or bus device reset occurred)
>>> (da12:mps0:0:37:0): Retrying command (per sense data)
>>>
>>> The "Unit Attention" implies the drive reset.  It only occurs on certain
>>> drives under very heavy load (e.g. a scrub.)  I've managed to provoke it
>>> on two different brands of disk across multiple firmware and capacities,
>>> however, which tends to point away from a drive firmware problem.
>>>
>>> A look at the pool data shows /no /errors (e.g. no checksum problems,
>>> etc) and a look at the disk itself (using smartctl) shows no problems
>>> either -- whatever is going on here the adapter is recovering from it
>>> without any data corruption or loss registered on *either end*!
>>>
>>> The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:
>>>
>>> mps0: <Avago Technologies (LSI) SAS2008> port 0xc000-0xc0ff mem
>>> 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
>>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities:
>>> 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
>> After considerable additional work this looks increasingly like either a
>> missed interrupt or a command is getting lost between the host adapter
>> and the expander.
>>
>> I'm going to turn the driver debug level up and see if I can capture
>> more information..... whatever is behind this, however, it is
>> almost-certainly related to something that changed between 11.1 and
>> 11.2, as I never saw these on the 11.1-STABLE build.
>>
>> --
>> Karl Denninger
>> karl at denninger.net <mailto:karl at denninger.net>
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/
> Pretty decent trace here -- any ideas?
>
> mps0: timedout cm 0xfffffe00011b5020 allocated tm 0xfffffe00011812a0
>         (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
> length 131072 SMID 634 Aborting command 0xfffffe00011b5020
> mps0: Sending reset from mpssas_send_abort for target ID 37
> mps0: queued timedout cm 0xfffffe00011c2760 for processing by tm
> 0xfffffe00011812a0
> mps0: queued timedout cm 0xfffffe00011a74f0 for processing by tm
> 0xfffffe00011812a0
> mps0: queued timedout cm 0xfffffe00011cfd50 for processing by tm
> 0xfffffe00011812a0
> mps0: EventReply        :
>         EventDataLength: 2
>         AckRequired: 0
>         Event: SasDiscovery (0x16)
>         EventContext: 0x0
>         Flags: 1<InProgress>
>         ReasonCode: Discovery Started
>         PhysicalPort: 0
>         DiscoveryStatus: 0
> mps0: (0)->(mpssas_fw_work) Working on  Event: [16]
> mps0: (1)->(mpssas_fw_work) Event Free: [16]
>         (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00
> length 131072 SMID 961 completed timedout cm 0xfffffe00011cfd50 ccb
> 0xfffff8019458e000 during recovery ioc 804b scsi 0 state c
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00 01 00 00 length
> 131072 SMID 961 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
>         (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
> length 131072 SMID 634 completed timedout cm
> 0xfffffe00011b5(da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3c 80 00
> 01 00 00
> 020 ccb 0xfffff80056fb5000 during recovery ioc 8048 scsi 0 state
> c(da11:mps0:0:37:0): CAM status: CCB request completed with an error
> (da11:mps0:0:37:0): Retrying command
>         (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00
> length 131072 SMID 798 completed timedout cm 0xfffffe00011c2760 ccb
> 0xfffff80054e86000 during recovery ioc 804b scsi 0 state c
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00 length
> 131072 SMID 798 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
>         (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
> length 131072 SMID 467 completed timedout cm 0xfffffe00011a74f0 ccb
> 0xfffff8011efe3000 during recovery ioc 804b scsi 0 state c
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00 length
> 131072 SMID 467 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
>         (xpt0:mps0:0:37:0): SMID 2 abort TaskMID 634 status 0x0 code 0x0
> count 4
>         (xpt0:mps0:0:37:0): SMID 2 finished recovery after aborting
> TaskMID 634
> mps0: Unfreezing devq for target ID 37
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3b 80 00 01 00 00
> (da11:mps0:0:37:0): CAM status: Command timeout
> (da11:mps0:0:37:0): Retrying command
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 3a 80 00 01 00 00
> (da11:mps0:0:37:0): CAM status: CCB request completed with an error
> (da11:mps0:0:37:0): Retrying command
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
> (da11:mps0:0:37:0): CAM status: CCB request completed with an error
> (da11:mps0:0:37:0): Retrying command
> mps0: EventReply        :
>         EventDataLength: 4
>         AckRequired: 0
>         Event: SasTopologyChangeList (0x1c)
>         EventContext: 0x0
>         EnclosureHandle: 0x2
>         ExpanderDevHandle: 0x9
>         NumPhys: 25
>         NumEntries: 1
>         StartPhyNum: 12
>         ExpStatus: Responding (0x3)
>         PhysicalPort: 0
>         PHY[12].AttachedDevHandle: 0x0013
>         PHY[12].LinkRate: 6.0Gbps (0xaa)
>         PHY[12].PhyStatus: PHYLinkStatusChange
> mps0: (1)->(mpssas_fw_work) Working on  Event: [1c]
> mps0: (2)->(mpssas_fw_work) Event Free: [1c]
> mps0: EventReply        :
>         EventDataLength: 2
>         AckRequired: 0
>         Event: SasDiscovery (0x16)
>         EventContext: 0x0
>         Flags: 0
>         ReasonCode: Discovery Complete
>         PhysicalPort: 0
>         DiscoveryStatus: 0
> mps0: (2)->(mpssas_fw_work) Working on  Event: [16]
> mps0: (3)->(mpssas_fw_work) Event Free: [16]
> (da11:mps0:0:37:0): READ(10). CDB: 28 00 82 b5 39 80 00 01 00 00
> (da11:mps0:0:37:0): CAM status: SCSI Status Error
> (da11:mps0:0:37:0): SCSI status: Check Condition
> (da11:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on,
> reset, or bus device reset occurred)
> (da11:mps0:0:37:0): Retrying command (per sense data)
>
> This is what's running right now that produced that -- note no errors
> reported on the two providers (the degraded state is intentional; the
> third unit is off-site in a bank vault):
>
> [karl at NewFS /var/log]$ zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>   scan: scrub in progress since Tue Feb  5 06:57:14 2019
>         840G scanned at 200M/s, 517G issued at 123M/s, 3.36T total
>         0 repaired, 15.04% done, 0 days 06:44:17 to go
> config:
>
>         NAME                      STATE     READ WRITE CKSUM
>         backup                    DEGRADED     0     0     0
>           mirror-0                DEGRADED     0     0     0
>             gpt/backup61.eli      ONLINE       0     0     0
>             gpt/backup62-1.eli    ONLINE       0     0     0
>             13282812295755460479  OFFLINE      0     0     0  was
> /dev/gpt/backup62-2.eli
>
> errors: No known data errors
>
> But, here's the SMART data on the disk -- it DOES show the reset:
>
> Before....
>
> Device Statistics (GP Log 0x04)
> Page  Offset Size        Value Flags Description
> 0x01  =====  =               =  ===  == General Statistics (rev 2) ==
> 0x01  0x008  4              19  ---  Lifetime Power-On Resets
> 0x01  0x018  6     32002138534  ---  Logical Sectors Written
> 0x01  0x020  6       155895325  ---  Number of Write Commands
> 0x01  0x028  6    265457016500  ---  Logical Sectors Read
> 0x01  0x030  6      1132601490  ---  Number of Read Commands
> 0x01  0x038  6     61537515300  ---  Date and Time TimeStamp
> 0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
> 0x03  0x008  4            6114  ---  Spindle Motor Power-on Hours
> 0x03  0x010  4            6114  ---  Head Flying Hours
> 0x03  0x018  4           30602  ---  Head Load Events
> 0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
> 0x03  0x028  4          283735  ---  Read Recovery Attempts
> 0x03  0x030  4               0  ---  Number of Mechanical Start Failures
> 0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
> 0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
> 0x04  0x010  4              17  ---  Resets Between Cmd Acceptance and
> Completio
> n
> 0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
> 0x05  0x008  1              38  ---  Current Temperature
> 0x05  0x010  1              40  N--  Average Short Term Temperature
> 0x05  0x018  1              32  N--  Average Long Term Temperature
> 0x05  0x020  1              52  ---  Highest Temperature
> 0x05  0x028  1              23  ---  Lowest Temperature
> 0x05  0x030  1              48  N--  Highest Average Short Term Temperature
> 0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
> 0x05  0x040  1              44  N--  Highest Average Long Term Temperature
> 0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
> 0x05  0x050  4               0  ---  Time in Over-Temperature
> 0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
> 0x05  0x060  4               0  ---  Time in Under-Temperature
> 0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
> 0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
> 0x06  0x008  4               6  ---  Number of Hardware Resets
> 0x06  0x010  4               0  ---  Number of ASR Events
> 0x06  0x018  4               0  ---  Number of Interface CRC Errors
>                                 |||_ C monitored condition met
>                                 ||__ D supports DSN
>                                 |___ N normalized value
>
> And after....
>
> Device Statistics (GP Log 0x04)
> Page  Offset Size        Value Flags Description
> 0x01  =====  =               =  ===  == General Statistics (rev 2) ==
> 0x01  0x008  4              19  ---  Lifetime Power-On Resets
> 0x01  0x018  6     32004195070  ---  Logical Sectors Written
> 0x01  0x020  6       155946732  ---  Number of Write Commands
> 0x01  0x028  6    266340052468  ---  Logical Sectors Read
> 0x01  0x030  6      1136347463  ---  Number of Read Commands
> 0x01  0x038  6     61540814700  ---  Date and Time TimeStamp
> 0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
> 0x03  0x008  4            6114  ---  Spindle Motor Power-on Hours
> 0x03  0x010  4            6114  ---  Head Flying Hours
> 0x03  0x018  4           30602  ---  Head Load Events
> 0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
> 0x03  0x028  4          283848  ---  Read Recovery Attempts
> 0x03  0x030  4               0  ---  Number of Mechanical Start Failures
> 0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
> 0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
> 0x04  0x010  4              17  ---  Resets Between Cmd Acceptance and
> Completion
> 0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
> 0x05  0x008  1              41  ---  Current Temperature
> 0x05  0x010  1              40  N--  Average Short Term Temperature
> 0x05  0x018  1              32  N--  Average Long Term Temperature
> 0x05  0x020  1              52  ---  Highest Temperature
> 0x05  0x028  1              23  ---  Lowest Temperature
> 0x05  0x030  1              48  N--  Highest Average Short Term Temperature
> 0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
> 0x05  0x040  1              44  N--  Highest Average Long Term Temperature
> 0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
> 0x05  0x050  4               0  ---  Time in Over-Temperature
> 0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
> 0x05  0x060  4               0  ---  Time in Under-Temperature
> 0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
> 0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
> 0x06  0x008  4               7  ---  Number of Hardware Resets
> 0x06  0x010  4               0  ---  Number of ASR Events
> 0x06  0x018  4               0  ---  Number of Interface CRC Errors
>                                 |||_ C monitored condition met
>                                 ||__ D supports DSN
>                                 |___ N normalized value
>
> Number of Hardware Resets has incremented.  There are no other errors shown:
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
>   2 Throughput_Performance  P-S---   137   137   054    -    104
>   3 Spin_Up_Time            POS---   132   132   024    -    497
> (Average 497)
>   4 Start_Stop_Count        -O--C-   099   099   000    -    7074
>   5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
>   7 Seek_Error_Rate         PO-R--   100   100   067    -    0
>   8 Seek_Time_Performance   P-S---   128   128   020    -    18
>   9 Power_On_Hours          -O--C-   098   098   000    -    17094
>  10 Spin_Retry_Count        PO--C-   100   100   060    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   000    -    19
> 192 Power-Off_Retract_Count -O--CK   075   075   000    -    30602
> 193 Load_Cycle_Count        -O--C-   075   075   000    -    30602
> 194 Temperature_Celsius     -O----   146   146   000    -    41 (Min/Max
> 23/52)
> 196 Reallocated_Event_Count -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O---K   100   100   000    -    0
> 198 Offline_Uncorrectable   ---R--   100   100   000    -    0
> 199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
>
> I'd throw possible shade at the backplane or cable /but I have already
> swapped both out for spares without any change in behavior./
>
-- 
Karl Denninger
karl at denninger.net <mailto:karl at denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4897 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20190205/91f2e448/attachment-0001.bin>