9211 (LSI/SAS) issues on 11.2-STABLE

Wed Feb 6 15:34:50 UTC 2019

On 2/6/2019 09:18, Borja Marcos wrote:
>> On 5 Feb 2019, at 23:49, Karl Denninger <karl at denninger.net> wrote:
>>
>> BTW under 12.0-STABLE (built this afternoon after the advisories came
>> out, with the patches) it's MUCH worse.  I get the same device resets
>> BUT it's followed by an immediate panic which I cannot dump as it
>> generates a page-fault (supervisor read data, page not present) in the
>> mps *driver* at mpssas_send_abort+0x21.
>> This precludes a dump of course since attempting to do so gives you a
>> double-panic (I was wondering why I didn't get a crash dump!); I'll
>> re-jigger the box to stick a dump device on an internal SATA device so I
>> can successfully get the dump when it happens and see if I can obtain a
>> proper crash dump on this.
>>
>> I think it's fair to assume that 12.0-STABLE should not panic on a disk
>> problem (unless of course the problem is trying to page something back
>> in -- it's not, the drive that aborts and resets is on a data pack doing
>> a scrub)
> It shouldn’t panic I imagine.
>
>>>>> mps0: Sending reset from mpssas_send_abort for target ID 37
>
>>> 0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
>>> 0x06  0x008  4               6  ---  Number of Hardware Resets
>>> 0x06  0x010  4               0  ---  Number of ASR Events
>>> 0x06  0x018  4               0  ---  Number of Interface CRC Errors
>>>                                 |||_ C monitored condition met
>>>                                 ||__ D supports DSN
>>>                                 |___ N normalized value
>>>
>>> 0x06  0x008  4               7  ---  Number of Hardware Resets
>>> 0x06  0x010  4               0  ---  Number of ASR Events
>>> 0x06  0x018  4               0  ---  Number of Interface CRC Errors
>>>                                 |||_ C monitored condition met
>>>                                 ||__ D supports DSN
>>>                                 |___ N normalized value
>>>
>>> Number of Hardware Resets has incremented.  There are no other errors shown:
> What is _exactly_ that value? Is it related to the number of resets sent from the HBA
> _or_ the device resetting by itself?
Good question.  I'm not certain what counts as a "reset" here.  UNIT
ATTENTION is what the controller receives, but whether that reflects a
power reset, a reset *command* from the HBA or a firmware crash (yikes!)
in the disk I can't say.
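
For comparing two snapshots of the statistics rows quoted above, here is
a minimal Python sketch.  It assumes the column layout shown in the
quoted output (page, offset, size, value, flags, description); the
function names parse_stats and changed_counters are made up for
illustration and are not part of any tool.

```python
import re

# One data row of the statistics table, e.g.
# "0x06  0x008  4               6  ---  Number of Hardware Resets"
ROW = re.compile(
    r"^(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(.+)$"
)


def parse_stats(text):
    """Map (page, offset) -> (description, value) for each data row."""
    stats = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()  # tolerate mail quote markers
        m = ROW.match(line)
        if m is None:
            continue  # section headers, legend lines, blank lines
        page, offset, _size, value, _flags, desc = m.groups()
        stats[(page, offset)] = (desc, int(value))
    return stats


def changed_counters(before, after):
    """Return {description: (old, new)} for counters whose value differs."""
    old, new = parse_stats(before), parse_stats(after)
    changes = {}
    for key, (desc, value) in new.items():
        if key in old and old[key][1] != value:
            changes[desc] = (old[key][1], value)
    return changes
```

This only shows that a counter moved between snapshots; it says nothing
about which side initiated the reset.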
>>> I'd throw possible shade at the backplane or cable /but I have already
>>> swapped both out for spares without any change in behavior./
> What about the power supply? 
>
There are multiple other devices and the system board on that supply
(and thus on the same voltage rails), but it too has been swapped out
without change.  In fact, at this point /everything/ in the server case
other than the system board and RAM (which is ECC, and is showing no
errors in the system's BMC log) has been swapped for spares: HBA, SAS
expander, backplane, power supply and cables.  No change in behavior.

Note that with 20.0.7.0 firmware in the HBA, instead of a unit attention
I get a *controller* reset (!!!), which detaches some random number of
devices from ZFS's point of view before it comes back up (depending on
what's active at the time).  That is potentially catastrophic if it hits
the system pool, so I immediately went back to 19.0.0.0 firmware on the
HBA.  I had upgraded to 20.0.7.0 when I first saw this because there had
been good reports of stability with it, and I thought a drive change
might have caused issues when running 19.0 firmware on the card.

This system was completely stable for over a year on 11.1-STABLE and in
fact hadn't been rebooted or logged a single "event" in over six months;
the problems started immediately upon upgrade to 11.2-STABLE and
persist on 12.0-STABLE.  The disks in question haven't changed either
(so it can't be a difference in firmware in a more recently purchased
disk, for example).

I'm thinking perhaps *something* in the codebase change made the HBA ->
SAS expander combination troublesome where it wasn't before; I've got a
couple of 16i HBAs on the way which will allow me to remove the SAS
expander to see if that causes the problem to disappear.  I've got a
bunch of these Lenovo expanders and have been using them without any
sort of trouble in multiple machines; it's only when I went beyond 11.1
that I started having trouble with them.
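
To compare how often resets occur before and after a hardware change
such as removing the expander, the reset messages can be tallied per
target.  A minimal Python sketch, assuming the log lines look like the
"Sending reset from mpssas_send_abort" message quoted earlier in the
thread; count_resets is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches e.g. "mps0: Sending reset from mpssas_send_abort for target ID 37"
# anywhere in a line, so a leading timestamp or hostname is tolerated.
RESET = re.compile(
    r"(mps\d+): Sending reset from mpssas_send_abort for target ID (\d+)"
)


def count_resets(log_text):
    """Return a Counter keyed by (controller, target ID)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = RESET.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts
```

Feeding it the saved message log from before and after the change gives
a per-target count that can be compared directly.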

-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/