9211 (LSI/SAS) issues on 11.2-STABLE

Thu Feb 7 11:51:17 UTC 2019

> On 6 Feb 2019, at 16:34, Karl Denninger <karl at denninger.net> wrote:
> 
> On 2/6/2019 09:18, Borja Marcos wrote:
>>>> Number of Hardware Resets has incremented.  There are no other errors shown:
>> What is _exactly_ that value? Is it related to the number of resets sent from the HBA
>> _or_ the device resetting by itself?
> Good question.  What counts as a "reset"; UNIT ATTENTION is what the
> controller receives but whether that's a power reset, a reset *command*
> from the HBA or a firmware crash (yikes!) in the disk I'm not certain.

In my youth I wrote software for tape drives. If I remember correctly (this was
25 years ago), after a reset the device will give you a UNIT ATTENTION no matter
how the reset was initiated (by the device itself or by the HBA).
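
The sense data that comes with the UNIT ATTENTION can sometimes narrow down the
cause. A minimal sketch in Python, assuming the raw sense bytes are available
and follow the fixed or descriptor layout from the SCSI Primary Commands (SPC)
standard. The function names and the sample bytes are made up for illustration,
and many devices only ever report the generic 0x29/0x00 code.

```python
# Sense key and ASC/ASCQ values as assigned in the SCSI Primary Commands standard.
UNIT_ATTENTION = 0x6

RESET_CAUSES = {
    (0x29, 0x00): "power on, reset, or bus device reset occurred",
    (0x29, 0x01): "power on occurred",
    (0x29, 0x02): "SCSI bus reset occurred",
    (0x29, 0x03): "bus device reset function occurred",
    (0x29, 0x04): "device internal reset",
}


def decode_sense(sense: bytes):
    """Return (sense_key, asc, ascq) from fixed or descriptor format sense data."""
    if not sense:
        raise ValueError("empty sense buffer")
    response_code = sense[0] & 0x7F
    if response_code in (0x70, 0x71):
        # Fixed format: key in byte 2 (low nibble), ASC in byte 12, ASCQ in byte 13.
        if len(sense) < 14:
            raise ValueError("fixed format sense data too short")
        return sense[2] & 0x0F, sense[12], sense[13]
    if response_code in (0x72, 0x73):
        # Descriptor format: key in byte 1 (low nibble), ASC in byte 2, ASCQ in byte 3.
        if len(sense) < 4:
            raise ValueError("descriptor format sense data too short")
        return sense[1] & 0x0F, sense[2], sense[3]
    raise ValueError("unknown sense response code 0x%02x" % response_code)


def reset_cause(sense: bytes):
    """Describe the reset behind a UNIT ATTENTION, or return None for other keys."""
    key, asc, ascq = decode_sense(sense)
    if key != UNIT_ATTENTION:
        return None
    return RESET_CAUSES.get(
        (asc, ascq), "unit attention, asc=0x%02x ascq=0x%02x" % (asc, ascq)
    )


# Hypothetical sample: fixed format, UNIT ATTENTION, ASC/ASCQ 0x29/0x04.
sample = bytes([0x70, 0, 0x06, 0, 0, 0, 0, 0x0A, 0, 0, 0, 0, 0x29, 0x04, 0, 0, 0, 0])
print(reset_cause(sample))
```

If the disk reports 0x29/0x04 it reset on its own; 0x29/0x02 or 0x29/0x03 would
point at something upstream (HBA or expander) asking for the reset.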

>>>> I'd throw possible shade at the backplane or cable /but I have already
>>>> swapped both out for spares without any change in behavior./
>> What about the power supply? 
>> 
> There are multiple other devices and the system board on that supply
> (and thus voltage rails) but it too has been swapped out without
> change.  In fact at this point other than the system board and RAM
> (which is ECC, and is showing no errors in the system's BMC log)
> /everything /in the server case (HBA, SATA expander, backplane, power
> supply and cables) has been swapped for spares.  No change in behavior.
> 
> Note that with 20.0.7.0 firmware in the HBA instead of a unit attention
> I get a *controller* reset (!!!) which detaches some random number of
> devices from ZFS's point of view before it comes back up (depending on
> what's active at the time) which is potentially catastrophic if it hits
> the system pool.  I immediately went back to 19.0.0.0 firmware on the
> HBA; I had upgraded to 20.0.7.0 since there had been good reports of
> stability with it when I first saw this, thinking there was a drive
> change that might have resulted in issues with it when running 19.0
> firmware on the card.

I have a system running 12.0-RELEASE-p1 with an LSI2008, 15 SAS disks and a SATA SSD
and I haven’t seen any problems. The machine is heavily loaded, with just 8 GB of
memory and a lot of stuff running.

mps0: <Avago Technologies (LSI) SAS2008> port 0x9000-0x90ff mem 0xdfff0000-0xdfffffff,0xdff80000-0xdffbffff irq 17 at device 0.0 numa-domain 0 on pci4
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>

ses0:
	Enclosure Name: LSILOGIC SASX28 A.0 5021
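
In case it helps when comparing firmware and driver levels across machines, here
is a rough Python sketch that pulls them out of dmesg text. It assumes the mps
"Firmware: ..., Driver: ..." line looks like the one above; the function name is
made up, and other driver versions may format the line differently.

```python
import re

# Matches lines such as:
#   mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
MPS_VERSION_LINE = re.compile(r"^(mps\d+): Firmware: ([\d.]+), Driver: (\S+)$")


def mps_versions(dmesg_text: str):
    """Map each mps controller name to its (firmware, driver) version strings."""
    versions = {}
    for line in dmesg_text.splitlines():
        match = MPS_VERSION_LINE.match(line.strip())
        if match:
            versions[match.group(1)] = (match.group(2), match.group(3))
    return versions


sample = """\
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
"""
print(mps_versions(sample))
```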

> This system was completely stable for over a year on 11.1-STABLE and in
> fact hadn't been rebooted or logged a single "event" in over six months;
> the problems started immediately upon upgrade to 11.2-STABLE and
> persists on 12.0-STABLE.  The disks in question haven't changed either
> (so it can't be a difference in firmware that is in a newer purchased
> disk, for example.)

But you are right, a panic because of a disk problem points to a bug. As long as the ZFS
pool is usable, trouble with one of its disks should just be logged, unless of course
the disk is used for swap or the disk failure leaves the system unable to complete a
page-in. Again, it shouldn’t happen.

> I'm thinking perhaps *something* in the codebase change made the HBA ->
> SAS Expander combination trouble where it wasn't before; I've got a
> couple of 16i HBAs on the way which will allow me to remove the SAS
> expander to see if that causes the problem to disappear.  I've got a
> bunch of these Lenovo expanders and have been using them without any
> sort of trouble in multiple machines; it's only when I went beyond 11.1
> that I started having trouble with them.

It might be some backplane misbehavior triggering a bug, which would be complicated
to track down.

Borja.