[Bug 257670] RAS CONTROLLER: Fatal unrecoverable error detected with SAS3008

From: <bugzilla-noreply_at_freebsd.org>
Date: Sat, 07 Aug 2021 07:21:34 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=257670

            Bug ID: 257670
           Summary: RAS CONTROLLER: Fatal unrecoverable error detected
                    with SAS3008
           Product: Base System
           Version: CURRENT
          Hardware: arm64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: arm
          Assignee: freebsd-arm@FreeBSD.org
          Reporter: daniel@morante.net
 Attachment #227004 text/plain
         mime type:

Created attachment 227004
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=227004&action=edit
capture of boot via serial

I am testing FreeBSD-14.0-CURRENT-arm64-aarch64-20210805-f3a3b061216-248478 on
a Cavium ThunderX2 (Gigabyte R281-T91).  This system has an onboard SAS3008
PCI-Express Fusion-MPT SAS-3 controller.  

```
mpr0@pci0:14:0:0:       class=0x010700 rev=0x02 hdr=0x00 vendor=0x1000 device=0x0097 subvendor=0x1458 subdevice=0x3008
    vendor     = 'Broadcom / LSI'
    device     = 'SAS3008 PCI-Express Fusion-MPT SAS-3'
    class      = mass storage
    subclass   = SAS
```
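
As a reading aid, the `class=0x010700` field above can be split into its three
bytes.  This is a minimal sketch assuming the standard PCI class-code layout
(base class, subclass, programming interface, high byte to low byte); the
function name `split_class_code` is made up for illustration.

```python
def split_class_code(class_code):
    """Split a 24-bit PCI class code into (base class, subclass, prog-if)."""
    base = (class_code >> 16) & 0xFF    # 0x01 = mass storage
    sub = (class_code >> 8) & 0xFF      # 0x07 = SAS within mass storage
    prog_if = class_code & 0xFF         # programming interface byte
    return base, sub, prog_if


base, sub, prog_if = split_class_code(0x010700)
print(f"base={base:#04x} subclass={sub:#04x} prog_if={prog_if:#04x}")
```

The base class and subclass bytes correspond to the `mass storage` and `SAS`
lines that pciconf prints for this device.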

I load the `mpr` driver by setting `mpr_load="YES"` in `/boot/loader.conf`.  So
far so good, apart from some odd messages in dmesg (see attachment).

There are currently 8 HDDs attached to it, and I set up 3 ZFS pools.  This goes
well until I start to put some load on them.  The kernel then panics and the
system halts with the following in dmesg:

```
mpr0: IOC Fault 0x4000265d, Resetting
mpr0: Reinitializing controller
...
RAS CONTROLLER: Fatal unrecoverable error detected
```
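
For anyone triaging, the fault word in the log can be split into an IOC state
and a fault code.  This is only a sketch: the mask and state values below are
assumptions based on the MPI 2.x doorbell register layout (state in the top
nibble, fault code in the low 16 bits), not something stated in this report,
and `decode_doorbell` is a made-up helper name.  The meaning of the fault code
itself is firmware-specific and is not decoded here.

```python
# Assumed MPI 2.x doorbell layout (not taken from this report).
IOC_STATE_MASK = 0xF0000000
FAULT_CODE_MASK = 0x0000FFFF
IOC_STATES = {
    0x00000000: "RESET",
    0x10000000: "READY",
    0x20000000: "OPERATIONAL",
    0x40000000: "FAULT",
}


def decode_doorbell(value):
    """Return (state name, fault code) for a raw doorbell value."""
    state = IOC_STATES.get(value & IOC_STATE_MASK, "UNKNOWN")
    return state, value & FAULT_CODE_MASK


state, code = decode_doorbell(0x4000265D)
print(f"state={state} fault_code={code:#06x}")
```

Under those assumptions, `0x4000265d` reads as the FAULT state with fault code
`0x265d`.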

This is not to say the problem is with ZFS.  I suspect the mpr driver is just
unstable.

The system can no longer boot into multi-user mode.  The kernel panics with the
same error as soon as it tries to start ZFS.

```
mountroot: waiting for device /dev/nda0p2...
WARNING: / was not properly dismounted
Dual Console: Video Primary, Serial Secondary
witness_lock_list_get: witness exhausted
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
RAS CONTROLLER: Fatal unrecoverable error detected

        *** NBU Error ***
...
```

To get a functional system, I disable ZFS in `/etc/rc.conf` while in
single-user mode.

Back in multi-user mode, I can run `service zfs onestart` and try to import
one of the pools.  The kernel then panics again.

I detail the full specs of this system in bug #254651 (where I have a problem
with the onboard SATA controllers) and in my forum post at
https://forums.freebsd.org/threads/aarch64-trouble-with-cn99xx-ahci-and-fastlinq-ql41000-controllers.79556/
(where I explain the lack of a driver for the onboard Ethernet).

Also, for some reason I can no longer boot 13.0-RELEASE on this system.
It panics with "panic: NVME polled command failed to complete within 10s".  I
think it doesn't like the add-on PCIe NVMe device.  However, when it was
working (prior to adding the NVMe device) the SAS controller was just as
unstable.

Seeing how most of the hardware is still very new, I don't expect FreeBSD
(especially arm64) to support it.  I'd like to help in any way that I can,
should someone be interested.  The system has an IPMI and I'd be willing to
offer remote access to it via VPN for as long as it's required (if that's a
thing that's normally done), on a dedicated network with any other required
resources.
