[Bug 255930] ocs_fc Lost all connected devices after some use.

From: <bugzilla-noreply_at_freebsd.org>
Date: Sun, 16 May 2021 18:30:53 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255930

            Bug ID: 255930
           Summary: ocs_fc Lost all connected devices after some use.
           Product: Base System
           Version: Unspecified
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: arne@Steinkamm.COM

Created attachment 225001
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=225001&action=edit
Message file with all described problems. See Bug reports for time stamps

I connected an HP ProLiant 380 Gen9 server with Emulex FC HBAs to two simple FC
setups and attached a NetApp FlashFiler EF550 unit. To get the most out of ZFS
I assigned all 24 flash modules to the ProLiant without using the EF550 RAID
features.

I use geom_multipath to handle the redundant connections to the flash filer and
created a ZFS pool with 3 x 7-disk raidz1 vdevs, one spare, one log and one
cache disk.
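
As an illustration of that layout, the following sketch builds a zpool create
command line for 3 x 7-disk raidz1 plus one spare, one log and one cache
device. The multipath label names are hypothetical (the report does not list
them), the pool name "zone" is taken from the import command further down, and
which module serves as spare, log or cache is an assumption.

```python
# Hypothetical multipath labels; the real names are not given in the report.
labels = [f"multipath/ef550-{i:02d}" for i in range(24)]


def build_zpool_create(pool, devs, vdev_width=7, vdev_count=3):
    """Return argv for vdev_count raidz1 vdevs plus spare, log and cache."""
    needed = vdev_width * vdev_count + 3
    if len(devs) != needed:
        raise ValueError(f"expected {needed} devices, got {len(devs)}")
    argv = ["zpool", "create", pool]
    for v in range(vdev_count):
        argv.append("raidz1")
        argv.extend(devs[v * vdev_width:(v + 1) * vdev_width])
    # Assumption: the last three modules are spare, log and cache, in that order.
    spare, log, cache = devs[needed - 3:]
    argv += ["spare", spare, "log", log, "cache", cache]
    return argv


zpool_argv = build_zpool_create("zone", labels)
print(" ".join(zpool_argv))
```

The sketch only prints the command; it accounts for all 24 modules
(3 x 7 + 3) and does not touch any disk.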

The read/write speed is good (2.5 GB/s according to zpool iostat), but after
some minutes of heavy use I get
"kernel: ocs_fc0: ocs_initiator_io: device LOST 0" messages and all FC-connected
disks are gone.

I found no way to recover out of this error situation other than reboot, panic
(zfs is not happy about the situation) or hardware reset.
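
To see which HBA instances report the loss, a messages file can be scanned for
the line quoted above. This is a minimal sketch; the only sample line used is
the message text from this report, and the regular expression assumes the
other ocs_fc instances log in the same form.

```python
import re
from collections import Counter

# Matches the message quoted in the report; the unit number is captured.
LOST_RE = re.compile(r"(ocs_fc\d+): ocs_initiator_io: device LOST")


def count_device_lost(lines):
    """Count 'device LOST' messages per ocs_fc instance."""
    counts = Counter()
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = ["kernel: ocs_fc0: ocs_initiator_io: device LOST 0"]
lost_counts = count_device_lost(sample)
print(dict(lost_counts))
```

For a real log, pass an open file instead of the sample list, for example
count_device_lost(open("/var/log/messages")).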

Further observations:

- reported topologies and link speeds are correct.

- ef550 replaced with identical spare unit: no change

- changed fc ports: no effect

- used different emulex cards (alone, mixed): no effect, problem happens with
any combination of installed emulex cards

- tried qlogic cards (driver: isp(4)): No problems, works 100% stable but
slightly slower io performance.

- tried 12.1-RELEASE, 12.2-RELEASE and 13.0-RELEASE. Last one with generic
kernel without any changes. Every time lost all fc devices.

- Boot with disabled switch fc ports:
  After portenable of the Brocades' ports the FC links went up, but the disks
were not attached automatically.
  A camcontrol rescan all was not successful; thousands of "device not ready"
messages flooded the console.
  The only way to get the flash modules online is to boot the server with
working fc setup.

- Bumping the emulex cards to the newest available firmware had no visible
effect.

- Playing with the HBA related BIOS settings
  "HP Shared Memory Feature", "Brocade FA-PWWN" and "PLOGT Retry Timer" had no
visible effect.


More details of the last try with 13.0-RELEASE generic:

uname -a:
FreeBSD vwcnctd00fs003.dev.kpdm01.group.vwg 13.0-RELEASE FreeBSD 13.0-RELEASE
#0 releng/13.0-n244733-ea31abc261f: Fri Apr  9 04:24:09 UTC 2021    
root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

pciconf -lv:

ocs_fc0@pci0:8:0:0:     class=0x0c0400 rev=0x01 hdr=0x00 vendor=0x10df
device=0xe300 subvendor=0x1590 subdevice=0x0214
    vendor     = 'Emulex Corporation'
    device     = 'LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter'
    class      = serial bus
    subclass   = Fibre Channel
ocs_fc1@pci0:8:0:1:     class=0x0c0400 rev=0x01 hdr=0x00 vendor=0x10df
device=0xe300 subvendor=0x1590 subdevice=0x0214
    vendor     = 'Emulex Corporation'
    device     = 'LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter'
    class      = serial bus
    subclass   = Fibre Channel
ocs_fc2@pci0:129:0:0:   class=0x0c0400 rev=0x30 hdr=0x00 vendor=0x10df
device=0xe200 subvendor=0x103c subdevice=0x197f
    vendor     = 'Emulex Corporation'
    device     = 'LPe15000/LPe16000 Series 8Gb/16Gb Fibre Channel Adapter'
    class      = serial bus
    subclass   = Fibre Channel
ocs_fc3@pci0:129:0:1:   class=0x0c0400 rev=0x30 hdr=0x00 vendor=0x10df
device=0xe200 subvendor=0x103c subdevice=0x197f
    vendor     = 'Emulex Corporation'
    device     = 'LPe15000/LPe16000 Series 8Gb/16Gb Fibre Channel Adapter'
    class      = serial bus
    subclass   = Fibre Channel

HP device names:
HPE SN1200E 16Gb 2p FC HBA Product Part Number: Q0L14-63001 Assembly Number:
870002-001
HP SN1100E 16Gb 2P FC HBA  Product Part Number: C8R39-60001 Assembly Number:
719212-001

The EF550 has two independent controllers both connected to all flash module
bays. Each controller has two FC ports.
These ports are connected to two independent Brocade FC switches (no interlink
fibre).
One port of each Emulex card is connected to one of the FC switches.
The other port of each Emulex card is not in use (connected to an enterprise
fabric network independent from my laboratory setup, but ports are disabled on
the switch side).
Using only one of the Emulex cards does not change the effect. I tried all
possible permutations.
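
The cabling described above can be modelled to check how many paths each flash
module should have. The sketch assumes that each switch carries one port of
each controller and one HBA port (ocs_fc0 on one switch, ocs_fc2 on the
other); the switch and controller port names are hypothetical.

```python
from itertools import product

# Assumed fabric membership; names other than ocs_fc0/ocs_fc2 are made up.
fabrics = {
    "switch_a": {"initiators": ["ocs_fc0"],
                 "targets": ["ctrl_a_port1", "ctrl_b_port1"]},
    "switch_b": {"initiators": ["ocs_fc2"],
                 "targets": ["ctrl_a_port2", "ctrl_b_port2"]},
}


def enumerate_paths(fabrics):
    """List (initiator, switch, target) triples; no path crosses switches."""
    paths = []
    for switch, members in fabrics.items():
        for ini, tgt in product(members["initiators"], members["targets"]):
            paths.append((ini, switch, tgt))
    return paths


paths = enumerate_paths(fabrics)
for path in paths:
    print(" -> ".join(path))
```

Under these assumptions every module is reachable over four paths, two per
HBA port, which agrees with the four links geom_multipath reports.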

To get valid data for this bug report I installed 13.0-RELEASE with a minimal
setup:


/boot/device.hints:
hint.ocs_fc.0.initiator="1"
hint.ocs_fc.2.initiator="1"
hint.ocs_fc.0.topology="1"
hint.ocs_fc.2.topology="1"
hint.ocs_fc.0.speed="16000"
hint.ocs_fc.2.speed="16000"

/etc/sysctl.conf:
dev.ocs_fc.1.port_state=offline
dev.ocs_fc.3.port_state=offline
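
A small sketch that groups the settings above per ocs_fc unit, to make it
easier to see which ports are configured as initiators and which are taken
offline. It only handles the two line formats shown here and is not a general
parser for device.hints or sysctl.conf.

```python
import re

DEVICE_HINTS = '''\
hint.ocs_fc.0.initiator="1"
hint.ocs_fc.2.initiator="1"
hint.ocs_fc.0.topology="1"
hint.ocs_fc.2.topology="1"
hint.ocs_fc.0.speed="16000"
hint.ocs_fc.2.speed="16000"
'''

SYSCTL_CONF = '''\
dev.ocs_fc.1.port_state=offline
dev.ocs_fc.3.port_state=offline
'''

# Accepts hint.ocs_fc.N.key="value" and dev.ocs_fc.N.key=value lines.
LINE_RE = re.compile(r'^(?:hint|dev)\.ocs_fc\.(\d+)\.(\w+)="?([^"]*)"?$')


def per_unit_settings(*texts):
    """Return {unit number: {key: value}} for all recognised lines."""
    units = {}
    for text in texts:
        for line in text.splitlines():
            match = LINE_RE.match(line.strip())
            if match:
                unit, key, value = match.groups()
                units.setdefault(int(unit), {})[key] = value
    return units


settings = per_unit_settings(DEVICE_HINTS, SYSCTL_CONF)
for unit in sorted(settings):
    print(f"ocs_fc{unit}: {settings[unit]}")
```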


In the attached messages file you will find the following:

May 15 19:21:43 - 19:29:22
First boot and configuring network connectivity on the shell.

May 15 19:44:24 Enabling FC ports on both brocades

May 15 19:47:21 camcontrol rescan all (all rescans successful according to
camcontrol)

May 15 19:59:15 reboot --- Now with enabled FC links. It will find the flash
modules

May 15 20:06:36 kldload geom_multipath.ko

geom_multipath finds four preconfigured links to each flash module. This is
correct.

Now I ran zpool import zone and started a couple of test tools.
Output of zpool iostat zone 1:
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zone        14.7T   486G  40.1K      0  2.61G      0
zone        14.7T   486G  39.0K    436  2.58G  1.94M
zone        14.7T   486G  41.6K      0  2.60G      0
zone        14.7T   486G  39.4K      0  2.60G      0
zone        14.7T   486G  39.4K      0  2.62G      0
zone        14.7T   486G  40.7K      0  2.57G      0
zone        14.7T   486G  39.9K    420  2.54G  1.94M
zone        14.7T   486G  39.5K      0  2.58G      0
zone        14.7T   486G  39.6K      0  2.64G      0
zone        14.7T   486G  39.3K      0  2.57G      0
zone        14.7T   486G  39.4K      0  2.62G      0
...
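
For reference, the read bandwidth column of the samples above can be averaged
with a short script. It is a sketch that assumes the K/M/G/T suffixes printed
by zpool iostat are powers of 1024 and that the sixth column is the read
bandwidth, as in the header shown.

```python
IOSTAT = """\
zone        14.7T   486G  40.1K      0  2.61G      0
zone        14.7T   486G  39.0K    436  2.58G  1.94M
zone        14.7T   486G  41.6K      0  2.60G      0
zone        14.7T   486G  39.4K      0  2.60G      0
zone        14.7T   486G  39.4K      0  2.62G      0
zone        14.7T   486G  40.7K      0  2.57G      0
zone        14.7T   486G  39.9K    420  2.54G  1.94M
zone        14.7T   486G  39.5K      0  2.58G      0
zone        14.7T   486G  39.6K      0  2.64G      0
zone        14.7T   486G  39.3K      0  2.57G      0
zone        14.7T   486G  39.4K      0  2.62G      0
"""

# Assumption: suffixes are binary multiples.
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def to_bytes(field):
    """Convert a field such as '2.61G' or '436' to a plain number."""
    if field[-1] in UNITS:
        return float(field[:-1]) * UNITS[field[-1]]
    return float(field)


def mean_read_bandwidth(text, pool="zone"):
    """Average the read bandwidth column over all lines for the given pool."""
    samples = [to_bytes(line.split()[5])
               for line in text.splitlines()
               if line.split()[:1] == [pool]]
    return sum(samples) / len(samples)


mean_read_gib = mean_read_bandwidth(IOSTAT) / 2**30
print(f"mean read bandwidth: {mean_read_gib:.2f} G/s")
```

Over these eleven one-second samples the mean is about 2.59 G per second.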


May 15 20:15:15 The problem starts

May 15 20:16:18 attempt of a camcontrol rescan with no success

My short-term solution is to use QLogic cards with the isp driver, which has
been 100% stable without any changes.
