isp(4) QLE2462 initiator failure with 10.3-RELEASE

Thu Oct 6 17:27:30 UTC 2016

FreeBSD Friends,

I opened a FreeBSD Forums thread with this question on Monday.  I'm sorry 
to duplicate the question in two places, yet I figured I might have better 
luck with being noticed by developers here on the mailing lists.  Here's 
the thread:

     https://forums.freebsd.org/threads/57923/

A chum and I have been setting up some FreeBSD 10.3-RELEASE servers at 
work, which access ZFS pools on Hitachi Modular and Enterprise family 
arrays.  FreeBSD attaches to the Brocade fabric with a QLE2462 FC HBA, and 
sees four paths to each LU.

Here's a drawing of the basic idea:

     http://www.robroygregg.com/misc/2016Oct03.PNG

The drawing leaves out a few more arrays (of the same types), and various 
switches in the fabric between the arrays and the two 6510s (in the 
drawing).

===== The Problem =====

The first FC HBA port, isp0, stopped working spontaneously after several
weeks of uptime with light I/O.  All LU paths automatically failed over to
isp1, yet paths through isp0 remain non-functional even now.

The first sign of trouble appeared in /var/log/messages, followed by many 
more similar errors for other LU paths:

   isp0: Chan 0 Abort Cmd for N-Port 0x0005 @ Port 0x111300
   isp0: Polled Mailbox Command (0x54) Timeout (5000000us) (started @ isp_control:4733)
   isp0: Mailbox Command 'EXECUTE IOCB A64' failed (TIMEOUT)
   isp0: isp_watchdog: timeout for handle 0x6570200d
   (da5:isp0:0:4:1): FIN dl16384 resid 0 CDB=0x2a 0x00 0x03 0x51 0x1b 0xe5 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
   (da5:isp0:0:4:1): WRITE(10). CDB: 2a 00 03 51 1b e5 00 00 20 00
   (da5:isp0:0:4:1): CAM status: Command timeout
   (da5:isp0:0:4:1): Retrying command
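
For anyone who wants to read the CDB in the log above, here's a minimal Python sketch that decodes it.  It assumes the standard WRITE(10) layout (opcode 0x2a, LBA in bytes 2-5, transfer length in bytes 7-8) and 512-byte logical blocks.  The block size is my assumption, though it agrees with the "dl16384" in the FIN line.  The function name is just for illustration.

```python
BLOCK_SIZE = 512  # assumption: 512-byte logical blocks on this LU


def decode_write10(cdb_hex):
    """Decode a WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    return {
        # Bytes 2-5: logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7-8: transfer length in blocks, big-endian.
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


# CDB copied from the da5 WRITE(10) line in the log excerpt.
fields = decode_write10("2a 00 03 51 1b e5 00 00 20 00")
print("LBA:", fields["lba"])
print("blocks:", fields["blocks"])
print("bytes:", fields["blocks"] * BLOCK_SIZE)
```

The byte count it prints can be compared against the "dl" value in the FIN line, which is a quick way to confirm that the timed-out command was an ordinary small write.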

These timeouts caused successful fail-overs to paths through isp1, which
looked like this in /var/log/messages:

   (da5:isp0:0:4:1): Error 5, Retries exhausted
   GEOM_MULTIPATH: Error 5, da5 in 85040360_0999 marked FAIL
   GEOM_MULTIPATH: da17 is now active path in 85040360_0999
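
To get a summary of fail-overs out of a long /var/log/messages, something like the following Python sketch may help.  The two patterns are modeled only on the GEOM_MULTIPATH lines shown above, so other message variants won't match, and the function name is my own.

```python
import re

# Patterns based solely on the two GEOM_MULTIPATH lines quoted above.
FAIL_RE = re.compile(r"GEOM_MULTIPATH: Error (\d+), (\S+) in (\S+) marked FAIL")
ACTIVE_RE = re.compile(r"GEOM_MULTIPATH: (\S+) is now active path in (\S+)")


def failover_events(lines):
    """Return (event, multipath_name, provider) tuples in log order."""
    events = []
    for line in lines:
        match = FAIL_RE.search(line)
        if match:
            events.append(("fail", match.group(3), match.group(2)))
            continue
        match = ACTIVE_RE.search(line)
        if match:
            events.append(("active", match.group(2), match.group(1)))
    return events


sample = [
    "(da5:isp0:0:4:1): Error 5, Retries exhausted",
    "GEOM_MULTIPATH: Error 5, da5 in 85040360_0999 marked FAIL",
    "GEOM_MULTIPATH: da17 is now active path in 85040360_0999",
]
for event in failover_events(sample):
    print(event)
```

Because it uses search() rather than match(), the syslog timestamp and hostname prefix on real log lines shouldn't get in the way.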

===== What I've already tried =====

   * I tried manually failing back to paths through isp0 with commands
     like "gmultipath restore 66209_002E da2" followed by "gmultipath
     rotate 66209_002E".  When I/Os are attempted over isp0, it shows the
     same original symptom (shown below in context) until it fails back
     to a path through isp1.

     GEOM_MULTIPATH: da3 in 66209_002E is marked OK.
     GEOM_MULTIPATH: da3 is now active path in 66209_002E
     isp0: Chan 0 Abort Cmd for N-Port 0x0004 @ Port 0x0e2000
     isp0: Polled Mailbox Command (0x54) Timeout (5000000us) (started @ isp_control:4733)
     isp0: Mailbox Command 'EXECUTE IOCB A64' failed (TIMEOUT)
     isp0: isp_watchdog: timeout for handle 0x65a7200d
     (da3:isp0:0:3:0): FIN dl2560 resid 0 CDB=0x2a 0x00 0x04 0x2a 0xa7 0x89 0x00 0x00 0x05 0x00  STS 0x0 XS_ERR=0xb
     (da3:isp0:0:3:0): WRITE(10). CDB: 2a 00 04 2a a7 89 00 00 05 00
     (da3:isp0:0:3:0): CAM status: Command timeout
     (da3:isp0:0:3:0): Retrying command

   * I've tried failing over to every possible array target for an
     LU, over isp0; it was the same for each target.

   * I've tried replacing every fiber optic cabling segment between
     the isp0 HBA port and the switch; the behavior was unchanged.

   * I've tried physically swapping the isp0 and isp1 HBA port
     connections--the symptom stuck to isp0, even when its I/Os were
     being attempted through the physical connection formerly used
     (successfully) by isp1.

   * I've tried disabling and re-enabling the Brocade switch port.
     When the port was enabled, it assumed the "In_Sync" state
     (instead of the "Online" state it shows when it's working):

     2   2   150200   id    N4       In_Sync     FC
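
For checking the port state from a script, here's a rough Python sketch that splits a switchshow port row like the one above into named fields.  The field names (Index, Port, Address, Media, Speed, State, Proto) are my assumption about the column order, taken from the usual switchshow header, which isn't reproduced in this message.

```python
from typing import NamedTuple


class PortRow(NamedTuple):
    # Field names are assumed from the usual switchshow column header.
    index: int
    port: int
    address: str
    media: str
    speed: str
    state: str
    proto: str


def parse_port_row(line):
    """Split one whitespace-separated switchshow port row into fields."""
    fields = line.split()
    if len(fields) < 7:
        raise ValueError("expected at least 7 columns")
    idx, port, address, media, speed, state, proto = fields[:7]
    return PortRow(int(idx), int(port), address, media, speed, state, proto)


# Row copied from the switch output quoted above.
row = parse_port_row("2   2   150200   id    N4       In_Sync     FC")
print("state:", row.state)
print("healthy:", row.state == "Online")
```

Anything after the seventh column (such as the port type and remote WWN a working port would show) is ignored here.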

===== Computer information =====

This is a Hitachi CR220H, which is based on an MSI S0051a motherboard.

===== FC HBA information =====

This is a QLE2462 at firmware level 8.01.02 and BIOS level 3.29.
ispfw(4) is in use, and claims to have successfully loaded its own
firmware onto the card during boot, presumably overriding the levels I
flashed (mentioned here).

Related sysctls:

   # sysctl -a | grep dev.isp
   dev.isp.1.topo: 3
   dev.isp.1.loopstate: 9
   dev.isp.1.fwstate: 3
   dev.isp.1.linkstate: 1
   dev.isp.1.speed: 4
   dev.isp.1.role: 2
   dev.isp.1.gone_device_time: 30
   dev.isp.1.loop_down_limit: 60
   dev.isp.1.wwpn: 2378182195041974935
   dev.isp.1.wwnn: 2305843126027336343
   dev.isp.1.%parent: pci3
   dev.isp.1.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x1077 subdevice=0x0138 class=0x0c0400
   dev.isp.1.%location: pci0:3:0:1
   dev.isp.1.%driver: isp
   dev.isp.1.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
   dev.isp.0.topo: 3
   dev.isp.0.loopstate: 9
   dev.isp.0.fwstate: 3
   dev.isp.0.linkstate: 1
   dev.isp.0.speed: 4
   dev.isp.0.role: 2
   dev.isp.0.gone_device_time: 30
   dev.isp.0.loop_down_limit: 60
   dev.isp.0.wwpn: 2377900720063167127
   dev.isp.0.wwnn: 2305843126025239191
   dev.isp.0.%parent: pci3
   dev.isp.0.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x1077 subdevice=0x0138 class=0x0c0400
   dev.isp.0.%location: pci0:3:0:0
   dev.isp.0.%driver: isp
   dev.isp.0.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
   dev.isp.%parent:
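
The wwpn and wwnn sysctls above are printed in decimal, which makes them hard to match against the switch's name server or zoning configuration.  Here's a small Python sketch that converts them to the usual colon-separated hex form.  The function name is just for illustration.

```python
def wwn_hex(decimal_wwn):
    """Format a 64-bit WWN (as a decimal integer) as colon-separated hex."""
    raw = int(decimal_wwn).to_bytes(8, "big")
    return ":".join(f"{byte:02x}" for byte in raw)


# Values copied from the sysctl output above.
sysctls = {
    "dev.isp.0.wwpn": 2377900720063167127,
    "dev.isp.0.wwnn": 2305843126025239191,
    "dev.isp.1.wwpn": 2378182195041974935,
    "dev.isp.1.wwnn": 2305843126027336343,
}
for name, value in sysctls.items():
    print(name, wwn_hex(value))
```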

===== FC switch information =====

Each FC HBA port's attached to a (separate) Brocade 6510 running FOS 
v7.4.1.  The symptom's not specific to either of these switches (I tried 
swapping the connections around, and the symptom stuck to isp0).

===== Array information =====

LUs from both Hitachi Modular (AMS) and Enterprise (VSP) arrays are 
visible over the QLE2462.  When this problem happens, the behavior's 
uniform for all array paths; the symptom's not specific to any one array, 
or array family.

===== What's happening now =====

I'm guessing that this problem would temporarily go away if I rebooted the 
computer, yet we won't be able to continue on with the project until we 
figure out what happened to isp0--we're afraid that it'll happen again, 
naturally at the most inopportune time possible.  So the computer's still 
in its problem state now.

Thanks so very much!
Robroy

Robroy Gregg
Salinas, California