isp(4) QLE2462 initiator failure with 10.3-RELEASE

Robroy Gregg robroy at robroygregg.com
Thu Oct 6 17:27:30 UTC 2016


FreeBSD Friends,

I opened a FreeBSD Forums thread with this question on Monday.  I'm sorry 
to duplicate the question in two places, yet I figured I might have better 
luck with being noticed by developers here on the mailing lists.  Here's 
the thread:

     https://forums.freebsd.org/threads/57923/

A chum and I have been setting up some FreeBSD 10.3-RELEASE servers at 
work, which access ZFS pools on Hitachi Modular and Enterprise family 
arrays.  FreeBSD attaches to the Brocade fabric with a QLE2462 FC HBA, and 
sees four paths to each LU.

Here's a drawing of the basic idea:

     http://www.robroygregg.com/misc/2016Oct03.PNG

The drawing leaves out a few more arrays (of the same types), and various 
switches in the fabric between the arrays and the two 6510s (in the 
drawing).

===== The Problem =====

The first FC HBA port, isp0, stopped working spontaneously after several
weeks of uptime with light I/O.  All LU paths automatically failed over to
isp1, yet paths through isp0 remain non-functional even now.

The first sign of trouble appeared in /var/log/messages, followed by many 
more similar errors for other LU paths:

   isp0: Chan 0 Abort Cmd for N-Port 0x0005 @ Port 0x111300
   isp0: Polled Mailbox Command (0x54) Timeout (5000000us) (started @ isp_control:4733)
   isp0: Mailbox Command 'EXECUTE IOCB A64' failed (TIMEOUT)
   isp0: isp_watchdog: timeout for handle 0x6570200d
   (da5:isp0:0:4:1): FIN dl16384 resid 0 CDB=0x2a 0x00 0x03 0x51 0x1b 0xe5 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
   (da5:isp0:0:4:1): WRITE(10). CDB: 2a 00 03 51 1b e5 00 00 20 00
   (da5:isp0:0:4:1): CAM status: Command timeout
   (da5:isp0:0:4:1): Retrying command

These caused successful fail-overs to paths through isp1, which looked 
like this in /var/log/messages:

   (da5:isp0:0:4:1): Error 5, Retries exhausted
   GEOM_MULTIPATH: Error 5, da5 in 85040360_0999 marked FAIL
   GEOM_MULTIPATH: da17 is now active path in 85040360_0999
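
For anyone reading the WRITE(10) lines in these excerpts, here is a minimal
sketch of how the CDB bytes break down, using the standard WRITE(10) layout
(bytes 2-5 are the LBA, bytes 7-8 the transfer length in blocks).  The
function name decode_write10 is just an illustration, and the 512-byte block
size is an assumption, although it agrees with the "dl" values in the FIN
lines.

```python
def decode_write10(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode a WRITE(10) CDB given as space-separated hex bytes.

    block_size is an assumption (512 bytes); adjust it for other LUs.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length in blocks
    return {"lba": lba, "blocks": blocks, "bytes": blocks * block_size}


# CDBs copied from the da5 and da3 log lines in this message.
first = decode_write10("2a 00 03 51 1b e5 00 00 20 00")
second = decode_write10("2a 00 04 2a a7 89 00 00 05 00")
print(first)
print(second)
```

Both timed-out commands decode to small, ordinary writes (32 blocks and 5
blocks), which matches dl16384 and dl2560 in the FIN lines.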

===== What I've already tried =====

   * I tried manually failing back to paths through isp0 with commands
     like "gmultipath restore 66209_002E da2" followed by "gmultipath
     rotate 66209_002E".  When I/Os are attempted over isp0, the same
     original symptom appears (shown below in context) until
     GEOM_MULTIPATH fails over again to a path through isp1.

     GEOM_MULTIPATH: da3 in 66209_002E is marked OK.
     GEOM_MULTIPATH: da3 is now active path in 66209_002E
     isp0: Chan 0 Abort Cmd for N-Port 0x0004 @ Port 0x0e2000
     isp0: Polled Mailbox Command (0x54) Timeout (5000000us) (started @ isp_control:4733)
     isp0: Mailbox Command 'EXECUTE IOCB A64' failed (TIMEOUT)
     isp0: isp_watchdog: timeout for handle 0x65a7200d
     (da3:isp0:0:3:0): FIN dl2560 resid 0 CDB=0x2a 0x00 0x04 0x2a 0xa7 0x89 0x00 0x00 0x05 0x00  STS 0x0 XS_ERR=0xb
     (da3:isp0:0:3:0): WRITE(10). CDB: 2a 00 04 2a a7 89 00 00 05 00
     (da3:isp0:0:3:0): CAM status: Command timeout
     (da3:isp0:0:3:0): Retrying command

   * I've tried failing over to every possible array target for an
     LU, over isp0; it was the same for each target.

   * I've tried replacing every fiber optic cabling segment between
     the isp0 HBA port and the switch; the behavior was unchanged.

   * I've tried physically swapping the isp0 and isp1 HBA port
     connections--the symptom stuck to isp0, even when its I/Os were
     being attempted through the physical connection formerly used
     (successfully) by isp1.

   * I've tried disabling and re-enabling the Brocade switch port.
     When the port was re-enabled, it came up in the "In_Sync" state
     (instead of the "Online" state it shows when it's working):

     2   2   150200   id    N4       In_Sync     FC
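
The six-digit hex values in this message (the "Port 0x..." values in the
isp0 abort messages, and what I assume is the address column of the switch
port line above) are 24-bit Fibre Channel addresses.  A minimal sketch of
splitting one into its domain, area, and port bytes follows; split_fc_id is
a made-up helper name, and the labels are only my reading of where each
value came from.

```python
def split_fc_id(fc_id: int) -> tuple:
    """Split a 24-bit Fibre Channel address into (domain, area, port)."""
    if not 0 <= fc_id <= 0xFFFFFF:
        raise ValueError("FC addresses are 24 bits wide")
    return (fc_id >> 16) & 0xFF, (fc_id >> 8) & 0xFF, fc_id & 0xFF


# Values copied from this message; the labels are assumptions.
addresses = {
    "switch port line": 0x150200,
    "da5 abort message": 0x111300,
    "da3 abort message": 0x0E2000,
}
decoded = {label: split_fc_id(value) for label, value in addresses.items()}
for label, (domain, area, port) in decoded.items():
    print(f"{label}: domain={domain} area={area} port={port}")
```

The three addresses carry three different domain bytes, which is consistent
with the HBA and the array targets sitting on different switches in the
fabric.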

===== Computer information =====

This is a Hitachi CR220H, which is based on an MSI S0051a motherboard.

===== FC HBA information =====

This is a QLE2462 at firmware level 8.01.02 and BIOS level 3.29.
ispfw(4) is in use, and claims to have successfully loaded its own
firmware onto the card during boot, presumably overriding the firmware
level I flashed (the one mentioned here).

Related sysctls:

   # sysctl -a | grep dev.isp
   dev.isp.1.topo: 3
   dev.isp.1.loopstate: 9
   dev.isp.1.fwstate: 3
   dev.isp.1.linkstate: 1
   dev.isp.1.speed: 4
   dev.isp.1.role: 2
   dev.isp.1.gone_device_time: 30
   dev.isp.1.loop_down_limit: 60
   dev.isp.1.wwpn: 2378182195041974935
   dev.isp.1.wwnn: 2305843126027336343
   dev.isp.1.%parent: pci3
   dev.isp.1.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x1077 subdevice=0x0138 class=0x0c0400
   dev.isp.1.%location: pci0:3:0:1
   dev.isp.1.%driver: isp
   dev.isp.1.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
   dev.isp.0.topo: 3
   dev.isp.0.loopstate: 9
   dev.isp.0.fwstate: 3
   dev.isp.0.linkstate: 1
   dev.isp.0.speed: 4
   dev.isp.0.role: 2
   dev.isp.0.gone_device_time: 30
   dev.isp.0.loop_down_limit: 60
   dev.isp.0.wwpn: 2377900720063167127
   dev.isp.0.wwnn: 2305843126025239191
   dev.isp.0.%parent: pci3
   dev.isp.0.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x1077 subdevice=0x0138 class=0x0c0400
   dev.isp.0.%location: pci0:3:0:0
   dev.isp.0.%driver: isp
   dev.isp.0.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
   dev.isp.%parent:
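
The wwpn and wwnn sysctls above are printed as decimal integers, which are
awkward to compare with what the switch and the arrays show.  Here is a
minimal sketch for converting them to the usual colon-separated hex form;
format_wwn is a made-up helper name, not part of any FreeBSD tool.

```python
def format_wwn(value: int) -> str:
    """Render a 64-bit WWN, given as a decimal integer, as colon-separated hex."""
    raw = value.to_bytes(8, "big")  # raises OverflowError if it needs more than 64 bits
    return ":".join(f"{byte:02x}" for byte in raw)


# Decimal values copied from the sysctl output in this message.
isp0_wwpn = format_wwn(2377900720063167127)
isp1_wwpn = format_wwn(2378182195041974935)
print("isp0 wwpn:", isp0_wwpn)
print("isp1 wwpn:", isp1_wwpn)
```

For the values above this gives 21:00:00:1b:32:82:ba:97 for isp0 and
21:01:00:1b:32:a2:ba:97 for isp1.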

===== FC switch information =====

Each FC HBA port is attached to a separate Brocade 6510 running FOS
v7.4.1.  The symptom's not specific to either of these switches (I tried
swapping the connections around, and the symptom stuck to isp0).

===== Array information =====

LUs from both Hitachi Modular (AMS) and Enterprise (VSP) arrays are 
visible over the QLE2462.  When this problem happens, the behavior's 
uniform for all array paths; the symptom's not specific to any one array, 
or array family.

===== What's happening now =====

I'm guessing that this problem would temporarily go away if I rebooted the 
computer, yet we won't be able to continue on with the project until we 
figure out what happened to isp0--we're afraid that it'll happen again, 
naturally at the most inopportune time possible.  So the computer's still 
in its problem state now.

Thanks so very much!
Robroy

Robroy Gregg
Salinas, California

