isp(4) QLE2462 initiator failure with 10.3-RELEASE
Robroy Gregg
robroy at robroygregg.com
Thu Oct 6 17:27:30 UTC 2016
FreeBSD Friends,
I opened a FreeBSD Forums thread with this question on Monday. I'm sorry
to duplicate the question in two places, yet I figured I might have better
luck with being noticed by developers here on the mailing lists. Here's
the thread:
https://forums.freebsd.org/threads/57923/
A chum and I have been setting up some FreeBSD 10.3-RELEASE servers at
work, which access ZFS pools on Hitachi Modular and Enterprise family
arrays. FreeBSD attaches to the Brocade fabric with a QLE2462 FC HBA, and
sees four paths to each LU.
Here's a drawing of the basic idea:
http://www.robroygregg.com/misc/2016Oct03.PNG
The drawing leaves out a few more arrays (of the same types), and various
switches in the fabric between the arrays and the two 6510s (in the
drawing).
===== The Problem =====
The first FC HBA port, isp0, stopped working spontaneously after several
weeks of uptime with light I/O. All LU paths automatically failed over to
isp1, yet paths through isp0 remain non-functional even now.
The first sign of trouble appeared in /var/log/messages, followed by many
more similar errors for other LU paths:
isp0: Chan 0 Abort Cmd for N-Port 0x0005 @ Port 0x111300
isp0: Polled Mailbox Command (0x54) Timeout (5000000us) (started @ isp_control:4733)
isp0: Mailbox Command 'EXECUTE IOCB A64' failed (TIMEOUT)
isp0: isp_watchdog: timeout for handle 0x6570200d
(da5:isp0:0:4:1): FIN dl16384 resid 0 CDB=0x2a 0x00 0x03 0x51 0x1b 0xe5 0x00 0x00 0x20 0x00 STS 0x0 XS_ERR=0xb
(da5:isp0:0:4:1): WRITE(10). CDB: 2a 00 03 51 1b e5 00 00 20 00
(da5:isp0:0:4:1): CAM status: Command timeout
(da5:isp0:0:4:1): Retrying command
These caused successful fail-overs to paths through isp1, which looked
like this in /var/log/messages:
(da5:isp0:0:4:1): Error 5, Retries exhausted
GEOM_MULTIPATH: Error 5, da5 in 85040360_0999 marked FAIL
GEOM_MULTIPATH: da17 is now active path in 85040360_0999
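For anyone reading the da5 excerpt above, the WRITE(10) CDB can be decoded by hand. Here's a minimal Python sketch, assuming the standard 10-byte WRITE(10) layout (opcode 0x2a, a big-endian LBA in bytes 2-5, a big-endian transfer length in bytes 7-8). The function name decode_write10 is only illustrative, not part of any FreeBSD tool, and the 512-byte block size is an assumption inferred from the "dl16384" value in the log.

```python
def decode_write10(cdb_hex):
    """Decode a WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks). Assumes the standard 10-byte layout.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # transfer length in blocks
    return lba, blocks


# CDB copied from the da5 log line above.
lba, blocks = decode_write10("2a 00 03 51 1b e5 00 00 20 00")
BLOCK_SIZE = 512  # assumption: inferred from dl16384, not reported by the log
print(f"LBA {lba:#x}, {blocks} blocks, {blocks * BLOCK_SIZE} bytes")
```

Under that block-size assumption, the 32-block transfer comes to 16384 bytes, which agrees with "dl16384" and "resid 0" in the FIN line. In other words the log describes an ordinary small write that timed out, not a malformed request.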
===== What I've already tried =====
* I tried manually failing back to paths through isp0 with commands
like "gmultipath restore 66209_002E da2" followed by "gmultipath
rotate 66209_002E". When I/O is attempted over isp0, the original
symptom reappears (shown in context below) until the LU fails back
to a path through isp1.
GEOM_MULTIPATH: da3 in 66209_002E is marked OK.
GEOM_MULTIPATH: da3 is now active path in 66209_002E
isp0: Chan 0 Abort Cmd for N-Port 0x0004 @ Port 0x0e2000
isp0: Polled Mailbox Command (0x54) Timeout (5000000us) (started @ isp_control:4733)
isp0: Mailbox Command 'EXECUTE IOCB A64' failed (TIMEOUT)
isp0: isp_watchdog: timeout for handle 0x65a7200d
(da3:isp0:0:3:0): FIN dl2560 resid 0 CDB=0x2a 0x00 0x04 0x2a 0xa7 0x89 0x00 0x00 0x05 0x00 STS 0x0 XS_ERR=0xb
(da3:isp0:0:3:0): WRITE(10). CDB: 2a 00 04 2a a7 89 00 00 05 00
(da3:isp0:0:3:0): CAM status: Command timeout
(da3:isp0:0:3:0): Retrying command
* I've tried failing over to every possible array target for an
LU, over isp0; it was the same for each target.
* I've tried replacing every fiber optic cabling segment between
the isp0 HBA port and the switch; the behavior was unchanged.
* I've tried physically swapping the isp0 and isp1 HBA port
connections--the symptom stuck to isp0, even when its I/Os were
being attempted through the physical connection formerly used
(successfully) by isp1.
* I've tried disabling and re-enabling the Brocade switch port.
When the port was enabled, it assumed the "In_Sync" state
(instead of the "Online" state it shows when it's working):
2 2 150200 id N4 In_Sync FC
===== Computer information =====
This is a Hitachi CR220H, which is based on an MSI S0051a motherboard.
===== FC HBA information =====
This is a QLE2462 at firmware level 8.01.02 and BIOS level 3.29.
ispfw(4) is in use, and claims to have successfully loaded its own
firmware onto the card during boot, presumably overriding the levels I
flashed (the ones mentioned here).
Related sysctls:
# sysctl -a | grep dev.isp
dev.isp.1.topo: 3
dev.isp.1.loopstate: 9
dev.isp.1.fwstate: 3
dev.isp.1.linkstate: 1
dev.isp.1.speed: 4
dev.isp.1.role: 2
dev.isp.1.gone_device_time: 30
dev.isp.1.loop_down_limit: 60
dev.isp.1.wwpn: 2378182195041974935
dev.isp.1.wwnn: 2305843126027336343
dev.isp.1.%parent: pci3
dev.isp.1.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x1077 subdevice=0x0138 class=0x0c0400
dev.isp.1.%location: pci0:3:0:1
dev.isp.1.%driver: isp
dev.isp.1.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
dev.isp.0.topo: 3
dev.isp.0.loopstate: 9
dev.isp.0.fwstate: 3
dev.isp.0.linkstate: 1
dev.isp.0.speed: 4
dev.isp.0.role: 2
dev.isp.0.gone_device_time: 30
dev.isp.0.loop_down_limit: 60
dev.isp.0.wwpn: 2377900720063167127
dev.isp.0.wwnn: 2305843126025239191
dev.isp.0.%parent: pci3
dev.isp.0.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x1077 subdevice=0x0138 class=0x0c0400
dev.isp.0.%location: pci0:3:0:0
dev.isp.0.%driver: isp
dev.isp.0.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
dev.isp.%parent:
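The wwpn and wwnn sysctls above print as decimal integers, which are awkward to compare against the hex WWNs a Brocade switch displays. Here's a minimal Python sketch that converts them to the usual colon-separated form, assuming each value is simply the 64-bit WWN read as a big-endian unsigned integer. The helper name format_wwn is only illustrative.

```python
def format_wwn(value):
    """Format a 64-bit WWN, given as an integer, as colon-separated hex."""
    raw = value.to_bytes(8, "big")  # assumption: WWN is a big-endian 64-bit value
    return ":".join(f"{b:02x}" for b in raw)


# Decimal values copied from the sysctl output above.
sysctl_values = {
    "dev.isp.0.wwpn": 2377900720063167127,
    "dev.isp.0.wwnn": 2305843126025239191,
    "dev.isp.1.wwpn": 2378182195041974935,
    "dev.isp.1.wwnn": 2305843126027336343,
}

for name, value in sysctl_values.items():
    print(f"{name}: {format_wwn(value)}")
```

With this conversion isp0's WWPN comes out as 21:00:00:1b:32:82:ba:97, which is the form to look for in the switch's name server and zoning configuration.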
===== FC switch information =====
Each FC HBA port's attached to a (separate) Brocade 6510 running FOS
v7.4.1. The symptom's not specific to either of these switches (I tried
swapping the connections around, and the symptom stuck to isp0).
===== Array information =====
LUs from both Hitachi Modular (AMS) and Enterprise (VSP) arrays are
visible over the QLE2462. When this problem happens, the behavior's
uniform for all array paths; the symptom's not specific to any one array,
or array family.
===== What's happening now =====
I'm guessing that this problem would temporarily go away if I rebooted the
computer, yet we can't continue with the project until we figure out what
happened to isp0. We're afraid that it'll happen again, naturally at the
most inopportune time possible, so the computer's still in its problem
state now.
Thanks so very much!
Robroy
Robroy Gregg
Salinas, California