System Hang / kernel panic with new mps driver

John jwd at freebsd.org
Tue Feb 28 23:15:42 UTC 2012


Hi Folks,

   I realize there has been some discussion going on related to the new
mps driver recently. Today I caught a kernel panic when we lost a multipath
channel and then fixed it (replaced the cable).

   The complete dmesg from a system boot with hw.mps.debug_level=0xd (it's large):

http://people.freebsd.org/~jwd/mpsdmesg.txt

   Kernel is 9-stable: r231864M /usr/obj/usr/src.2012-02-17_01.26.11/sys/ZFS 

   When the cable was replaced, the system started throwing messages
to the console quite fast:

Feb 28 18:33:24 testsys kernel: mps1: mpssas_alloc_tm freezing simq
Feb 28 18:33:24 testsys kernel: mps1: timedout cm 0xffffff800136b240 allocated tm 0xffffff800132f528
Feb 28 18:33:24 testsys kernel: mps1: mpssas_scsiio_timeout checking sc 0xffffff800131a000 cm 0xffffff800135ee38
Feb 28 18:33:24 testsys kernel: (xpt0:mps1:0:67:0): SERVICE ACTION IN(16). CDB: 9e 10 0 0 0 0 0 0 0 0 0 0 0 d 0 0 length 13 SMID 776 completed timedout cm 0xffffff800136b240 ccb 0xfffffe00612f
4800 during recovery ioc 8048 s(noperiph:mps1:0:67:0): SMID 29 abort TaskMID 776 status 0x0 code 0x0 count 1
Feb 28 18:33:24 testsys kernel: (noperiph:mps1:0:67:0): SMID 29 finished recovery after aborting TaskMID 776
Feb 28 18:33:24 testsys kernel: mps1: mpssas_free_tm releasing simq
Feb 28 18:33:24 testsys kernel: (xpt0:mps1:0:67:0): SERVICE ACTION IN(16). CDB: 9e 10 0 0 0 0 0 0 0 0 0 0 0 d 0 0 length 13 SMID 623 command timeout cm 0xffffff800135ee38 ccb 0xfffffe12d189680
0
Feb 28 18:33:24 testsys kernel: mps1: mpssas_alloc_tm freezing simq
Feb 28 18:33:24 testsys kernel: mps1: timedout cm 0xffffff800135ee38 allocated tm 0xffffff800132f670
Feb 28 18:33:24 testsys kernel: (xpt0:mps1:0:67:0): SERVICE ACTION IN(16). CDB: 9e 10 0 0 0 0 0 0 0 0 0 0 0 d 0 0 length 13 SMID 623 completed timedout cm 0xf
Feb 28 18:33:24 testsys kernel: fffff800135ee38 ccb 0xfffffe12d1896800 during recovery ioc 8048 s(noperiph:mps1:0:67:0): SMID 30 abort TaskMID 623 status 0x0 code 0x0 count 1
Feb 28 18:33:24 testsys kernel: (noperiph:mps1:0:67:0): SMID 30 finished recovery after aborting TaskMID 623
Feb 28 18:33:24 testsys kernel: mps1: mpssas_free_tm releasing simq

   Many, many of the above.. eventually followed by:

Feb 28 18:35:07 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0102 to persistent table because there is no free space available
Feb 28 18:35:07 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0104 to persistent table because there is no free space available
Feb 28 18:35:07 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0105 to persistent table because there is no free space available
...
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0132 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0133 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0134 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0135 to persistent table because there is no free space available

   more mps CDB messages, and then

Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0136 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0137 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0138 to persistent table because there is no free space available
...
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0150 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0151 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0152 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0153 to persistent table because there is no free space available
Feb 28 18:35:08 testsys kernel: _mapping_add_new_device: failed to add the device with handle 0x0155 to persistent table because there is no free space available

   At this point the kernel panic'd. (typed from a phone pic - sorry)

Fatal trap 12: page fault while in kernel mode
...
Stopped at xpt_compile_path+0x87: cmpl 0x50(%rbx),%r11d

bt
xpt_compile_path() at xpt_compile_path+0x87
xpt_create_path() at xpt_create_path+0x66
mps_rescan_done() at mpssas_rescan_done+0x16a
camisr_runqueue() at camisr_runqueue+0x63
intr_event_execute_handlers() at intr_event_execute_handlers+0x104
ithread_loop() at ithread_loop+0xa4
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe

Not sure if it makes a difference, but during a normal (clean boot), the above
'failed to add the device' messages show up 160 times. This doesn't appear
to be a #define value, but a 'fact' from the adapter. Since all the shelves
are dual attached and every disk shows up twice, could this make a difference
to the initialization logic?  (ses devices show up twice also).

Any thoughts are welcome. The system is available for testing.

Thanks,
John


More information about the freebsd-scsi mailing list