mfi panic on recursed on non-recursive mutex MFI I/O lock

Fri Nov 9 17:06:04 UTC 2012

----- Original Message ----- 
From: "Steven Hartland"
...
> I've just had another panic, trace below, but it doesn't seem to be related
> to my changes so I'd appreciate your feedback on them as they are for now.
> 
> While the lock patch fixes the problems I've seen, it's not clear to me
> why mfi_tbolt_reset is acquiring the lock and hence requiring
> mfi_process_fw_state_chg_isr to jump through hoops to ensure locking
> around queue manipulation is done correctly. Given what it's doing
> (resetting the entire adapter) I wouldn't be surprised if it should
> really be acquiring the config lock.
> 
> Other things I've noticed / questions
> * Should mfi_abort sleep even if its call to mfi_mapcmd fails?
> * Should mfi_get_controller_info really ignore the error from mfi_mapcmd?
> * Do these controllers not support non-512-byte requests? Currently
> all syspd requests are done assuming 512-byte sectors, which the disk may
> not have. This will either reduce performance or potentially break totally
> if the firmware isn't translating it under the surface correctly.
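
To make the sector-size concern concrete, here is a small sketch of the offset-to-LBA arithmetic. It is not taken from the driver: struct lba_range and bytes_to_lba() are hypothetical names, and the 4096-byte disk is just an assumed example of a device that does not use 512-byte sectors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: starting block and block count. */
struct lba_range {
	uint64_t lba;
	uint32_t blocks;
};

/*
 * Convert a byte offset and length into a block range for the given
 * sector size. Assumes both are multiples of the sector size.
 */
static struct lba_range
bytes_to_lba(uint64_t offset, uint32_t length, uint32_t sector_size)
{
	struct lba_range r;

	assert(sector_size != 0);
	assert(offset % sector_size == 0 && length % sector_size == 0);
	r.lba = offset / sector_size;
	r.blocks = length / sector_size;
	return (r);
}
```

For an 8192-byte request at byte offset 40960, assuming 512-byte sectors gives LBA 80 and 16 blocks, while a 4096-byte sector size gives LBA 10 and 2 blocks. If nothing translates between the two, the request addresses the wrong blocks.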
> 
> Anyway, the new panic, manually transcribed, is:
> panic: Bad link elm 0xffffff0069b0fc0 next->prev != elm
> ...
> mfi_tbolt_get_cmd()
> mfi_build_mpt_pass_thru()
> mfi_tbolt_build_mpt_cmd()
> mfi_tbolt_send_frame()
> bus_dmamap_load()
> mfi_mapcmd()
> mfi_startio()
> mfi_syspd_strategy()
> g_disk_start()
> g_io_schedule_down()
> g_down_proc_body()
> fork_exit()
> fork_trampoline()
> 
> Looks like mfi_cmd_tbolt_tqh has become corrupt somehow, but as far as I
> can tell all manipulation is done using the TAILQ macros and under
> mfi_io_lock, so it's not obvious to me at this time why this is. Any ideas?
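
For anyone following along, the "next->prev != elm" message corresponds to a consistency check on the queue links. Below is a minimal userland sketch of that invariant, not driver code: struct cmd, next_link_ok(), count_bad_links() and demo_bad_links() are made-up names, and the corruption is simulated by overwriting a back-pointer by hand, which is only one of several ways such damage could arise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a driver command; not the real mfi structure. */
struct cmd {
	int id;
	TAILQ_ENTRY(cmd) link;
};

TAILQ_HEAD(cmd_queue, cmd);

/*
 * Forward-link invariant: if elm has a successor, the successor's
 * back-pointer must point at elm's own "next" field.
 */
static bool
next_link_ok(struct cmd *elm)
{
	struct cmd *next = TAILQ_NEXT(elm, link);

	return (next == NULL || next->link.tqe_prev == &elm->link.tqe_next);
}

/* Walk the queue and count elements whose forward link is inconsistent. */
static int
count_bad_links(struct cmd_queue *q)
{
	struct cmd *c;
	int bad = 0;

	TAILQ_FOREACH(c, q, link)
		if (!next_link_ok(c))
			bad++;
	return (bad);
}

/* Build a three-element queue, optionally damage it, and count bad links. */
static int
demo_bad_links(bool corrupt)
{
	struct cmd_queue q = TAILQ_HEAD_INITIALIZER(q);
	struct cmd a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };
	struct cmd stray = { .id = 99 };

	TAILQ_INSERT_TAIL(&q, &a, link);
	TAILQ_INSERT_TAIL(&q, &b, link);
	TAILQ_INSERT_TAIL(&q, &c, link);
	assert(!TAILQ_EMPTY(&q));

	if (corrupt) {
		/* Simulated stray write: b's back-pointer no longer refers to a. */
		b.link.tqe_prev = &stray.link.tqe_next;
	}
	return (count_bad_links(&q));
}
```

With the queue intact, demo_bad_links(false) finds no bad links. After the simulated stray write it flags one: the element ahead of the damaged one, which is consistent with the panic naming elm rather than its neighbour.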

I've gone through looking for the possible cause of this, and while there's
nothing directly connected to the manipulation of this queue, I've found and
fixed quite a large number of additional problems which may have been
indirectly causing this problem.

The biggest change is to use mfi_max_cmds to limit the value stored in
sc->mfi_max_fw_cmds, as this is used extensively throughout the driver
for allocation and range checks, so having it inconsistently set opened up
a large number of possible overrun errors.
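
As a rough illustration of the idea (a sketch only, not the code in the patch): clamp the firmware-reported command count once, store it, and use that single stored value for both allocation and range checks. struct softc_sketch, clamp_max_cmds(), sketch_attach(), slot_index_valid() and the default cap of 128 are all assumptions made for the example; only the names mfi_max_cmds and mfi_max_fw_cmds come from the driver.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified softc; only the mfi_max_fw_cmds name is real. */
struct softc_sketch {
	uint32_t mfi_max_fw_cmds;	/* limit used for allocation and checks */
	int *cmd_slots;			/* stand-in for the command array */
};

/* Assumed cap standing in for the mfi_max_cmds tunable. */
static uint32_t mfi_max_cmds = 128;

/* Never accept more commands than the configured cap. */
static uint32_t
clamp_max_cmds(uint32_t fw_reported)
{
	return (fw_reported < mfi_max_cmds ? fw_reported : mfi_max_cmds);
}

/* Store the clamped limit, then size the allocation from that same field. */
static int
sketch_attach(struct softc_sketch *sc, uint32_t fw_reported)
{
	assert(sc != NULL);
	sc->mfi_max_fw_cmds = clamp_max_cmds(fw_reported);
	sc->cmd_slots = calloc(sc->mfi_max_fw_cmds, sizeof(*sc->cmd_slots));
	return (sc->cmd_slots == NULL ? -1 : 0);
}

/* Range checks use the stored limit, so they always match the allocation. */
static int
slot_index_valid(const struct softc_sketch *sc, uint32_t idx)
{
	return (idx < sc->mfi_max_fw_cmds);
}
```

If the allocation were sized from the clamped value while range checks used the raw firmware value (or the reverse), an index could pass the check and still land outside the array, which is the class of overrun described above.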

The new patch attached documents all the changes in detail.

I've managed to do one test run so far which failed to reproduce any panics,
so definitely moving in the right direction :)

The machine has now been collected for repair by the supplier but I'm going
to try and get them to put it online for more testing over the weekend.

Given the failure rate so far, if I can do another 4 runs with no panics I'd
be happy that the majority of error conditions are working as expected.

    Regards
    Steve

-------------- next part --------------
A non-text attachment was scrubbed...
Name: zz-mfi-queue.patch
Type: application/octet-stream
Size: 23757 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20121109/53c1de25/attachment.obj>