mfi panic on recused on non-recusive mutex MFI I/O lock

Wed Nov 14 17:59:37 UTC 2012

----- Original Message ----- 
From: "Doug Ambrisko" <ambrisko at ambrisko.com>
To: "Steven Hartland" <killing at multiplay.co.uk>
Cc: <freebsd-scsi at freebsd.org>; <freebsd-stable at freebsd.org>
Sent: Friday, November 09, 2012 5:25 PM
Subject: Re: mfi panic on recused on non-recusive mutex MFI I/O lock

> On Fri, Nov 09, 2012 at 05:06:03PM -0000, Steven Hartland wrote:
> | 
> | ----- Original Message ----- 
> | From: "Steven Hartland"
> | ...
> | >I've just had another panic, trace below, but it doesn't seem to be related
> | >to my changes so I'd appreciate your feedback on them as they are for now.
> | >
> | >While the lock patch fixes the problems I've seen, its not clear to me
> | >why mfi_tbolt_reset is acquiring the lock and hence requiring
> | >mfi_process_fw_state_chg_isr to jump through hoops to ensure locking
> | >around queue manipulation is done correctly. Given what its doing
> | >(resetting the entire adapter) I wouldn't be surprised if it should
> | >really be acquiring the config lock.
> | >
> | >Other things I've noticed / questions
> | >* Should mfi_abort sleep even if its call to mfi_mapcmd fails?
> | >* Should mfi_get_controller_info really ignore the error from mfi_mapcmd?
> | >* Do these controllers not support none 512 byte requests? Currently
> | >all syspd requests are done assuming 512 byte sectors which the disk may
> | >not be. This will both reduce performance or potentially break totally
> | >if the firmware isn't translating it under the surface correctly.
> | >
> | >Anyway the new panic manually transcribed is:-
> | >panic: Bad linx elm 0xffffff0069b0fc0 next->prev != elm
> | >...
> | >mfi_tbolt_get_cmd()
> | >mfi_build_mpt_pass_thru()
> | >mfi_tbolt_build_mpt_cmd()
> | >mfi_tbolt_send_frame()
> | >bus_dmamap_load()
> | >mfi_mapcmd()
> | >mfi_startio()
> | >mfi_syspd_strategy()
> | >g_disk_start()
> | >g_io_schedule_down()
> | >g_down_proc_body()
> | >fork_exit()
> | >fork_trampoline()
> | >
> | >Looks like mfi_cmd_tbolt_tqh has become corrupt some how, but as far as I
> | >can tell all manip is done using the TAILQ macros and under mfi_io_lock
> | >so its not obvious to me at this time why this is, any ideas?
> | 
> | I've gone through looking for the possible cause of this and while there's
> | nothing directly connected to the manip of this queue I've found and fixed
> | quite a large number of additional problems which may have been indirectly
> | causing this problem.
> | 
> | The biggest change is to use mfi_max_cmds to limit the value stored in
> | sc->mfi_max_fw_cmds as this is used extensively throughout the driver
> | for allocation and range checks so having this inconsitently set opened up
> | a large number of possible overrun errors.
> | 
> | The new patch attached documents all the changes in detail.
> | 
> | I've managed to do one test run so far which failed to reproduce any panics,
> | so definitely moving in the right direction :)
> | 
> | The machine has now been collected for repair by the supplier but I'm going
> | to try and get them to put it online for more testing over the weekend.
> | 
> | Given the failure rate so far if I can do another 4 runs with no panics I'd
> | be happy that the majority of error conditions are working as expected.
> 
> Sounds like you have made some good progress.  I looked at your prior locking
> change and they good.  Haven't had time to go through the queue changes
> yet.

Just to update people on this, as its taken quite some time to track down the
random issues causing panics, but I believe I made a breakthrough last night.

It seems that the cleanup interation between mfi_cmd's and tbolt_cmd's is flawed
meaning its possible that tbolt commands are processed after the caller has
already recieved a response, cleaned and returned the mfi_cmd to the free queue.

This means that its anyones guess what the result of the tbolt cleanup is as it
could well be operating on a mfi_cmd thats either now in the free queue or even
worse has already been reused.

It also possible this was the underling issue you may well have seening which
caused you to add the mfi_tbolt_complete_cmd calls to mfi_tbolt_send_frame
in r242681.

If this is correct then I believe the correct fix is to ensure that
mfi_tbolt_return_cmd is only ever called from mfi_release_command thus ensuring
completion ordering is always correct. I'm testing fixes for this theory now
but initial debug has had good results.

The patch of fixes is really growing, so definitely going to need someone to
review in detail when I'm done.

What do you think of the above, does it make sence? Would you be willing to
review the patch when I'm done, before I commit it Doug?

    Regards
    Steve

================================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. 

In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337
or return the E.mail to postmaster at multiplay.co.uk.