Stop scheduler on panic
Kostik Belousov
kostikbel at gmail.com
Thu Nov 17 08:15:41 UTC 2011
On Thu, Nov 17, 2011 at 01:07:38AM +0200, Alexander Motin wrote:
> On 17.11.2011 00:21, Andriy Gapon wrote:
> >on 16/11/2011 21:27 Fabian Keil said the following:
> >>Kostik Belousov<kostikbel at gmail.com> wrote:
> >>
> >>>I was tricked into finishing the work by Andriy Gapon, who developed
> >>>the patch to reliably stop other processors on panic. The patch
> >>>greatly improves the chances of getting a dump on panic on an SMP host.
> >>
> >>I tested the patch trying to get a dump (from the debugger) for
> >>kern/162036, which currently results in the double fault reported in:
> >>http://lists.freebsd.org/pipermail/freebsd-current/2011-September/027766.html
> >>
> >>It didn't help, but also didn't make anything worse.
> >>
> >>Fabian
> >
> >The mi_switch recursion looks very familiar to me:
> >mi_switch() at mi_switch+0x270
> >critical_exit() at critical_exit+0x9b
> >spinlock_exit() at spinlock_exit+0x17
> >mi_switch() at mi_switch+0x275
> >critical_exit() at critical_exit+0x9b
> >spinlock_exit() at spinlock_exit+0x17
> >[several pages of the previous three lines skipped]
> >mi_switch() at mi_switch+0x275
> >critical_exit() at critical_exit+0x9b
> >spinlock_exit() at spinlock_exit+0x17
> >intr_event_schedule_thread() at intr_event_schedule_thread+0xbb
> >ahci_end_transaction() at ahci_end_transaction+0x398
> >ahci_ch_intr() at ahci_ch_intr+0x2b5
> >ahcipoll() at ahcipoll+0x15
> >xpt_polled_action() at xpt_polled_action+0xf7
> >
> >In fact I once discussed with jhb this recursion triggered from a different
> >place. To quote myself:
> ><avg> spinlock_exit -> critical_exit -> mi_switch -> kdb_switch ->
> >thread_unlock -> spinlock_exit -> critical_exit -> mi_switch -> ...
> ><avg> in the kdb context
> ><avg> this issue seems to be triggered by td_owepreempt being true at
> >the time kdb is entered
> ><avg> and there of course has to be an initial spinlock_exit call
> >somewhere
> ><avg> in my case it's because of usb keyboard
> ><avg> I wonder if it would make sense to clear td_owepreempt right
> >before calling kdb_switch in mi_switch
> ><avg> instead of in sched_switch()
> ><avg> clearing td_owepreempt seems like a scheduler-independent
> >operation to me
> ><avg> or is it better to just skip locking in usb when kdb_active is set?
> >
> >The workaround described above should work in this case.
> >Another possibility is to pessimize mtx_unlock_spin() implementations to
> >check SCHEDULER_STOPPED() and to bypass any further actions in that case.
> >But that would add unnecessary overhead to the sunny day code paths.
> >
> >Going further up the stack one can come up with the following proposals:
> >- check SCHEDULER_STOPPED() in swi_sched() and return early
> >- do not call swi_sched() from xpt_done() if we somehow know that we are
> >in a polling mode
>
> There is no flag in CAM now to indicate polling mode, but if needed, it
> should not be difficult to add one and not call swi_sched().
I have had the following change on my test boxes for eons. Without it,
I simply cannot get _any_ dump.
diff --git a/sys/cam/cam_xpt.c b/sys/cam/cam_xpt.c
index 10b89c7..a38e42f 100644
--- a/sys/cam/cam_xpt.c
+++ b/sys/cam/cam_xpt.c
@@ -4230,7 +4230,7 @@ xpt_done(union ccb *done_ccb)
 			TAILQ_INSERT_TAIL(&cam_simq, sim, links);
 			mtx_unlock(&cam_simq_lock);
 			sim->flags |= CAM_SIM_ON_DONEQ;
-			if (first)
+			if (first && panicstr == NULL)
 				swi_sched(cambio_ih, 0);
 		}
 	}
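
The effect of the guard can be illustrated with a small userspace model.
This is only a sketch, not the CAM code: panicstr_model, swi_sched_model(),
xpt_done_model() and run_model() are hypothetical stand-ins invented for
illustration, and the real swi_sched() schedules a software interrupt
thread rather than bumping a counter. The point is just that once a panic
string is set, the completion path stops trying to schedule the SWI, so
the polled dump path never enters the scheduler.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for kernel state; none of this is real CAM code. */
const char *panicstr_model = NULL;	/* models the kernel's panicstr */
int swi_sched_calls = 0;		/* counts modeled swi_sched() calls */

/* Stub for swi_sched(); the real one schedules an interrupt thread. */
void
swi_sched_model(void)
{
	swi_sched_calls++;
}

/* Models the tail of xpt_done() with the guard from the patch applied. */
void
xpt_done_model(int first)
{
	if (first && panicstr_model == NULL)
		swi_sched_model();
}

/* Exercises the model before and after a simulated panic. */
int
run_model(void)
{
	swi_sched_calls = 0;
	panicstr_model = NULL;
	xpt_done_model(1);		/* normal operation: SWI is scheduled */
	assert(swi_sched_calls == 1);
	panicstr_model = "simulated panic";
	xpt_done_model(1);		/* after panic: guard skips the SWI */
	assert(swi_sched_calls == 1);
	return (swi_sched_calls);
}
```

In this model the second xpt_done_model() call leaves the counter
unchanged, which mirrors what the one-line change is meant to do during a
dump: completions are still queued, but nothing is handed to the scheduler.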