Stop scheduler on panic

Attilio Rao attilio at freebsd.org
Tue Dec 6 19:01:10 UTC 2011


2011/11/13 Kostik Belousov <kostikbel at gmail.com>:
> I was tricked into finishing the work by Andrey Gapon, who developed
> the patch to reliably stop other processors on panic.  The patch
> greatly improves the chances of getting a dump on panic on an SMP host.
> Several people have already seen the patchset, and I remember that
> Andrey posted it to some lists.
>
> The change stops the other (*) processors early upon panic.  This way,
> no parallel manipulation of kernel memory is performed by those CPUs.
> In particular, the kernel memory map is static.  The patch also
> prevents the panic thread from blocking and switching out.
>
> * - in the context of the description, other means not current.
>
> Since the other threads no longer run, a lock owner cannot release a
> lock that the panic thread requires.  Due to this, we need to fake
> lock acquisitions after the panic, which adds minimal overhead to the
> locking cost.  The patch tries not to add any overhead on the fast
> path of a lock acquire.  The check for the after-panic condition was
> reduced to a single memory access, done only when the quick cas lock
> attempt fails, and braced with an __unlikely compiler hint.
>
> For now, the new mode of operation is disabled by default, since some
> further USB changes are needed to make USB keyboard usable in that
> environment.
>
> With the patch, getting a dump from a machine without a debugger
> compiled in is much more realistic.  Please comment; I will commit the
> change in 2 weeks unless strong reasons not to are given.
>
> http://people.freebsd.org/~kib/misc/stop_cpus_on_panic.1.patch

Below are my notes on the patch:

- Just by looking at the kern_mutex.c chunk I see that you have added
SCHEDULER_STOPPED() checks in very specific places, usually after
sanity checks by WITNESS (and similar) and sometimes in odd places
(such as the chunk involving _mtx_lock_spin_flags). I think that we
should just skip all the checks along with the hard locking operation.
Ideally we would also skip the fast path, IMHO, but that is impossible
(without polluting it); we can at least skip the vast majority of
operations on the hard one, so that we get, for example:
%svn diff -x -p kern/kern_mutex.c | less
Index: kern/kern_mutex.c
===================================================================
--- kern/kern_mutex.c   (revision 228308)
+++ kern/kern_mutex.c   (working copy)
@@ -232,6 +232,8 @@ void
 _mtx_lock_spin_flags(struct mtx *m, int opts, const char *file, int line)
 {

+       if (SCHEDULER_STOPPED())
+               return;
        MPASS(curthread != NULL);
        KASSERT(m->mtx_lock != MTX_DESTROYED,
            ("mtx_lock_spin() of destroyed mutex @ %s:%d", file, line));

With this in mind, I'd patch the hard functions directly rather than
waiting for the checks to reach the smallest possible common point
(which is _mtx_lock_sleep() and _mtx_lock_spin()). That will make the
patch more verbose, but also more precise and more correct.

- This chunk is unneeded now:
@@ -577,6 +589,7 @@ retry:
 				m->mtx_recurse++;
 				break;
 			}
+
 			lock_profile_obtain_lock_failed(&m->lock_object,
 			    &contested, &waittime);
 			/* Give interrupts a chance while we spin. */

- I'm not entirely sure why we want to disable interrupts at this
point (before stopping the other CPUs):
@@ -547,13 +555,18 @@ panic(const char *fmt, ...)
 {
 #ifdef SMP
 	static volatile u_int panic_cpu = NOCPU;
+	cpuset_t other_cpus;
 #endif
 	struct thread *td = curthread;
 	int bootopt, newpanic;
 	va_list ap;
 	static char buf[256];

-	critical_enter();
+	if (stop_scheduler_on_panic)
+		spinlock_enter();
+	else
+		critical_enter();
+

- In this chunk I don't entirely understand the kdb_active check:
@@ -566,11 +579,18 @@ panic(const char *fmt, ...)
 		    PCPU_GET(cpuid)) == 0)
 			while (panic_cpu != NOCPU)
 				; /* nothing */
+	if (stop_scheduler_on_panic) {
+		if (panicstr == NULL && !kdb_active) {
+			other_cpus = all_cpus;
+			CPU_CLR(PCPU_GET(cpuid), &other_cpus);
+			stop_cpus_hard(other_cpus);
+		}
+	}
 #endif

 	bootopt = RB_AUTOBOOT;
 	newpanic = 0;
-	if (panicstr)
+	if (panicstr != NULL)
 		bootopt |= RB_NOSYNC;
 	else {
 		bootopt |= RB_DUMP;

Is it there to avoid passing an empty mask to stop_cpus() in
kdb_trap() (I saw you changed the policy there)?
Maybe we can find a better integration between the two.
I'd also move the setting of the stop_scheduler variable into the
"if"; it seems a bug to me to have it set otherwise.

- The same reservations expressed about the hard path on mutexes also
apply to rwlocks and sxlocks.

- I'm not sure I like changing the policies on CPU stopping for KDB
with this patchset. I think we should discuss this further, in
particular in terms of reviewing what entry points KDB has and whether
they are all covered.
If someone has already done this analysis and can summarize the
findings, I'd be happy to see them; otherwise we should not commit the
stop_cpus approach in kdb_trap(), IMHO.

Thanks,
Attilio