stop_cpus*() interface

Wed Jun 22 16:43:50 UTC 2011

I would like to propose narrowing the stop_cpus*() interface:

1. Remove the CPU mask/set parameter.  The rationale for this is presented below in a
forwarded message from a private discussion.  You may also see that currently
stop_cpus*() functions are always called with either (1) other_cpus mask or (2)
other_cpus & ~stopped_cpus mask, where (2) is really equivalent to (1) because of (1).

2. Change the return type to void.  Currently the return value of stop_cpus*() is
never handled, and it cannot really be handled meaningfully.  A simple boolean or
errno return value cannot convey which target CPUs were already stopped, which
failed to stop, and why.  I think that it's better to assume that stop_cpus*()
should never fail and to add the necessary diagnostics to catch cases where it
does fail.
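
To make the two points above concrete, here is a rough userspace model of the
narrowed interface.  It is only a sketch: the model_* names and MODEL_NCPU are
hypothetical, the caller's CPU id is passed explicitly (real code would obtain
it itself), and setting a flag stands in for sending an IPI and waiting.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NCPU 4	/* hypothetical CPU count for this model */

static bool model_stopped[MODEL_NCPU];	/* true if the CPU is parked */

/*
 * Model of the narrowed interface: no mask argument and no return value.
 * Every CPU except the caller is stopped.
 */
static void
model_stop_cpus(int curcpu)
{
	int cpu;

	assert(curcpu >= 0 && curcpu < MODEL_NCPU);
	for (cpu = 0; cpu < MODEL_NCPU; cpu++) {
		if (cpu != curcpu)
			model_stopped[cpu] = true;	/* stands in for IPI + wait */
	}
	/* Diagnostic instead of a return value: a failure here is a bug. */
	for (cpu = 0; cpu < MODEL_NCPU; cpu++)
		assert(cpu == curcpu || model_stopped[cpu]);
}

/* Count the CPUs that are still running. */
static int
model_running_count(void)
{
	int cpu, n = 0;

	for (cpu = 0; cpu < MODEL_NCPU; cpu++) {
		if (!model_stopped[cpu])
			n++;
	}
	return (n);
}
```

In this model, calling model_stop_cpus() again while some CPUs are already
stopped is harmless, so the caller never has to mask out stopped CPUs.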

The forwarded message below provides my thoughts on CPU stopping semantics and
additionally presents my analysis of the CPU stopping code in OpenSolaris.

-------- Original Message --------
on 12/05/2011 21:17 Andriy Gapon said the following:
> cpu_hard_stop does stop other CPUs in a hard way.  At least on some archs it is
> really so, e.g. x86 NMI.  This means that stopped CPUs, rather threads that were
> running on them, can be stopped in any kinds of contexts with any kinds of locks
> held, including spinlocks.  Given that fact, it is really unsafe to continue
> using any locks after even one CPU is hard-stopped.  So any remaining running
> CPUs should be put into a special non-locking mode.  This is the reason that we
> invent things like THREAD_PANICED() and use polling mode in kdb context, etc.
> But having more than one CPU, in fact even more than one thread, running in
> non-locking mode is unsafe again - if those CPUs continue execution without any
> synchronization, then they would corrupt shared data.
> Thus, I argue that hard stopping should leave only one CPU and thread running.

Some more thoughts.

I think that the above reasoning applies even to the current soft stopping, to
a certain degree.  Soft stopping would not leave any spinlocks held, true, but
it can still leave other kinds of locks held, e.g. regular mutexes and sx locks.
And that also produces a very special environment in the end.
So in my opinion the current soft stopping should also always stop all other CPUs.

I think that eventually we will need a "really soft", graceful stopping mechanism.
That mechanism would rebind all interrupts away from the CPU being stopped, would
migrate all (non-special) threads away from the CPU, would instruct the scheduler
not to run any threads on the CPU, would remove it from any active CPU sets, etc.
Now, this mechanism should really be of a targeted variety, no doubt.

I also would like to share some of my observations of OpenSolaris code.
This is not to try to give any support to my proposals - after all we are not
Solaris, but FreeBSD - but simply to share some ideas.

In OpenSolaris I've noticed three separate CPU stopping mechanisms so far.  I am
sure that they have more :-)

1. Stopping by debugger.  This is very similar to our hard stopping (in their
x86 code[*]).  All other CPUs are always stopped.  One difference is that the
stopped CPUs run a special command loop while spinning.  The master CPU can send
a few commands to the slave CPUs.  Examples:  the master can tell a slave, if
it's a BSP, to reset a system; the master can tell a slave to become a new
master (I think that this is somewhat equivalent to "thread N" command in gdb).
All commands:
#define     KMDB_DPI_CMD_RESUME_ALL         1       /* Resume all CPUs */
#define     KMDB_DPI_CMD_RESUME_MASTER      2       /* Resume only master CPU */
#define     KMDB_DPI_CMD_RESUME_UNLOAD      3       /* Resume for debugger unload */
#define     KMDB_DPI_CMD_SWITCH_CPU         4       /* Switch to another CPU */
#define     KMDB_DPI_CMD_FLUSH_CACHES       5       /* Flush slave caches */
#define     KMDB_DPI_CMD_REBOOT             6       /* Reboot the machine */
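
A rough sketch of how a stopped slave CPU could dispatch these commands while
spinning is below.  This is not the actual kmdb code: the slave_* names, the
action enum and the mapping from commands to actions are my assumptions, based
only on the comments next to the defines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Command values copied from the list above. */
#define KMDB_DPI_CMD_RESUME_ALL		1
#define KMDB_DPI_CMD_RESUME_MASTER	2
#define KMDB_DPI_CMD_RESUME_UNLOAD	3
#define KMDB_DPI_CMD_SWITCH_CPU		4
#define KMDB_DPI_CMD_FLUSH_CACHES	5
#define KMDB_DPI_CMD_REBOOT		6

/* Hypothetical outcomes of handling one command on a slave CPU. */
enum slave_action {
	SLAVE_KEEP_SPINNING,
	SLAVE_RESUME,
	SLAVE_BECOME_MASTER,
	SLAVE_REBOOT
};

/* Decide what a slave does with one command (assumed mapping). */
static enum slave_action
slave_handle_cmd(int cmd, bool is_bsp, bool is_target)
{
	switch (cmd) {
	case KMDB_DPI_CMD_RESUME_ALL:
	case KMDB_DPI_CMD_RESUME_UNLOAD:
		return (SLAVE_RESUME);
	case KMDB_DPI_CMD_SWITCH_CPU:
		/* Only the CPU selected by the master takes over. */
		return (is_target ? SLAVE_BECOME_MASTER : SLAVE_KEEP_SPINNING);
	case KMDB_DPI_CMD_REBOOT:
		/* Only the BSP is asked to reset the system. */
		return (is_bsp ? SLAVE_REBOOT : SLAVE_KEEP_SPINNING);
	case KMDB_DPI_CMD_RESUME_MASTER:	/* slaves stay stopped */
	case KMDB_DPI_CMD_FLUSH_CACHES:		/* flush, then keep waiting */
	default:
		return (SLAVE_KEEP_SPINNING);
	}
}

/* Process queued commands until one of them ends the spin loop. */
static enum slave_action
slave_loop(const int *cmds, size_t ncmds, bool is_bsp, bool is_target)
{
	size_t i;

	assert(cmds != NULL || ncmds == 0);
	for (i = 0; i < ncmds; i++) {
		enum slave_action a;

		a = slave_handle_cmd(cmds[i], is_bsp, is_target);
		if (a != SLAVE_KEEP_SPINNING)
			return (a);
	}
	return (SLAVE_KEEP_SPINNING);
}
```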

2. Stopping for panic.  This is very similar to our hard stopping (in their x86
code[*]).  All other CPUs are always stopped.  But this is done via different
code than what the debugger uses; I am not sure why, maybe it is a historic legacy.
The difference from our code and the debugger code is that the stopped CPUs run a
different stop loop and may do some useful panic work.  E.g. my understanding is
that they can be used for compressing a dump image (yes, they compress their dumps,
for disk writing speed I guess).

3. Something remotely similar to our current soft stopping.  A big difference is
that they have special "pause" threads per CPU.  This mechanism activates those
threads; the threads make themselves non-preemptable, disable interrupts and
block on some sort of a semaphore until they are told to resume.  I am not sure what
advantage, if any, this mechanism gives them compared to our approach.
The mechanism is invoked via a pause_cpus() call.  It is used mainly to change
the state of CPUs (some per-CPU data), e.g. configuring idle hooks or power
management.

[!] BTW, they also use this mechanism when onlining/offlining CPUs to avoid
locking in normal paths.  That is, for instance, they stop/pause all CPUs, mark
a target CPU as offline, and then restart all CPUs.  This way they don't need
any locking when checking (and changing) CPU status.  Of course, they also do
all the reasonable things to do - unbinding interrupts, moving away threads, etc.
The mechanism is also used for their checkpoint-resume code (which is used by
suspend/resume) and in their shutdown/reboot path.
This CPU stopping mechanism also always stops all other CPUs.
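
The pause / mark offline / restart sequence described above can be modelled
roughly as follows.  This is a sequential userspace sketch, not the OpenSolaris
code: all model_* names are hypothetical and setting a flag stands in for
activating a pause thread.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NCPU 4	/* hypothetical CPU count for this model */

static bool model_paused[MODEL_NCPU];
static bool model_online[MODEL_NCPU] = { true, true, true, true };

/* Pause every CPU except the caller. */
static void
model_pause_others(int curcpu)
{
	int cpu;

	for (cpu = 0; cpu < MODEL_NCPU; cpu++) {
		if (cpu != curcpu)
			model_paused[cpu] = true;
	}
}

/* Let all paused CPUs continue. */
static void
model_resume_all(void)
{
	int cpu;

	for (cpu = 0; cpu < MODEL_NCPU; cpu++)
		model_paused[cpu] = false;
}

/* Mark a target CPU offline without taking any lock. */
static void
model_cpu_offline(int curcpu, int target)
{
	assert(curcpu != target);
	model_pause_others(curcpu);
	/* Only curcpu runs here, so a plain store cannot race with readers. */
	model_online[target] = false;
	model_resume_all();
}
```

The point of the design is that code checking the online status in normal
paths needs no lock, because the status only changes while that code cannot
be running.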

[*] Another difference to note is that they don't use NMI for their equivalents
of our hard stopping.  They still have the notion of interrupt levels and
various spl* stuff.  So they just have a normal interrupt with highest priority
to penetrate protected contexts.  E.g. in their equivalent of spinlock_enter()
they do not outright disable interrupts, but set current level to a special
'LOCK' level which inhibits all typical (hardware and IPI) interrupts.  This
mechanism adds another degree of freedom to their implementation, as such it
complicates code and logic, but also adds some flexibility.
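
A tiny model of the priority-level idea described above, to show why a
highest-priority stop interrupt still gets through a 'LOCK' level section.
The level names and numeric values are made up for illustration and are not
the real Solaris ones.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical priority levels; the numeric values are illustrative only. */
enum model_ipl {
	MODEL_IPL_BASE = 0,
	MODEL_IPL_DEVICE = 5,	/* typical hardware interrupts */
	MODEL_IPL_IPI = 9,	/* typical IPIs */
	MODEL_IPL_LOCK = 10,	/* level held inside spinlock sections */
	MODEL_IPL_STOP = 15	/* highest priority, used to stop CPUs */
};

static int model_cur_ipl = MODEL_IPL_BASE;

/* Raise the current level and return the previous one. */
static int
model_raise_ipl(int newipl)
{
	int old = model_cur_ipl;

	assert(newipl >= old);
	model_cur_ipl = newipl;
	return (old);
}

/* Restore a level previously returned by model_raise_ipl(). */
static void
model_restore_ipl(int oldipl)
{
	model_cur_ipl = oldipl;
}

/* An interrupt is delivered only if it outranks the current level. */
static bool
model_intr_deliverable(int intr_ipl)
{
	return (intr_ipl > model_cur_ipl);
}
```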

I hope that there is something useful for you and FreeBSD in this lengthy overview.

-- 
Andriy Gapon