svn commit: r328110 - head/sys/kern

Bruce Evans brde at optusnet.com.au
Mon Jan 22 08:47:57 UTC 2018


On Sun, 21 Jan 2018, Eric van Gyzen wrote:

> On 01/18/2018 01:38, Wojciech Macek wrote:
>> Author: wma
>> Date: Thu Jan 18 07:38:54 2018
>> New Revision: 328110
>> URL: https://svnweb.freebsd.org/changeset/base/328110
>> 
>> Log:
>>    KDB: restart only CPUs stopped by KDB
>> 
>
>> @@ -707,8 +708,9 @@ kdb_trap(int type, int code, struct trapframe *tf)
>>   	kdb_active--;
>>     #ifdef SMP
>> +	CPU_AND(&other_cpus, &stopped_cpus);
>
> Wojciech,
>
> Coverity told me that other_cpus is used uninitialized here (CID 1385265).

This is harmless in practice (though in theory using an uninitialized
variable gives undefined behaviour) since stopped_cpus is only used
(2 lines later) when other_cpus was initialized.  The quick fix is to move
this AND under the condition where stopped_cpus will be used.

However this AND is either unnecessary or racy.  other_cpus (if we stopped
anything) already has the map of cpus that we tried to stop.  If we didn't
stop them all, then there are various bugs.  If we did stop them all but
some restarted, then this is due to race bugs, but starting them again is
harmless unless it gives further races.  Many of these bugs are fixed in
my version by:
- not giving up trying to stop cpus
- holding the stopping_cpu lock across stop/restart.

It is an error to modify stopped_cpus here at all in -current, since the
stopping_cpus lock is not held yet.  CPU_AND() is not even atomic.
CPU_AND_ATOMIC() would be less racy.  However, in the usual case which
was the only case before the previous commit, other_cpus is all other
cpus so no other cpu can race with us except in the buggy case where we
gave up waiting.

It is also an error to just access stopped_cpus here.  The access was
restart_cpus(stopped_cpus).  stopped_cpus is a (rather large) struct which
is passed by copying it.  There is no locking or even atomic accesses for
the copying.  stopped_cpus is volatile, but that doesn't do anything useful.

These bugs were missing for the restart_cpus(other_cpus) operation.
other_cpus was just all_cpus less the current cpu.  all_cpus isn't volatile
and is hopefully stable at this point.  The previous commit added an
unlocked non-atomic CPU_NAND(&other_cpus, &stopped_cpus).  This might be
racy precisely when it is not null -- that is, when stopped_cpus is 
non-empty.  When stopped_cpus is non-empty, that might be because it is
under construction and is changing underneath us.

My fixes probably make the non-empty initial stopped_cpus case a deadlocked
or sure race case.  Whatever stopped the cpus should still be holding the
stopping_cpus lock.  stop_cpus() will wither deadlock waiting for the cpu
holding the lock, or the other cpu will drop the lock only after restarting
all the cpus that it stopped, so stopped_cpus changes from nonempty to
empty underneath ys.

generic_restart_cpus() has similar races in -current.  It doesn't even
re-acquire the stopped_cpus lock.  It modifies stopped_cpus indirectly
by setting started_cpus to direct the stopped cpus to leave their loop
and clear their stopped bit.  This races with generic_stop_cpus().  kdb
entry can give a thundering herd of cpus entering and exiting kdb for
the same breakpoint.  Restarting always gives races of the herd members
doing restarting with the single herd member allowed to do stopping.

Bruce


More information about the svn-src-all mailing list