proposed smp_rendezvous change

Sun May 15 16:36:49 UTC 2011

on 15/05/2011 18:16 John Baldwin said the following:
> On 5/15/11 10:53 AM, Andriy Gapon wrote:
>> on 15/05/2011 10:12 Andriy Gapon said the following:
>>> on 14/05/2011 18:25 John Baldwin said the following:
>>>> Hmmm, so this is not actually sufficient.  NetApp ran into a very similar race
>>>> with virtual CPUs in BHyVe.  In their case because virtual CPUs are threads
>>>> that
>>>> can be preempted, they have a chance at a longer race.
>>>>
>>>> The problem that they see is that even though the values have been updated, the
>>>> next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before
>>>> one of
>>>> the other CPUs notices that it has finished.
>>>
>>> As a follow up to my previous question.  Have you noticed that in my patch no
>>> slave CPU actually waits/spins on smp_rv_waiters[2]?  It's always only master
>>> CPU (and under smp_ipi_mtx).
>>>
>>
>> Here's a cleaner version of my approach to the fix.
>> This one does not remove the initial wait on smp_rv_waiters[0] in
>> smp_rendezvous_action() and thus does not renumber all smp_rv_waiters[] members
>> and thus hopefully should be clearer.
>>
>> Index: sys/kern/subr_smp.c
>> ===================================================================
>> --- sys/kern/subr_smp.c    (revision 221943)
>> +++ sys/kern/subr_smp.c    (working copy)
>> @@ -110,7 +110,7 @@ static void (*volatile smp_rv_setup_func)(void *ar
>>   static void (*volatile smp_rv_action_func)(void *arg);
>>   static void (*volatile smp_rv_teardown_func)(void *arg);
>>   static void *volatile smp_rv_func_arg;
>> -static volatile int smp_rv_waiters[3];
>> +static volatile int smp_rv_waiters[4];
>>
>>   /*
>>    * Shared mutex to restrict busywaits between smp_rendezvous() and
>> @@ -338,11 +338,15 @@ smp_rendezvous_action(void)
>>
>>       /* spin on exit rendezvous */
>>       atomic_add_int(&smp_rv_waiters[2], 1);
>> -    if (local_teardown_func == smp_no_rendevous_barrier)
>> +    if (local_teardown_func == smp_no_rendevous_barrier) {
>> +        atomic_add_int(&smp_rv_waiters[3], 1);
>>                   return;
>> +    }
>>       while (smp_rv_waiters[2]<  smp_rv_ncpus)
>>           cpu_spinwait();
>>
>> +    atomic_add_int(&smp_rv_waiters[3], 1);
>> +
>>       /* teardown function */
>>       if (local_teardown_func != NULL)
>>           local_teardown_func(local_func_arg);
>> @@ -377,6 +381,9 @@ smp_rendezvous_cpus(cpumask_t map,
>>       /* obtain rendezvous lock */
>>       mtx_lock_spin(&smp_ipi_mtx);
>>
>> +    while (smp_rv_waiters[3]<  smp_rv_ncpus)
>> +        cpu_spinwait();
>> +
>>       /* set static function pointers */
>>       smp_rv_ncpus = ncpus;
>>       smp_rv_setup_func = setup_func;
>> @@ -385,6 +392,7 @@ smp_rendezvous_cpus(cpumask_t map,
>>       smp_rv_func_arg = arg;
>>       smp_rv_waiters[1] = 0;
>>       smp_rv_waiters[2] = 0;
>> +    smp_rv_waiters[3] = 0;
>>       atomic_store_rel_int(&smp_rv_waiters[0], 0);
>>
>>       /* signal other processors, which will enter the IPI with interrupts off */
> 
> Ahh, so the bump is after the change.  I do think this will still be ok
> and I probably just didn't explain it well to Neel.  I wonder though
> if the bump shouldn't happen until after the call of the local teardown
> function?

That should not be necessary according to my understanding unless we have a
reason to care about potentially disappearing function text.  Teardown function
pointer and argument are locally cached and I don't see any problems with a new
rendezvous being setup and run while teardowns of the previous one are still
running.

On the other hand, you might be onto something here.  If the argument is
allocated in temporary storage (e.g. on stack) and the teardown function
actually uses the argument, then it's not safe to deallocate the argument until
all of the teardowns are done.  In this case the master CPU should wait not only
until all actions are done, but also until all teardowns are done.
The current code doesn't do that.  And I am not sure if we actually would want
to add this safety net or just document the behavior and put a burden on
smp_rendezvous API users.

BTW, I don't think that we ever use smp_rendezvous() in its full glory, i.e.
with non-trivial setup, action and teardown (except perhaps for jkim's stress
test).  OpenSolaris seems to be doing just fine with the following three types
of simpler CPU cross-calls (as they call them):
1. async - a master CPU requests an action to be executed on target CPUs but
doesn't wait for anything
- we have no explicit equivalent for this

2. call - a master CPU requests an action to be executed on target CPUs and
waits until all the target CPUs have finished executing it
- this would be equivalent smp_rendezvous_cpus(no_barrier, action, no_barrier)

3. synch - a master CPU requests an action to be executed on target CPUs and
waits until all the target CPUs have finished executing it and the slave CPUs
are also waiting for their siblings to be done
- this would be equivalent smp_rendezvous_cpus(no_barrier, action, NULL)

Seems that our current code doesn't use/need more than that as well.
But I really like the capability that smp_rendezvous_cpus() potentially provides.
-- 
Andriy Gapon