SCHED_ULE race condition, fix proposal

Kubilay Kocak koobs at FreeBSD.org
Thu Jan 28 02:08:27 UTC 2016


On 28/01/2016 4:18 AM, Wojciech Macek wrote:
> Hello,
> 
> I've encountered a very nasty race condition during debugging armv8
> HWPMC. It seems that ULE scheduler can execute the same thread on two
> different CPUs at the same time... 
> 
> Here is the scenario.
> The PMC driver must execute some of the code on the CPU0. To ensure
> that, a process migration is triggered as following:
> 
> 
>     thread_lock(curthread);
>     sched_bind(curthread, cpu);
>     thread_unlock(curthread);
> 
>     KASSERT(curthread->td_oncpu == cpu,
>      ("[pmc,%d] CPU not bound [cpu=%d, curr=%d]", __LINE__,
>      cpu, curthread->td_oncpu));
> 
> 
> That causes the context switch and (finally) execution of sched_switch()
> function. The code correctly detects migration and calls
> sched_switch_migrate. That function is supposed to add current thread to
> the runqueue of another CPU ("tdn" variable). So it does:
> 
>     tdq_lock_pair(tdn, tdq);
>     tdq_add(tdn, td, flags);
>     tdq_notify(tdn, td);
>     TDQ_UNLOCK(tdn);
>     spinlock_exit();
> 
> 
> But that sometimes is causing a crash, because the other CPU is staring
> to process mi_switch as soon as the IPI arrives (via tdq_notify) and the
> runqueue lock is released. The problem is, that the thread does not
> contain valid register set, because its context was not yet stored -
> that happens later in machine dependent cpu_switch function. In another
> words, the sched_switch run on the CPU we want the thread to migrate
> onto restores thread context before it was actually stored on another
> core - that causes setting regs/pc/lt to some junk data and crash.
> 
> 
> I'd like to discuss a possible solution for this. I think it would be
> reasonable to extend cpu_switch to be capable of releasing a lock as the
> last thing it does after storing everything into the PCB. We could then
> remove the "TDQ_UNLOCK(tdn);" from the sched_switch_migrate and be sure
> that in the situation of migration nobody is allowed to touch the target
> runqueue until the migrating process finishes storing its context.
> 
> But first I'd like to discuss some possible alternatives and maybe find
> another solution, because any change in this area will impact all
> supported architectures.
> 
> 
> Regards,
> Wojtek
> wma at freebsd.org <mailto:wma at freebsd.org>

Can you create an issue in bugzilla to track this?


More information about the freebsd-arm mailing list