stop_cpus_hard when multiple CPUs are panicking from an NMI

Attilio Rao attilio at freebsd.org
Fri Nov 16 12:30:15 UTC 2012


On Fri, Nov 16, 2012 at 7:54 AM, Andriy Gapon <avg at freebsd.org> wrote:
> on 16/11/2012 00:58 Ryan Stone said the following:
>> At work we have some custom watchdog hardware that sends an NMI upon
>> expiry.  We've modified the kernel to panic when it receives the watchdog
>> NMI.  I've been trying the "stop_scheduler_on_panic" mode, and I've
>> discovered that when my watchdog expires, the system gets completely
>> wedged.  After some digging, I've discovered is that I have multiple CPUs
>> getting the watchdog NMI and trying to panic concurrently.  One of the CPUs
>> wins, and the rest spin forever in this code:
>>
>> /*
>>      * We don't want multiple CPU's to panic at the same time, so we
>>      * use panic_cpu as a simple spinlock.  We have to keep checking
>>      * panic_cpu if we are spinning in case the panic on the first
>>      * CPU is canceled.
>>      */
>>     if (panic_cpu != PCPU_GET(cpuid))
>>         while (atomic_cmpset_int(&panic_cpu, NOCPU,
>>             PCPU_GET(cpuid)) == 0)
>>             while (panic_cpu != NOCPU)
>>                 ; /* nothing */
>>
>> The system wedges when stop_cpus_hard() is called, which sends NMIs to all
>> of the other CPUs and waits for them to acknowledge that they are stopped
>> before returning.  However the CPU will not deliver an NMI to a CPU that is
>> already handling an NMI, so the other CPUs that got a watchdog NMI and are
>> spinning will never go into the NMI handler and acknowledge that they are
>> stopped.
>
> I thought about this issue and fixed (in my tree) in a different way:
> http://people.freebsd.org/~avg/cpu_stop-race.diff
>
> The need for spinlock_enter in the patch in not entirely clear.
> The main idea is that a CPU which calls cpu_stop and loses a race should
> voluntary enter cpustop_handler.
> I am also not sure about MI-cleanness of this patch.

It is similar to what I propose but with some differences:
- It is not clean from MI perspective
- I don't think we need to treact it specially, I would just
unconditionally stop all the CPUs entering in the "spinlock zone",
making the patch simpler.

So I guess you are really fine with the proposal I made?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


More information about the freebsd-hackers mailing list