7.2-release/amd64: panic, spin lock held too long

Thu Jul 16 07:06:02 UTC 2009

Attilio Rao wrote:
> 2009/7/8 Dan Naumov <dan.naumov at gmail.com>:
>> On Wed, Jul 8, 2009 at 3:57 AM, Dan Naumov<dan.naumov at gmail.com> wrote:
>>> On Tue, Jul 7, 2009 at 4:27 AM, Attilio Rao<attilio at freebsd.org> wrote:
>>>> 2009/7/7 Dan Naumov <dan.naumov at gmail.com>:
>>>>> On Tue, Jul 7, 2009 at 4:18 AM, Attilio Rao<attilio at freebsd.org> wrote:
>>>>>> 2009/7/7 Dan Naumov <dan.naumov at gmail.com>:
>>>>>>> I just got a panic followed by a reboot a few seconds after running
>>>>>>> "portsnap update". /var/log/messages shows the following:
>>>>>>>
>>>>>>> Jul  7 03:49:38 atom syslogd: kernel boot file is /boot/kernel/kernel
>>>>>>> Jul  7 03:49:38 atom kernel: spin lock 0xffffffff80b3edc0 (sched lock
>>>>>>> 1) held by 0xffffff00017d8370 (tid 100054) too long
>>>>>>> Jul  7 03:49:38 atom kernel: panic: spin lock held too long
>>>>>> That's a known bug, affecting -CURRENT as well.
>>>>>> The cpustop IPI is handled through an NMI, which means it could
>>>>>> interrupt a CPU at any moment, even while holding a spinlock,
>>>>>> violating a well-known FreeBSD rule.
>>>>>> That means the CPU can stop itself while its thread is holding
>>>>>> the sched lock spinlock, never releasing it (there is no way, short
>>>>>> of something highly hackish, to fix that).
>>>>>> In the meantime hardclock() wants to schedule something else to run
>>>>>> and gets stuck on the thread lock.
>>>>>>
>>>>>> The ideal fix would involve not using an NMI for serving the cpustop,
>>>>>> while having a cheap way (one that does not burden the common path)
>>>>>> to tell hardclock() to avoid scheduling while cpustop is in flight.
>>>>>>
>>>>>> Thanks,
>>>>>> Attilio
>>>>> Any idea if a fix is being worked on, and how unlucky one must be to
>>>>> run into this issue? Should I expect it to happen again? Is it
>>>>> basically completely random?
>>>> I'd like to work on that issue before BETA3 (and backport to
>>>> STABLE_7), I'm just time-constrained right now.
>>>> It is completely random.
>>>>
>>>> Thanks,
>>>> Attilio
>>> OK, this is getting pretty bad: 23 hours later I got the same kind of
>>> panic. The only difference is that instead of "portsnap update", this
>>> was triggered by "portsnap cron", which I have running between 3 and 4
>>> am every day:
>>>
>>> Jul  8 03:03:49 atom kernel: ssppiinn  lloocckk
>>> 00xxffffffffffffffff8800bb33eeddc400  ((sscchheedd  lloocck k1 )0 )h
>>> ehledl db yb y 0x0xfffffffffff0f00001081735339760e 0( t(itdi d
>>> 10100006070)5 )t otoo ol olnogng
>>> Jul  8 03:03:49 atom kernel: p
>>> Jul  8 03:03:49 atom kernel: anic: spin lock held too long
>>> Jul  8 03:03:49 atom kernel: cpuid = 0
>>> Jul  8 03:03:49 atom kernel: Uptime: 23h2m38s
>> I have now tried reproducing the problem by running "stress --cpu 8 --io
>> 8 --vm 4 --vm-bytes 1024M --timeout 600s --verbose", which pushed
>> system load into the 15.50 ballpark, while simultaneously running
>> "portsnap fetch" and "portsnap update", but I couldn't manually trigger
>> the panic. It seems that this problem is indeed random (although it
>> baffles me why it is specifically portsnap triggering it). I have now
>> disabled powerd to check whether that makes any difference to system
>> stability.
> 
> But is that happening at reboot time?
> 
> Thanks,
> Attilio
> 

I think I am also having a similar problem on my Atom machine
(FreeBSD-7.2-Release-p1).
It does not happen at boot/reboot; the machine panics at random.
I also found that it has remained stable for more than a month now, ever
since I disabled powerd... (although I would like to have it enabled)
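
A side note on the garbled syslog line quoted above ("ssppiinn  lloocckk"):
the doubled characters look like two CPUs writing the same message to the
console at the same time. That reading is only an assumption; nothing in
this thread confirms it. A minimal Python sketch for checking whether a
fragment is a clean character-by-character doubling (undouble is a
hypothetical helper name, not an existing tool):

```python
def undouble(text):
    """Return the single-copy message if every character in `text`
    appears twice in a row; return None otherwise.

    Assumption: two writers emitted the same message in lockstep,
    one character each at a time."""
    if len(text) % 2 != 0:
        return None  # an odd length cannot be a perfect doubling
    first, second = text[0::2], text[1::2]
    return first if first == second else None


# First fragment of the Jul 8 log line quoted above.
print(undouble("ssppiinn  lloocckk"))
```

Only the start of that log line doubles cleanly; the later fragments drift
out of step, so this check returns None for them and they have to be read
by eye.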

--
C.C.