panic: spin lock held too long (RELENG_8 from today)

Chip Camden sterling at camdensoftware.com
Wed Aug 17 17:52:07 UTC 2011


Quoth Hiroki Sato on Thursday, 18 August 2011:
> Hi,
> 
> Mike Tancsa <mike at sentex.net> wrote
>   in <4E15A08C.6090407 at sentex.net>:
> 
> mi> On 7/7/2011 7:32 AM, Mike Tancsa wrote:
> mi> > On 7/7/2011 4:20 AM, Kostik Belousov wrote:
> mi> >>
> mi> >> BTW, we had a similar panic, "spinlock held too long", the spinlock
> mi> >> is the sched lock N, on busy 8-core box recently upgraded to the
> mi> >> stable/8. Unfortunately, machine hung dumping core, so the stack trace
> mi> >> for the owner thread was not available.
> mi> >>
> mi> >> I was unable to make any conclusion from the data that was present.
> mi> >> If the situation is reproducible, you could try to revert r221937. This
> mi> >> is pure speculation, though.
> mi> >
> mi> > Another crash just now after 5hrs uptime. I will try and revert r221937
> mi> > unless there is any extra debugging you want me to add to the kernel
> mi> > instead?
> 
>  I am also suffering from a reproducible panic on an 8-STABLE box, an
>  NFS server with heavy I/O load.  I could not get a kernel dump
>  because this panic locked up the machine just after it occurred, but
>  according to the stack trace it was the same as the posted one.
>  Switching to an 8.2R kernel avoids this panic.
> 
>  Any progress on the investigation?
> 
> --
> spin lock 0xffffffff80cb46c0 (sched lock 0) held by 0xffffff01900458c0 (tid 100489) too long
> panic: spin lock held too long
> cpuid = 1
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
> kdb_backtrace() at kdb_backtrace+0x37
> panic() at panic+0x187
> _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
> _mtx_lock_spin() at _mtx_lock_spin+0x9e
> sched_add() at sched_add+0x117
> setrunnable() at setrunnable+0x78
> sleepq_signal() at sleepq_signal+0x7a
> cv_signal() at cv_signal+0x3b
> xprt_active() at xprt_active+0xe3
> svc_vc_soupcall() at svc_vc_soupcall+0xc
> sowakeup() at sowakeup+0x69
> tcp_do_segment() at tcp_do_segment+0x25e7
> tcp_input() at tcp_input+0xcdd
> ip_input() at ip_input+0xac
> netisr_dispatch_src() at netisr_dispatch_src+0x7e
> ether_demux() at ether_demux+0x14d
> ether_input() at ether_input+0x17d
> em_rxeof() at em_rxeof+0x1ca
> em_handle_que() at em_handle_que+0x5b
> taskqueue_run_locked() at taskqueue_run_locked+0x85
> taskqueue_thread_loop() at taskqueue_thread_loop+0x4e
> fork_exit() at fork_exit+0x11f
> fork_trampoline() at fork_trampoline+0xe
> --
> 
> -- Hiroki
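
For anyone trying to read that backtrace: the em(4) taskqueue thread is
pushing a received TCP segment up into the NFS server's socket upcall, and
cv_signal() ends up in sched_add(), which spins on a scheduler spin lock
until the kernel gives up waiting.  My understanding is that the check
that fires has roughly this shape (a sketch only, with an invented
threshold and function name; the real code lives in
sys/kern/kern_mutex.c):

    #include <sys/param.h>
    #include <sys/systm.h>          /* panic() */
    #include <machine/atomic.h>     /* atomic_cmpset_int() */
    #include <machine/cpu.h>        /* cpu_spinwait() */

    #define SPIN_LIMIT      10000000        /* invented threshold */

    /*
     * Sketch of a spin-mutex acquire with a "held too long" check.
     * Not the real _mtx_lock_spin(); just the shape of it.
     */
    static void
    spin_lock_sketch(volatile u_int *lock, const char *name)
    {
            u_int spins = 0;

            while (atomic_cmpset_int(lock, 0, 1) == 0) {
                    /* Owner still holds it; busy-wait and count. */
                    while (*lock != 0) {
                            if (++spins > SPIN_LIMIT)
                                    panic("spin lock %s held too long",
                                        name);
                            cpu_spinwait();     /* PAUSE hint on x86 */
                    }
            }
    }

So the panic means some other thread sat on sched lock 0 past the
threshold; the interesting question is what that owner (tid 100489) was
doing, which is exactly the part the hung dump didn't capture.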


I'm also getting similar panics on 8.2-STABLE.  Each one locks up
everything and I have to power off.  Once, I happened to be looking at the
console when it happened and copied down the following:

Sleeping thread (tid 100037, pid 0) owns a non-sleepable lock
panic: sleeping thread
cpuid = 1
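
As I understand it, that message means a thread went to sleep while
holding a lock that must never be held across a sleep, so the next thread
to contend on the lock finds its owner asleep and the kernel panics.  The
textbook shape of the bug looks something like this (made-up lock name,
just a sketch; with WITNESS compiled in you get caught right at the
tsleep() call, without it the panic shows up later in the contending
thread):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    static struct mtx bug_mtx;
    MTX_SYSINIT(bug_mtx_init, &bug_mtx, "bug_mtx", MTX_DEF);

    static void
    buggy_path(void *chan)
    {
            mtx_lock(&bug_mtx);
            /*
             * BUG: sleeping with a non-sleepable (default) mutex held.
             * A thread that later blocks on bug_mtx finds its owner
             * asleep -- that is the "Sleeping thread ... owns a
             * non-sleepable lock" panic above.
             */
            tsleep(chan, PWAIT, "bugslp", hz);
            mtx_unlock(&bug_mtx);
    }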

Another time I got:

lock order reversal:
1st 0xffffff000593e330 snaplk (snaplk) @ /usr/src/sys/kern/vfs_vnops.c:296
2nd 0xffffff0005e5d578 ufs (ufs) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:1587

I didn't copy down the traceback.
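
The LOR itself is only a WITNESS warning, not a panic: two code paths
acquire the snapshot lock and the ufs vnode lock in opposite orders, which
is a latent deadlock.  In miniature, with plain mutexes standing in for
what are really lockmgr locks:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    /* Hypothetical locks standing in for snaplk and the ufs lock. */
    static struct mtx lock_a, lock_b;

    static void
    path_one(void)
    {
            mtx_lock(&lock_a);
            mtx_lock(&lock_b);      /* WITNESS learns: A before B */
            /* ... work ... */
            mtx_unlock(&lock_b);
            mtx_unlock(&lock_a);
    }

    static void
    path_two(void)
    {
            mtx_lock(&lock_b);
            mtx_lock(&lock_a);      /* B before A: the reversal */
            /* ... work ... */
            mtx_unlock(&lock_a);
            mtx_unlock(&lock_b);
    }

Whether that LOR is connected to the hangs I can't say, but it was on the
console around the same time.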

These panics seem to hit when I'm doing heavy WAN I/O.  I can go for
about a day without one as long as I stay away from the web or even chat.
Last night this system copied a 35GB backup over the local network
without failing, but as soon as I hopped onto Firefox this morning, down
she went.  I don't know if that's coincidence or useful data.

I didn't get to say "Thanks" to Eitan Adler for attempting to help me
with this on Monday night.  Thanks, Eitan!

-- 
.O. | Sterling (Chip) Camden      | http://camdensoftware.com
..O | sterling at camdensoftware.com | http://chipsquips.com
OOO | 2048R/D6DBAF91              | http://chipstips.com