Infinite loop bug in libc_r on 4.x with condition variables and signals

Thu Oct 21 14:15:04 PDT 2004

On Thu, 21 Oct 2004 12:54:22 -0400, John Baldwin <jhb at FreeBSD.org> wrote:

> On Wednesday 20 October 2004 05:39 pm, Daniel Eischen wrote:
>> On Wed, 20 Oct 2004, John Baldwin wrote:
>> > We are trying to run mono on 4.x and are having problems with the process
>> > getting stuck spinning in an infinite loop.  After some debugging, we
>> > determined that the problem is that the condition variable thread queues
>> > are getting corrupted due to threads being added to a queue while they
>> > are already queued on another queue.  For example, if a thread is somehow
>> > on c1's queue but runs and blocks on c2, later when c1 tries to do a
>> > broadcast, it tries to remove all the waiters to wake them up doing
>> > something like:
>> >
>> > 	while ((head = TAILQ_FIRST(&c1->c_queue)) != NULL) {
>> > 		TAILQ_REMOVE(&c1->c_queue, head, sqe);
>> > 		/* ... wake up head ... */
>> > 	}
>> >
>> > The problem is that since the thread was last added to c2's queue, his
>> > tqe_prev pointer in his sqe TAILQ_ENTRY points to an item on c2's list,
>> > and thus the c_queue.tqh_first pointer doesn't get updated by
>> > TAILQ_REMOVE, so the thread just "sticks" on c1's head pointer and it
>> > spins forever.
>> >
>> > We seem to have tracked this down to some sort of bug related to
>> > signals and condition variables.  It seems that we try to go handle a
>> > signal while we are on a condition variable queue, but not in
>> > PS_COND_WAIT, so _cond_wait_backout() is not called to remove the
>> > thread from the queue.  I tried deferring signals around the cond queue
>> > manipulations in cond_wait() and cond_timedwait() but we are still
>> > seeing the problem.  The patches we currently are using (including debug
>> > cruft) are below.  Right now we see the assertion in
>> > _thread_sig_wrapper() firing, but if I remove that, one of the
>> > assertions in the condition variable code that check for threads not
>> > being on the right condition variable queue triggers instead.  Does
>> > anyone have any other ideas of how a thread could catch a signal while
>> > PS_RUNNING and on a condition variable queue?  (I'm also worried that
>> > the wait() functions assume that if the thread is interrupted, it's
>> > always not on the queue, but that doesn't seem to be the case for
>> > pthread_cancel() for example.)
>>
>> I'm not sure what's going on, but I do know that you can't call
>> pthread_cond_wait() from a signal handler.  If a thread is blocked
>> on (taking your example) condition variable c1, then a signal
>> interrupts it and it again blocks on condition variable c2, that
>> behavior is undefined (by POSIX).
>
> The behavior seems more to be this:
>
> - thread does pthread_cond_wait*(c1)
> - thread enqueued on c1
> - thread interrupted by a signal while on c1 but still in PS_RUNNING
> - thread saves state which excludes the PTHREAD_FLAGS_IN_CONDQ flag (among
>   others)
> - thread calls _cond_wait_backout() if state is PS_COND_WAIT (but it's not
>   in this case; this is the normal case though, which is why it's ok to
>   not save the CONDQ flag in the saved state above)
> - thread executes signal handler
> - thread restores state
> - pthread_cond_wait*() sees that interrupted is 0, so doesn't try to remove
>   the thread from the condition variable (also, PTHREAD_FLAGS_IN_CONDQ
>   isn't set either, so we can't detect this case that way)
> - thread returns from pthread_cond_wait() (maybe due to timeout, etc.)
> - thread calls pthread_cond_wait*(c2)
> - thread enqueued on c2
> - another thread does pthread_cond_broadcast(c2), and bewm
>
> My question is: is it possible for the thread to get interrupted and chosen
> to run a signal handler while it is on c1 somehow, given my patch to defer
> signals around the wait loops (and is that patch correct, btw, given the
> above scenario)?
>
>> Another thing to watch out for is longjmps out of signal handlers
>> after being interrupted while waiting on a condition variable.
>> I think libc_r should handle this, but there could be a bug
>> lurking in that respect.
>
> The thing to note about my assertion in _thread_sig_wrapper() about being
> on a condition variable queue and executing a handler is that it is placed
> after _cond_wait_backout() could be called (but won't be for PS_RUNNING),
> and before the signal handler itself is called.
>
>> I'll take a look at libc_r and see if I can spot anything obvious.
>
> Ok, thanks.  FWIW, it seems that on 5.3 with KSE, mono does much better,
> but we still see rare hangs, so it may be that whatever bug gets fixed here
> is present in libpthread on 5 as well.

You can check this thread if you are interested. It's not about libc_r,
but about Mono running on FreeBSD 5.3, where the threads get corrupted if
you run 'mono -pkg:foopkg foo.cs'.

http://lists.freebsd.org/pipermail/freebsd-threads/2004-October/thread.html#2540

If you know of other fixes, tricks, etc., it would be nice if you could
pass the info on to the bsd-sharp project[1]. Tom has more or less taken it
over for now while the maintainer of lang/mono is busy or has disappeared.
Mono works better in bsd-sharp's lang/mono than in FreeBSD's lang/mono.

[1] http://forge.novell.com/modules/xfmod/project/?bsd-sharp

Cheers,
Mezz

-- 
mezz7 at cox.net  -  mezz at FreeBSD.org
FreeBSD GNOME Team
http://www.FreeBSD.org/gnome/  -  gnome at FreeBSD.org