Infinite loop bug in libc_r on 4.x with condition variables and signals

Daniel Eischen deischen at freebsd.org
Wed Oct 20 14:39:41 PDT 2004


On Wed, 20 Oct 2004, John Baldwin wrote:

> We are trying to run mono on 4.x and are having problems with the process
> getting stuck spinning in an infinite loop.  After some debugging, we
> determined that the problem is that the condition variable thread queues are
> getting corrupted due to threads being added to a queue while they are
> already queued on another queue.  For example, if a thread is somehow on c1's
> queue but runs and blocks on c2, later when c1 tries to do a broadcast, it
> tries to remove all the waiters to wake them up doing something like:
>
> 	while ((head = TAILQ_FIRST(&c1->c_queue)) != NULL) {
> 		TAILQ_REMOVE(&c1->c_queue, head, sqe);
> 	}
>
> The problem is that since the thread was last added to c2's queue, the
> tqe_prev pointer in its sqe TAILQ_ENTRY points to an item on c2's list, and
> thus c1's c_queue.tqh_first pointer doesn't get updated by TAILQ_REMOVE, so
> the thread just "sticks" on c1's head pointer and the loop spins forever.
>
> We seem to have tracked this down to some sort of bug related to signals and
> condition variables.  It seems that we try to go handle a signal while we are
> on a condition variable queue, but not in PS_COND_WAIT, so
> _cond_wait_backout() is not called to remove the thread from the queue.  I
> tried deferring signals around the cond queue manipulations in cond_wait()
> and cond_timedwait() but we are still seeing the problem.  The patches we
> currently are using (including debug cruft) are below.  Right now we see the
> assertion in _thread_sig_wrapper() firing, but if I remove that, one of the
> assertions in the condition variable code that checks for threads not being
> on the right condition variable queue triggers instead.  Does anyone have any
> other ideas of how a thread could catch a signal while PS_RUNNING and on a
> condition variable queue?  (I'm also worried that the wait() functions assume
> that if the thread is interrupted, it is never on the queue, but that
> doesn't seem to be the case for pthread_cancel(), for example.)

I'm not sure what's going on, but I do know that you can't call
pthread_cond_wait() from a signal handler.  If a thread is blocked
on (taking your example) condition variable c1, and a signal then
interrupts it and its handler blocks on condition variable c2, that
behavior is undefined (by POSIX).
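
If a thread does end up linked on c2 while it is still on c1's queue,
the spin you describe follows directly from how TAILQ_REMOVE works.
Here is a minimal sketch outside of libc_r that shows it.  The names
fake_thread, fake_queue and stuck_iterations are made up for
illustration; only the sqe entry name is borrowed from your
description, and the loop is bounded so that it terminates.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a thread; only the queue linkage matters. */
struct fake_thread {
	TAILQ_ENTRY(fake_thread) sqe;
};

TAILQ_HEAD(fake_queue, fake_thread);

/*
 * Link one thread on q1, then on q2 without first removing it from q1,
 * and run a bounded version of the broadcast loop on q1.  Returns the
 * number of iterations in which q1 still had a thread at its head.
 */
int
stuck_iterations(int limit)
{
	struct fake_queue q1, q2;
	struct fake_thread t;
	struct fake_thread *head;
	int n = 0;

	assert(limit > 0);
	TAILQ_INIT(&q1);
	TAILQ_INIT(&q2);

	TAILQ_INSERT_TAIL(&q1, &t, sqe);
	TAILQ_INSERT_TAIL(&q2, &t, sqe);	/* bug: t is still linked on q1 */

	while (n < limit && (head = TAILQ_FIRST(&q1)) != NULL) {
		/*
		 * t's tqe_prev now points into q2, so this clears q2's
		 * head pointer and leaves q1's head pointing at t.
		 */
		TAILQ_REMOVE(&q1, head, sqe);
		n++;
	}
	return (n);
}
```

Without the limit the loop never exits, because each TAILQ_REMOVE
writes through the stale tqe_prev pointer and q1's first pointer is
never cleared.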

Another thing to watch out for is longjmps out of signal handlers
after being interrupted while waiting on a condition variable.
I think libc_r should handle this, but there could be a bug
lurking in that respect.
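
One way for an application to sidestep both hazards is to keep the
handler trivial: record the signal in a flag and return, then do the
real work (including any condition variable waits) from normal thread
context.  This is only a sketch of that general pattern, not a claim
about what mono does; got_signal, on_signal and install_and_raise are
hypothetical names, and SIGUSR1 is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler, read from normal (non-handler) context. */
static volatile sig_atomic_t got_signal = 0;

/* The handler only records the event: no blocking calls, no longjmp. */
static void
on_signal(int signo)
{
	(void)signo;
	got_signal = 1;
}

/*
 * Install the handler, send SIGUSR1 to the caller, and report whether
 * the flag was set.  Returns 1 on success and -1 if a call failed.
 */
int
install_and_raise(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_signal;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return (-1);

	got_signal = 0;
	if (raise(SIGUSR1) != 0)
		return (-1);

	/* raise() returns only after the handler has run. */
	return (got_signal ? 1 : -1);
}
```

The code that checks got_signal is then free to take mutexes and wait
on condition variables, since it is no longer running inside a handler.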

I'll take a look at libc_r and see if I can spot anything obvious.

-- 
Dan Eischen


