Infinite loop bug in libc_r on 4.x with condition variables and signals

Thu Oct 28 13:41:44 PDT 2004

On Wed, 27 Oct 2004 18:30:15 -0400 (EDT), Daniel Eischen  
<deischen at freebsd.org> wrote:

> On Wed, 27 Oct 2004, John Baldwin wrote:
>
>> On Thursday 21 October 2004 07:04 pm, Daniel Eischen wrote:
>> > On Thu, 21 Oct 2004, John Baldwin wrote:
>> > > The behavior seems more to be this:
>> > >
>> > > - thread does pthread_cond_wait*(c1)
>> > > - thread enqueued on c1
>> > > - thread interrupted by a signal while on c1 but still in PS_RUNNING
>> >
>> > This shouldn't happen when signals are deferred.  It should
>> > only happen when the state is PS_COND_WAIT after we've
>> > context switched to the scheduler.
>> >
>> > > - thread saves state which excludes the PTHREAD_FLAGS_IN_CONDQ flag
>> > > (among others)
>> >
>> > Right, because it assumes that the thread will be backed out of
>> > any mutex or CV queues prior to invoking the signal handler.
>> >
>> > > - thread calls _cond_wait_backout() if state is PS_COND_WAIT (but  
>> it's
>> > > not in - this case, this is the normal case though, which is why  
>> it's ok
>> > > to not save the CONDQ flag in the saved state above)
>> >
>> > Right.  The problem is, how is the thread getting setup for a signal
>> > while signals are deferred and the state has not yet been changed
>> > from PS_RUNNING to PS_COND_WAIT?
>> >
>> > > - thread executes signal handler
>> > > - thread restores state
>> > > - pthread_condwait*() see that interrupted is 0, so don't try to  
>> remove
>> > > the thread from the condition variable (also, PTHREAD_FLAGS_IN_CONDQ
>> > > isn't set either, so we can't detect this case that way)
>> > > - thread returns from pthread_cond_wait() (maybe due to timeout,  
>> etc.)
>> > > - thread calls pthread_cond_wait*(c2)
>> > > - thread enqueued on c2
>> > > - another thread does pthread_cond_broadcast(c2), and bewm
>> > >
>> > > My question is is it possible for the thread to get interrupted and
>> > > chosen to run a signal while it is on c1 somehow given my patch to  
>> defer
>> > > signals around the wait loops (and is that patch correct btw given  
>> the
>> > > above scenario?)
>> >
>> > Yes (and yes I think).  Defering signals just means that the signal  
>> handler
>> > won't try to install a signal frame on the current thread; instead it  
>> just
>> > queues the signal and the scheduler will pick it up and send it to the
>> > correct thread.
>> >
>> > I do think signals should be deferred for condition variables so
>> > that setting the thread state (to PS_COND_WAIT) is atomic.
>> >
>> > It's not obvious to be where the bug is.  If you had a simple
>> > test case to reproduce it that would help.
>>
>> FWIW, we are having (I think) the same problem on 5.3 with libpthread.   
>> The
>> panic there is in the mutex code about an assertion failing because a  
>> thread
>> is on a syncq when it is not supposed to be.
>
> David and I recently fixed some races in pthread_join() and
> pthread_exit() in -current libpthread.  Don't know if those
> were responsible...
>
> Here's a test program that shows correct behavior with both
> libc_r and libpthread in -current.

I am not a programmer, but wondering if this races fixed has to do with  
the SIGUSR1? Asking, because in  
lang/mono/files/patch-libgc_include_private_gcconfig.h is what fix the  
Mono build/install hang. If I remove this file and that races fixed should  
fix this? Maybe, I can backport the fixes in -CURRENT to RELENG_5 by  
myself and try to test it.

What lang/mono/files/patch-libgc_include_private_gcconfig.h has look like  
this:

=====================================
-#	define SIG_SUSPEND SIGUSR1
-#	define SIG_THR_RESTART SIGUSR2
+#	define SIG_SUSPEND SIGTSTP
+#	define SIG_THR_RESTART SIGCONT
=====================================

Cheers,
Mezz

-- 
mezz7 at cox.net  -  mezz at FreeBSD.org
FreeBSD GNOME Team
http://www.FreeBSD.org/gnome/  -  gnome at FreeBSD.org