Mozilla problems on current -- fix

Mon Mar 1 15:09:38 PST 2004

On Monday 01 March 2004 03:14 pm, Danny J. Zerkel wrote:
> Looks like John's sys/kern/kern_thread.c 1.171 broke Mozilla (and
> threading in general).  He appears to have left out a test that
> the sleeping thread is interruptable in one spot:
>
> --- kern_thread.c.orig  Mon Mar  1 15:08:01 2004
> +++ kern_thread.c       Mon Mar  1 14:26:30 2004
> @@ -646,7 +646,8 @@
>                  } else if (TD_ON_SLEEPQ(td2) &&
>                             ((td2->td_wchan == &kg->kg_completed) ||
>                              (td2->td_wchan == &p->p_siglist &&
> -                            (ku->ku_mflags & KMF_WAITSIGEVENT)))) {
> +                            (ku->ku_mflags & KMF_WAITSIGEVENT))) &&
> +                             (td2->td_flags & TDF_SINTR)) {
>                          sleepq_abort(td2);
>                  } else {
>                          ku->ku_flags |= KUF_DOUPCALL;
>
> This change fixes my Mozilla hangs and other panics on current.

This should not matter.  This is the only place that doesn't check TDF_SINTR, 
true, but it should always be set in these cases.  First, look in 
kern_thread.c where we msleep() on these addresses (p_siglist and 
kg_completed):

                if (!(p->p_flag & P_SIGEVENT) && !(ku->ku_flags & 
KUF_DOUPCALL))
                        error = msleep(&p->p_siglist, &p->p_mtx, PPAUSE|
PCATCH,
                            "ksesigwait", (uap->timeout ? tvtohz(&tv) : 0));
                p->p_flag &= ~P_SIGEVENT;

Note that PCATCH is set.

                        error = msleep(&kg->kg_completed, &p->p_mtx,
                                PPAUSE|PCATCH, "kserel",
                                (uap->timeout ? tvtohz(&tv) : 0));
                        kg->kg_upsleeps--;

Again, PCATCH is set.  Now onto kern_synch.c and msleep():

	catch = priority & PCATCH;
	...
        sq = sleepq_lookup(ident);
	...
        sleepq_add(sq, ident, mtx, wmesg, 0);
        if (timo)
                sleepq_set_timeout(sq, ident, timo);
        if (catch) {
                sig = sleepq_catch_signals(ident);
                if (sig == 0 && !TD_ON_SLEEPQ(td)) {
                        mtx_lock_spin(&sched_lock);
                        td->td_flags &= ~TDF_SINTR;
                        mtx_unlock_spin(&sched_lock);
                        catch = 0;
                }
        } else
                sig = 0;

	...
	sleepq_wait_foo(ident);

Thus, we see that if PCATCH is set, we call sleepq_add() and then 
sleepq_catch_signals().  Bah, I see.  Only the sleep queue lock is atomic for 
the whole operation and not sched_lock.  While your patch might fix the 
assertion for this KSE case, there might be a larger set or problems with the 
race here, which is that sched_lock can be dropped in between td_wchan being 
set and TDF_SINTR being set.  I.e, if you get an abort in between the 
sleepq_add() and sleepq_catch_signals() and just use sched_lock it will get 
lost when it shouldn't.  For signals, I think the call to cursig() will still 
work ok.  I'm not sure if sleepq_catch_signals() calls anything that checks 
for KUF_DOUPCALL.  Let me check.  Ugh, this code is too hard to follow.  I 
have no idea if this is actually safe or not. :(  I can commit the above as a 
band-aid I guess.  I don't know why the KSE code doesn't just use a wakeup 
here, the sleeps are in top-level functions and not buried deep.

-- 
John Baldwin <jhb at FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org