head -r344018 based powerpc64 pmac_thermal hangup (stuck sleeping): some preliminary evidence [not as uniform as I initially saw]

Mark Millard marklmi at yahoo.com
Mon Feb 25 23:50:03 UTC 2019


[I've now seen examples of sbt==0xfffffed8 that did not lead to
a hangup.]

On 2019-Feb-25, at 14:46, Mark Millard <marklmi at yahoo.com> wrote:

> I adjusted what KTR_PROC does to just show just some of its pid 26
> (pmac_thermal) messages and adding some extra output as well. I'll
> list some of that output later. I'll note that beyond pmac_thermal
> the buf*daemon* threads also seem to be subject to being stuck
> sleeping in (offsets are for a specific build of mine):
> 
> mi_switch+0x134 sleepq_switch+0x2ec sleepq_timedwait+0x48 _sleep+0x41c
> 
> 
> So far for pmac_thermal I've seen that until the failing case:
> 
> sleepq_set_timeout_sbt was being given: sbt==0xcccccbe0 pr==0x0 flags==0x100
> and in turn was using: prec==0xcccccbe flags==0x501 (of course the used
> td_sleeptimo varies).
> 
> [I note that 16*0xcccccbe == 0xcccccbe0, the original sbt value,
> not that I know yet if this matters.]
> 
> But the sequence leading to failures is different:

I've now seen examples of sbt==0xfffffed8 that did not lead to
a hangup. So it is not a reliable predictor of the hang-up in
sleep. I'm trying to see if I can observe a failure with
different value.

> sleepq_set_timeout_sbt was given: sbt==0xfffffed8 pr==0x0 flags==0x100
> and in turn was using: prec==0xfffffed flags==0x501
> 
> [I note that 16*0xfffffed == 0xfffffed0, so less than the original
> sbt value, not that I know this matters at this point.]
> 
> For sbt==0xfffffed8, the callout to sleepq_timeout ends up with values
> like (a particular example):
> 
> td_sleeptimo=0x470d360fe5 sbinuptime=0x46c869f6aa
> 
> where the reporting code looks like:
> 
> static void
> sleepq_timeout(void *arg)
> {
>        struct sleepqueue_chain *sc __unused;
>        struct sleepqueue *sq;
>        struct thread *td;
>        void *wchan;
>        int wakeup_swapper;
>        sbintime_t sbut; // HACK!!!
> 
>        td = arg;
>        wakeup_swapper = 0;
>        if (26 == td->td_proc->p_pid) // HACK!!!
>        CTR3(KTR_PROC, "sleepq_timeout: thread %p (pid %ld, %s)",
>            (void *)td, (long)td->td_proc->p_pid, (void *)td->td_name);
> 
>        thread_lock(td);
> 
>        sbut= sbinuptime(); // HACK!!!
>        if (td->td_sleeptimo > sbut || td->td_sleeptimo == 0) {
>                /*
>                 * The thread does not want a timeout (yet).
>                 */
>                if (26 == td->td_proc->p_pid) // HACK!!!
>                        CTR5(KTR_PROC, "sleepq_timeout thread not want timeout yet: thread %p (pid %ld, %s) td_sleeptimo=%jx sbinuptime=%jx",
>                            (void *)td, (long)td->td_proc->p_pid, (void *)td->td_name, (uintmax_t)td->td_sleeptimo, (uintmax_t)sbut);
> . . .
> 
> So far sleepq_set_timeout_sbt being given sbt==0xfffffed8 instead of
> sbt==0xcccccbe0 seems to be an accurate indicator of if the problem will
> happen in sleepq_timeout. (But I've only a few examples so far.)
> 

I've now seen examples of sbt==0xfffffed8 that did not lead to
a hangup. So it is not a reliable predictor of the hang-up in
sleep.

It looks like the values are sometimes more varied than I'd
seen before as well.

> I'll note that the sleepq_timeout code for this case does not set up
> another callout to itself for later and the sleep then continues
> indefinitely.
> 
> I've not yet gotten into finding evidence for why the callout to
> sleepq_timeout itself happens. Hopefully I can find some.
> 
> 
> An example of some modified KTR_PROC output is:
> 
> . . .



===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)



More information about the freebsd-hackers mailing list