powerpc64 on PowerMac G5 4-core (system total): a hack that so far seems to avoid the stuck-sleeping issue [self-hosted buildworld/buildkernel completed]

Mark Millard marklmi at yahoo.com
Wed Mar 6 21:03:53 UTC 2019


[I have a new observed maximum difference, having changed
the code to record such.]

On 2019-Mar-4, at 01:40, Mark Millard <marklmi at yahoo.com> wrote:

> [I did some testing of other figures than testing for < 0x10.]
> 
> On 2019-Mar-3, at 13:23, Mark Millard <marklmi at yahoo.com> wrote:
> 
>> [So far the hack has been successful. Details given later
>> below.]
>> 
>> On 2019-Mar-2, at 21:20, Mark Millard <marklmi at yahoo.com> wrote:
>> 
>>> [This note goes in a different direction compared to my
>>> prior evidence report for overflows and the later activity
>>> that has been happening for it. This does *not* involve
>>> the patches associated with that report.]
>>> 
>>> I view the following as an evidence-gathering hack:
>>> showing the change in behavior with the code changes,
>>> not as directly what FreeBSD should do for powerpc64.
>>> In code for defined(__powerpc64__) && defined(AIM)
>>> I freely use knowledge of the PowerMac G5 context
>>> instead of attempting general code.
>>> 
>>> Also: the code is set up to record some information
>>> that I've been looking at via ddb. The recording is
>>> not part of what changes the behavior but I decided
>>> to show that code too.
>>> 
>>> It is preliminary, but, so far, the hack has kept the
>>> buf*daemon* threads and pmac_thermal from getting stuck
>>> sleeping (or has at least made that far less frequent).
>>> 
>>> 
>>> The tbr-value hack:
>>> 
>>> From what I see, the G5's various cores each have their tbr
>>> running at the same rate, but with some offsets in where the
>>> base time starts. cpu_mp_unleash does:
>>> 
>>>      ap_awake = 1;
>>> 
>>>      /* Provide our current DEC and TB values for APs */
>>>      ap_timebase = mftb() + 10;
>>>      __asm __volatile("msync; isync");
>>> 
>>>      /* Let APs continue */
>>>      atomic_store_rel_int(&ap_letgo, 1);
>>> 
>>>      platform_smp_timebase_sync(ap_timebase, 0);
>>> 
>>> and machdep_ap_bootstrap does:
>>> 
>>>      /*
>>>       * Set timebase as soon as possible to meet an implicit rendezvous
>>>       * from cpu_mp_unleash(), which sets ap_letgo and then immediately
>>>       * sets timebase.
>>>       *
>>>       * Note that this is intrinsically racy and is only relevant on
>>>       * platforms that do not support better mechanisms.
>>>       */
>>>      platform_smp_timebase_sync(ap_timebase, 1);
>>> 
>>> 
>>> which attempts to set the tbrs appropriately.
>>> 
>>> But at small scales of difference, the various tbr
>>> values from different cpus end up not being well ordered
>>> relative to time, synchronizes-with, and the like.
>>> Only large enough differences can reliably indicate an
>>> ordering of interest.
>>> 
>>> Note: tc->tc_get_timecount(tc) only provides the
>>> least significant 32 bits of the tbr value.
>>> th->th_offset_count is also 32 bits and based on
>>> truncated tbr values.
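>>> 
>>> As a minimal, user-space-only sketch of why such a small backwards
>>> step is a problem (the values below are made up; only the field
>>> names and the masked subtraction mirror what binuptime does with
>>> the truncated counts, assuming an all-ones tc_counter_mask):
>>> 
>>> #include <stdint.h>
>>> #include <stdio.h>
>>> 
>>> int
>>> main(void)
>>> {
>>>         uint32_t counter_mask = 0xffffffffu; /* assumed tc_counter_mask */
>>>         uint32_t tim_offset   = 0x12345678u; /* th_offset_count (example) */
>>>         uint32_t tim_cnt      = tim_offset - 0x11u; /* read 0x11 ticks "behind" */
>>> 
>>>         /* The same masked subtraction binuptime() does: */
>>>         uint32_t tim_diff = (tim_cnt - tim_offset) & counter_mask;
>>> 
>>>         /* Prints 0xffffffef: a tiny backwards step looks like nearly
>>>            2**32 ticks forward, which then gets scaled and added to
>>>            th_offset instead of being treated as "no time passed". */
>>>         printf("tim_diff = 0x%08x\n", tim_diff);
>>>         return (0);
>>> }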
>>> 
>>> So I made binuptime avoid finishing when it sees
>>> a small (<0x10) step backwards for a new
>>> tc->tc_get_timecount(tc) value vs. the existing
>>> th->th_offset_count value (values strongly tied
>>> to powerpc64 tbr values):
>>> 
>>> . . . [old code omitted] . . .
>>> 
>>> So far as I can tell, the FreeBSD code is not designed to deal
>>> with small differences in tc->tc_get_timecount(tc) not actually
>>> indicating a useful < vs. == vs. > ordering relation uniquely.
>>> 
>>> (I make no claim that the hack is a proper way to deal with
>>> such.)
>> 
>> I did a somewhat-over-7-hour buildworld buildkernel on the
>> PowerMac G5. Overall the G5 has been up over 13 hours and
>> none of the buf*daemon* threads have gotten stuck sleeping.
>> Nor has pmac_thermal gotten stuck. Similarly for vnlru
>> and syncer: "top -HIStopid" still shows them all as
>> periodically active.
>> 
>> Previously for this usefdt=1 context (with the modern
>> VM_MAX_KERNEL_ADDRESS), going more than a few minutes
>> without at least one of those threads getting stuck
>> sleeping was rare on the G5 (powerpc64 example).
>> 
>> So this hack has managed to avoid finding sbinuptime()
>> in sleepq_timeout being less than the earlier (by call
>> structure/code sequencing) sbinuptime() in timercb that
>> led to the sleepq_timeout callout being called in the
>> first place.
>> 
>> So in the sleepq_timeout callout's:
>> 
>>       if (td->td_sleeptimo > sbinuptime() || td->td_sleeptimo == 0) {
>>               /*
>>                * The thread does not want a timeout (yet).
>>                */
>>       } else . . .
>> 
>> td->td_sleeptimo > sbinuptime() ends up false now for small
>> enough original differences.
>> 
>> This case does not set up another timeout; it just leaves the
>> thread stuck sleeping, no longer doing its periodic activities.
>> 
>> As it stands, what I did (presuming an appropriate definition
>> of "small differences in the problematical direction") should
>> leave this and other sbinuptime-using code with:
>> 
>> td->td_sleeptimo <= sbinuptime()
>> 
>> for what were originally "small" tbr value differences in the
>> problematical direction (in case other places require it in
>> some way).
>> 
>> If, instead, just sleepq_timeout's test could allow for
>> some slop in the ordering, it could be a cheaper hack than
>> looping in binuptime.
>> 
>> At this point I've no clue what a correct/efficient FreeBSD
>> design for allowing the sloppy match across tbr's for different
>> CPUs would be.
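>> 
>> Just to illustrate the kind of thing I mean by slop (the constant
>> name and value here are made up for illustration, not something
>> I have tested):
>> 
>> /* Hypothetical tolerance for cross-CPU tbr skew, applied at the
>>    comparison site instead of looping in binuptime(): */
>> #define SLEEPQ_SBT_SLOP (2 * SBT_1US)   /* would need tuning */
>> 
>>       if (td->td_sleeptimo > sbinuptime() + SLEEPQ_SBT_SLOP ||
>>           td->td_sleeptimo == 0) {
>>               /*
>>                * The thread does not want a timeout (yet).
>>                */
>>       } else . . .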
> 
> Instead of 0x10 in "&& tim_offset-tim_cnt<0x10" I tried
> each of the following, and they all failed:
> 
> && tim_offset-tim_cnt<0x2
> && tim_offset-tim_cnt<0x4
> && tim_offset-tim_cnt<0x8
> && tim_offset-tim_cnt<0xc

I've now seen a difference of 0x11 that led to hung-up
threads, stuck waiting in sleep.

> 0x2, 0x4, and 0x8 failed for the first boot attempt,
> almost immediately having stuck-in-sleep threads.
> 
> 0xc seemed to be working for the first boot (including
> a buildworld buildkernel that did not have to rebuild
> much). But the 2nd boot attempt had a stuck-in-sleep
> thread by the time I logged in.
> 
> By contrast, for:
> 
> && tim_offset-tim_cnt<0x10
> 
> I've not seen it fail so far, after many reboots, a full
> buildworld buildkernel, and running over 24 hours
> (which included the somewhat over 7 hours for the
> buildworld buildkernel). But it might be that some boots
> would need a bigger figure.
> 

During a ports-mgmt/poudriere-devel run I had some
threads hang in sleep when the code was based on
less-than-0x10 differences. But I had changed the code
to record the maximum "small difference in the
problematical direction" observed, and so was able
to see that it hit a:

0x11

difference.

Below is the newer code structure as far as what is
recorded. It now uses 0x14 instead of 0x10 for the
bound that controls the loop. I omitted
#if 0 . . . #endif code that I'm not currently using.

#if defined(__powerpc64__) && defined(AIM)
void
binuptime(struct bintime *bt)
{
        struct timehands *th;
        u_int gen;

        u_int tim_cnt, tim_offset; // HACK!!! (for the "small difference in problematical direction" loop)

        struct timecounter *tc; // HACK!!! (for recording other data for inspection via ddb)
        u_int tim_diff; // HACK!!!
        uint64_t scale_factor, diff_scaled; // HACK!!!

#if 1
        u_int tim_wrong_order_diff= 0u; // HACK!!!
        u_int max_wrong_order_diff= 0u; // HACK!!!
        u_int wrong_order_cnt=      0u; // HACK!!!
        u_int wrong_order_offset=   0u; // HACK!!!
#endif

        do {
                do { // HACK!!!
                    th=  timehands;
                    tc=  th->th_counter;
                    gen= atomic_load_acq_int(&th->th_generation);
                    tim_cnt=    tc->tc_get_timecount(tc);
                    tim_offset= th->th_offset_count;
#if 1
                    tim_wrong_order_diff= tim_offset-tim_cnt;
                    if (  tim_cnt<tim_offset
                       && tim_wrong_order_diff<0x100u
                       && max_wrong_order_diff<tim_wrong_order_diff
                       ) {
                        wrong_order_cnt=      tim_cnt;
                        wrong_order_offset=   tim_offset;
                        max_wrong_order_diff= tim_wrong_order_diff;
                    }
#endif
                } while (tim_cnt<tim_offset && tim_wrong_order_diff<0x14u); // HACK!!!
                *bt = th->th_offset;
                tim_diff= (tim_cnt - tim_offset) & tc->tc_counter_mask;
                scale_factor= th->th_scale;
                diff_scaled= scale_factor * tim_diff;
                bintime_addx(bt, diff_scaled);
                atomic_thread_fence_acq();
        } while (gen == 0 || gen != th->th_generation);

#if 1
        // Uses direct-map addresses (mapping to physical with the most significant
        // 0xc masked off). Justin H. reported that some of the 0x0..0xff addresses
        // were unused and available. The 2 larger ranges that I observed to stay at
        // zero were 0x20..0x7f and 0xa0..0xff -- so that is what I limited myself to.
        if (*(volatile uint64_t*)0xc0000000000000b0<max_wrong_order_diff) { // HACK!!!
                *(volatile uint64_t*)0xc0000000000000a0= wrong_order_cnt;
                *(volatile uint64_t*)0xc0000000000000a8= wrong_order_offset;
                *(volatile uint64_t*)0xc0000000000000b0= max_wrong_order_diff;
        }
#endif
}
#else
. . .
#endif
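
For reference, the recorded values can be looked at from ddb by
examining those direct-map addresses, something like (if I have
the examine syntax right, with g asking for 64-bit words shown
in hex):

db> x/gx 0xc0000000000000a0,3

That should show wrong_order_cnt, wrong_order_offset, and
max_wrong_order_diff, in that order.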

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)


