powerpc64 on PowerMac G5 4-core (system total): a hack that so far seem to avoid the stuck-sleeping issue [self-hosted buildworld/buildkernel completed]

Mon Mar 4 09:40:32 UTC 2019

[I did some testing of other figures than testing for < 0x10.]

On 2019-Mar-3, at 13:23, Mark Millard <marklmi at yahoo.com> wrote:

> [So far the hack has been successful. Details given later
> below.]
> 
> On 2019-Mar-2, at 21:20, Mark Millard <marklmi at yahoo.com> wrote:
> 
>> [This note goes in a different direction compared to my
>> prior evidence report for overflows and the later activity
>> that has been happening for it. This does *not* involve
>> the patches associated with that report.]
>> 
>> I view the following as an evidence-gathering hack:
>> showing the change in behavior with the code changes,
>> not as directly what FreeBSD should do for powerpc64.
>> In code for defined(__powerpc64__) && defined(AIM)
>> I freely use knowledge of the PowerMac G5 context
>> instead of attempting general code.
>> 
>> Also: the code is set up to record some information
>> that I've been looking at via ddb. The recording is
>> not part of what changes the behavior but I decided
>> to show that code too.
>> 
>> It is preliminary, but, so far, the hack has avoided
>> buf*daemon* threads and pmac_thermal getting stuck
>> sleeping (or, at least, far less frequently).
>> 
>> 
>> The tbr-value hack:
>> 
>> From what I see the G5 various cores have each tbr running at the
>> same rate but have some some offsets as far as the base time
>> goes. cpu_mp_unleash does:
>> 
>>       ap_awake = 1;
>> 
>>       /* Provide our current DEC and TB values for APs */
>>       ap_timebase = mftb() + 10;
>>       __asm __volatile("msync; isync");
>> 
>>       /* Let APs continue */
>>       atomic_store_rel_int(&ap_letgo, 1);
>> 
>>       platform_smp_timebase_sync(ap_timebase, 0);
>> 
>> and machdep_ap_bootstrap does:
>> 
>>       /*
>>        * Set timebase as soon as possible to meet an implicit rendezvous
>>        * from cpu_mp_unleash(), which sets ap_letgo and then immediately
>>        * sets timebase.
>>        *
>>        * Note that this is instrinsically racy and is only relevant on
>>        * platforms that do not support better mechanisms.
>>        */
>>       platform_smp_timebase_sync(ap_timebase, 1);
>> 
>> 
>> which attempts to set the tbrs appropriately.
>> 
>> But on small scales of differences the various tbr
>> values from different cpus end up not well ordered
>> relative to time, synchronizes with, and the like.
>> Only large enough differences can well indicate an
>> ordering of interest.
>> 
>> Note: tc->tc_get_timecount(tc) only provides the
>> least signficant 32 bits of the tbr value.
>> th->th_offset_count is also 32 bits and based on
>> truncated tbr values.
>> 
>> So I made binuptime avoid finishing when it sees
>> a small (<0x10) step backwards for a new
>> tc->tc_get_timecount(tc) value vs. the existing
>> th->th_offset_count value (values strongly tied
>> to powerpc64 tbr values):
>> 
>> void
>> binuptime(struct bintime *bt)
>> {
>>       struct timehands *th;
>>       u_int gen;
>> 
>>       struct bintime old_bt= *bt; // HACK!!!
>>       struct timecounter *tc; // HACK!!!
>>       u_int tim_cnt, tim_offset, tim_diff; // HACK!!!
>>       uint64_t freq, scale_factor, diff_scaled; // HACK!!!
>> 
>>       u_int try_cnt= 0ull; // HACK!!!
>> 
>>       do {
>>               do { // HACK!!!
>>                   th = timehands;
>>                   tc = th->th_counter;
>>                   gen = atomic_load_acq_int(&th->th_generation);
>>                   tim_cnt= tc->tc_get_timecount(tc);
>>                   tim_offset= th->th_offset_count;
>>               } while (tim_cnt<tim_offset && tim_offset-tim_cnt<0x10);
>>               *bt = th->th_offset;
>>               tim_diff= (tim_cnt - tim_offset) & tc->tc_counter_mask;
>>               scale_factor= th->th_scale;
>>               diff_scaled= scale_factor * tim_diff;
>>               bintime_addx(bt, diff_scaled);
>>               freq= tc->tc_frequency;
>>               atomic_thread_fence_acq();
>>               try_cnt++;
>>       } while (gen == 0 || gen != th->th_generation);
>> 
>>       if (*(volatile uint64_t*)0xc000000000000020==0u && (0xffffffffffffffffull/scale_factor)<tim_diff) { // HACK!!!
>>               *(volatile uint64_t*)0xc000000000000020= bttosbt(old_bt);
>>               *(volatile uint64_t*)0xc000000000000028= bttosbt(*bt);
>>               *(volatile uint64_t*)0xc000000000000030= freq;
>>               *(volatile uint64_t*)0xc000000000000038= scale_factor;
>>               *(volatile uint64_t*)0xc000000000000040= tim_offset;
>>               *(volatile uint64_t*)0xc000000000000048= tim_cnt;
>>               *(volatile uint64_t*)0xc000000000000050= tim_diff;
>>               *(volatile uint64_t*)0xc000000000000058= try_cnt;
>>               *(volatile uint64_t*)0xc000000000000060= diff_scaled;
>>               *(volatile uint64_t*)0xc000000000000068= scale_factor*freq;
>>               __asm__ ("sync");
>>       } else if (*(volatile uint64_t*)0xc0000000000000a0==0u && (0xffffffffffffffffull/scale_factor)<tim_diff) { // HACK!!!
>>               *(volatile uint64_t*)0xc0000000000000a0= bttosbt(old_bt);
>>               *(volatile uint64_t*)0xc0000000000000a8= bttosbt(*bt);
>>               *(volatile uint64_t*)0xc0000000000000b0= freq;
>>               *(volatile uint64_t*)0xc0000000000000b8= scale_factor;
>>               *(volatile uint64_t*)0xc0000000000000c0= tim_offset;
>>               *(volatile uint64_t*)0xc0000000000000c8= tim_cnt;
>>               *(volatile uint64_t*)0xc0000000000000d0= tim_diff;
>>               *(volatile uint64_t*)0xc0000000000000d8= try_cnt;
>>               *(volatile uint64_t*)0xc0000000000000e0= diff_scaled;
>>               *(volatile uint64_t*)0xc0000000000000e8= scale_factor*freq;
>>               __asm__ ("sync");
>>       }
>> }
>> #else
>> . . .
>> #endif
>> 
>> So far as I can tell, the FreeBSD code is not designed to deal
>> with small differences in tc->tc_get_timecount(tc) not actually
>> indicating a useful < vs. == vs. > ordering relation uniquely.
>> 
>> (I make no claim that the hack is a proper way to deal with
>> such.)
> 
> I did a somewhat over 7 hours buildworld buildkernel on the
> PowerMac G5. Overall the G5 has been up over 13 hours and
> none of the buf*daemon* threads have gotten stuck sleeping.
> Nor has pmac_thermal gotten stuck. Similarly for vnlru
> and syncer: "top -HIStopid" still shows them all as
> periodically active.
> 
> Previously for this usefdt=1 context (with the modern
> VM_MAX_KERNEL_ADDRESS), going more than a few minutes
> without at least one of those threads getting stuck
> sleeping was rare on the G5 (powerpc64 example).
> 
> So this hack has managed to avoid finding sbinuptime()
> in sleepq_timeout being less than the earlier (by call
> structure/code sequencing) sbinuptime() in timercb that
> lead to the sleepq_timeout callout being called in the
> first place.
> 
> So in the sleepq_timeout callout's:
> 
>        if (td->td_sleeptimo > sbinuptime() || td->td_sleeptimo == 0) {
>                /*
>                 * The thread does not want a timeout (yet).
>                 */
>        } else . . .
> 
> td->td_sleeptimo > sbinuptime() ends up false now for small
> enough original differences.
> 
> This case does not set up another timeout, it just leaves the
> thread stuck sleeping, no longer doing periodic activities.
> 
> As stands what I did (presuming an appropriate definition
> of "small differences in the problematical direction") should
> leave this and other sbinuptime-using code with:
> 
> td->td_sleeptimo <= sbinuptime()
> 
> for what were originally "small" tbr value differences in the
> problematical direction (in case other places require it in
> some way).
> 
> If, instead, just sleepq_timeout's test could allow for
> some slop in the ordering, it could be a cheaper hack then
> looping in binuptime .
> 
> At this point I've no clue what a correct/efficient FreeBSD
> design for allowing the sloppy match across tbr's for different
> CPUs would be.

Instead of 0x10 in "&& tim_offset-tim_cnt<0x10" I tried
the each of following and they all failed:

&& tim_offset-tim_cnt<0x2
&& tim_offset-tim_cnt<0x4
&& tim_offset-tim_cnt<0x8
&& tim_offset-tim_cnt<0xc

0x2, 0x4, and 0x8 failed for the first boot attempt,
almost mediately having stuck-in-sleep threads.

0xc seemed to be working for the first boot (including
a buildworld buildkernel that did not have to rebuild
much). But the 2nd boot attempt had a stuck-in-sleep
thread by the time I logged in.

By contrast, for:

&& tim_offset-tim_cnt<0x10

I've not it fail so far, after many reboots, a full
buildworld buildkernel, and running over 24 hours
(that included the somewhat over 7 hours for build
world buildkernel). But it might be that some boots
would need a bigger figure.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)