powerpc64 on PowerMac G5 4-core (system total): a hack that so far seems to avoid the stuck-sleeping issue [self-hosted buildworld/buildkernel completed]

Mark Millard marklmi at yahoo.com
Sun Mar 3 21:33:40 UTC 2019


[So far the hack has been successful. Details given later
below.]

On 2019-Mar-2, at 21:20, Mark Millard <marklmi at yahoo.com> wrote:

> [This note goes in a different direction compared to my
> prior evidence report for overflows and the later activity
> that has been happening for it. This does *not* involve
> the patches associated with that report.]
> 
> I view the following as an evidence-gathering hack:
> it shows the change in behavior with the code changes,
> not what FreeBSD should directly do for powerpc64.
> In code for defined(__powerpc64__) && defined(AIM)
> I freely use knowledge of the PowerMac G5 context
> instead of attempting general code.
> 
> Also: the code is set up to record some information
> that I've been looking at via ddb. The recording is
> not part of what changes the behavior but I decided
> to show that code too.
> 
> It is preliminary but, so far, the hack has avoided the
> buf*daemon* threads and pmac_thermal getting stuck
> sleeping (or it has at least made that far less frequent).
> 
> 
> The tbr-value hack:
> 
> From what I see, the various G5 cores each have their tbr
> running at the same rate but with some offsets in the base
> time. cpu_mp_unleash does:
> 
>        ap_awake = 1;
> 
>        /* Provide our current DEC and TB values for APs */
>        ap_timebase = mftb() + 10;
>        __asm __volatile("msync; isync");
> 
>        /* Let APs continue */
>        atomic_store_rel_int(&ap_letgo, 1);
> 
>        platform_smp_timebase_sync(ap_timebase, 0);
> 
> and machdep_ap_bootstrap does:
> 
>        /*
>         * Set timebase as soon as possible to meet an implicit rendezvous
>         * from cpu_mp_unleash(), which sets ap_letgo and then immediately
>         * sets timebase.
>         *
>         * Note that this is instrinsically racy and is only relevant on
>         * platforms that do not support better mechanisms.
>         */
>        platform_smp_timebase_sync(ap_timebase, 1);
> 
> 
> which attempts to set the tbrs appropriately.
> 
> But for small differences, the tbr values read on different
> CPUs end up not being well ordered relative to time,
> synchronizes-with relationships, and the like. Only large
> enough differences reliably indicate an ordering of
> interest.
> 
> Note: tc->tc_get_timecount(tc) only provides the
> least significant 32 bits of the tbr value.
> th->th_offset_count is also 32 bits and is based on
> truncated tbr values.
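> 
> As a concrete illustration of why a tiny backwards step in the
> truncated tbr matters here, consider the following stand-alone
> sketch (illustration only, not kernel code; the tick values and
> the ~33.3 MHz rate are invented numbers, 33.3 MHz simply being a
> typical G5 timebase rate):
> 
>        #include <stdint.h>
>        #include <stdio.h>
> 
>        int
>        main(void)
>        {
>                uint32_t tim_offset = 0x10000008u; /* captured via CPU A's tbr */
>                uint32_t tim_cnt    = 0x10000005u; /* read 3 ticks "behind" on CPU B */
>                uint32_t mask       = 0xffffffffu; /* tc_counter_mask for 32-bit counts */
>                uint64_t freq       = 33333333u;   /* illustrative tbr rate (Hz) */
>                uint64_t scale      = ((uint64_t)1 << 63) / freq * 2; /* ~ 2**64/freq */
> 
>                /* The masked delta wraps to nearly 2**32 ticks... */
>                uint32_t tim_diff = (tim_cnt - tim_offset) & mask;
>                printf("tim_diff = 0x%jx (~%ju seconds worth of ticks)\n",
>                    (uintmax_t)tim_diff, (uintmax_t)(tim_diff / freq));
> 
>                /* ...and scale * tim_diff needs about 71 bits, so the 64-bit
>                   product wraps as well, corrupting the bintime_addx() step. */
>                printf("scale * tim_diff overflows 64 bits: %s\n",
>                    tim_diff > UINT64_MAX / scale ? "yes" : "no");
>                return (0);
>        }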
> 
> So I made binuptime avoid finishing when it sees
> a small (<0x10) step backwards for a new
> tc->tc_get_timecount(tc) value vs. the existing
> th->th_offset_count value (values strongly tied
> to powerpc64 tbr values):
> 
> void
> binuptime(struct bintime *bt)
> {
>        struct timehands *th;
>        u_int gen;
> 
>        struct bintime old_bt= *bt; // HACK!!!
>        struct timecounter *tc; // HACK!!!
>        u_int tim_cnt, tim_offset, tim_diff; // HACK!!!
>        uint64_t freq, scale_factor, diff_scaled; // HACK!!!
> 
>        u_int try_cnt= 0ull; // HACK!!!
> 
>        do {
>                do { // HACK!!!
>                    th = timehands;
>                    tc = th->th_counter;
>                    gen = atomic_load_acq_int(&th->th_generation);
>                    tim_cnt= tc->tc_get_timecount(tc);
>                    tim_offset= th->th_offset_count;
>                } while (tim_cnt<tim_offset && tim_offset-tim_cnt<0x10);
>                *bt = th->th_offset;
>                tim_diff= (tim_cnt - tim_offset) & tc->tc_counter_mask;
>                scale_factor= th->th_scale;
>                diff_scaled= scale_factor * tim_diff;
>                bintime_addx(bt, diff_scaled);
>                freq= tc->tc_frequency;
>                atomic_thread_fence_acq();
>                try_cnt++;
>        } while (gen == 0 || gen != th->th_generation);
> 
>        if (*(volatile uint64_t*)0xc000000000000020==0u && (0xffffffffffffffffull/scale_factor)<tim_diff) { // HACK!!!
>                *(volatile uint64_t*)0xc000000000000020= bttosbt(old_bt);
>                *(volatile uint64_t*)0xc000000000000028= bttosbt(*bt);
>                *(volatile uint64_t*)0xc000000000000030= freq;
>                *(volatile uint64_t*)0xc000000000000038= scale_factor;
>                *(volatile uint64_t*)0xc000000000000040= tim_offset;
>                *(volatile uint64_t*)0xc000000000000048= tim_cnt;
>                *(volatile uint64_t*)0xc000000000000050= tim_diff;
>                *(volatile uint64_t*)0xc000000000000058= try_cnt;
>                *(volatile uint64_t*)0xc000000000000060= diff_scaled;
>                *(volatile uint64_t*)0xc000000000000068= scale_factor*freq;
>                __asm__ ("sync");
>        } else if (*(volatile uint64_t*)0xc0000000000000a0==0u && (0xffffffffffffffffull/scale_factor)<tim_diff) { // HACK!!!
>                *(volatile uint64_t*)0xc0000000000000a0= bttosbt(old_bt);
>                *(volatile uint64_t*)0xc0000000000000a8= bttosbt(*bt);
>                *(volatile uint64_t*)0xc0000000000000b0= freq;
>                *(volatile uint64_t*)0xc0000000000000b8= scale_factor;
>                *(volatile uint64_t*)0xc0000000000000c0= tim_offset;
>                *(volatile uint64_t*)0xc0000000000000c8= tim_cnt;
>                *(volatile uint64_t*)0xc0000000000000d0= tim_diff;
>                *(volatile uint64_t*)0xc0000000000000d8= try_cnt;
>                *(volatile uint64_t*)0xc0000000000000e0= diff_scaled;
>                *(volatile uint64_t*)0xc0000000000000e8= scale_factor*freq;
>                __asm__ ("sync");
>        }
> }
> #else
> . . .
> #endif
> 
> So far as I can tell, the FreeBSD code is not designed to deal
> with cases where small differences in tc->tc_get_timecount(tc)
> values do not actually indicate a useful <, ==, or > ordering
> relation.
> 
> (I make no claim that the hack is a proper way to deal with
> such.)
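> 
> For comparison, a paraphrase of the stock path (based on my
> reading of sys/kern/kern_tc.c from around this time; see the
> tree for the exact code): it just masks the delta and finishes,
> with no notion of a step backwards that is too small to order:
> 
>        static __inline u_int
>        tc_delta(struct timehands *th)
>        {
>                struct timecounter *tc;
> 
>                tc = th->th_counter;
>                return ((tc->tc_get_timecount(tc) - th->th_offset_count) &
>                    tc->tc_counter_mask);
>        }
> 
>        void
>        binuptime(struct bintime *bt)
>        {
>                struct timehands *th;
>                u_int gen;
> 
>                do {
>                        th = timehands;
>                        gen = atomic_load_acq_int(&th->th_generation);
>                        *bt = th->th_offset;
>                        bintime_addx(bt, th->th_scale * tc_delta(th));
>                        atomic_thread_fence_acq();
>                } while (gen == 0 || gen != th->th_generation);
>        }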

I did a buildworld and buildkernel on the PowerMac G5 that
took somewhat over 7 hours. Overall the G5 has been up over
13 hours and none of the buf*daemon* threads has gotten
stuck sleeping. Nor has pmac_thermal gotten stuck. The same
goes for vnlru and syncer: "top -HIStopid" still shows them
all as periodically active.

Previously for this usefdt=1 context (with the modern
VM_MAX_KERNEL_ADDRESS), going more than a few minutes
without at least one of those threads getting stuck
sleeping was rare on the G5 (powerpc64 example).

So this hack has managed to avoid sbinuptime() in
sleepq_timeout returning a value less than the earlier (by
call structure/code sequencing) sbinuptime() in timercb that
led to the sleepq_timeout callout being called in the first
place.

So in the sleepq_timeout callout's:

        if (td->td_sleeptimo > sbinuptime() || td->td_sleeptimo == 0) {
                /*
                 * The thread does not want a timeout (yet).
                 */
        } else . . .

td->td_sleeptimo > sbinuptime() now ends up false for small
enough original differences.

That branch (the "does not want a timeout (yet)" case) does
not set up another timeout; it just leaves the thread stuck
sleeping, no longer doing its periodic activities.

As it stands, what I did (presuming an appropriate definition
of "small differences in the problematical direction") should
leave this and other sbinuptime-using code with:

td->td_sleeptimo <= sbinuptime()

for what were originally "small" tbr value differences in the
problematical direction (in case other places require it in
some way).

If, instead, just sleepq_timeout's test could allow for
some slop in the ordering, that could be a cheaper hack than
looping in binuptime (a rough sketch of the idea follows).
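
As a hedged sketch only (not a proposed patch), the slop idea
might look something like the following. SKEW_SLOP_SBT and its
value are invented here purely for illustration; SBT_1US is the
sbintime_t microsecond constant from sys/time.h.

        /* Hypothetical slop-tolerant variant of the sleepq_timeout test. */
        #define SKEW_SLOP_SBT   (SBT_1US * 4)   /* invented tuning value */

        if (td->td_sleeptimo > sbinuptime() + SKEW_SLOP_SBT ||
            td->td_sleeptimo == 0) {
                /*
                 * Only believe "the thread does not want a timeout
                 * (yet)" when the deadline is more than the allowed
                 * cross-CPU skew in the future.
                 */
        } else . . .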

At this point I've no clue what a correct/efficient FreeBSD
design for tolerating the sloppy match across the tbrs of
different CPUs would be.



===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)


