head -r344018 powerpc64 variant on Powermac G5 (2 sockets, 2 cores each): [*buffer arena] shows up more . . .? [mpc85xx_smp_timebase_sync problem too]

Wed Apr 17 00:21:07 UTC 2019

[Looks a little odder in the details than I initially thought: the
0 in platform_smp_timebase_sync(timebase, 0) does not mean the bsp
instead of some other ap. I've still not found a reasonable way
to not have the sleep-gets-stuck issue for usefdt mode.]

On 2019-Apr-16, at 15:36, Mark Millard <marklmi at yahoo.com> wrote:

> [Unfortunately cpufreq_drv_set leads to additional
> platform_smp_timebase_sync(timebase, 0) use.]
> 
> On 2019-Apr-16, at 14:42, Justin Hibbits <chmeeedalf at gmail.com> wrote:
> 
>> On Tue, 16 Apr 2019 14:26:39 -0700
>> Mark Millard <marklmi at yahoo.com> wrote:
>> 
>>> [Looks to me like the use and content of mpc85xx_smp_timebase_sync
>>> have the same type of problems I noted for the proposed powermac
>>> patch.]
>>> 
>> ...
>>>> 
>>>> As mentioned, I had only compiled it.  Your examination of the code
>>>> path demonstrates that the patch is insufficient, and would hang at
>>>> unleash anyway.  The sleep/wake logic probably needs to be updated
>>>> anyway.  It was written for a G4 powerbook primarily for the
>>>> PMU-based cpufreq driver, so some bits might need to be moved
>>>> around.  Orthogonal to this issue, though.  
>>> 
>>> It appears to me that the overall sequence:
>>> 
>>>      platform_smp_timebase_sync(0, 1); // called from
>>> cpudep_ap_setup . . .
>>> // The following are from in machdep_ap_bootstrap . . .
>>>      while (ap_letgo == 0)
>>>              __asm __volatile("or 31,31,31");
>>>      __asm __volatile("or 6,6,6");
>>>      . . .
>>>      platform_smp_timebase_sync(ap_timebase, 1);
>>> 
>>> calls mpc85xx_smp_timebase_sync twice per ap and that the
>>> 2nd call has tb_ready && mp_ncpus<=cpu_done for each such
>>> ap, leading to a lack of coordination of the activity that
>>> 2nd time for each ap:
>>> 
>>> static void
>>> mpc85xx_smp_timebase_sync(platform_t plat, u_long tb, int ap)
>>> {
>>>       static volatile bool tb_ready;
>>>       static volatile int cpu_done;
>>> 
>>>       if (ap) {
>>>               /* APs.  Hold off until we get a stable timebase. */
>>>               while (!tb_ready)
>>>                       atomic_thread_fence_seq_cst();
>>>               mttb(tb);
>>>               atomic_add_int(&cpu_done, 1);
>>>               while (cpu_done < mp_ncpus)
>>>                       atomic_thread_fence_seq_cst();
>>>       } else {
>>>               /* BSP */
>>>               freeze_timebase(rcpm_dev, true);
>>>               tb_ready = true;
>>>               mttb(tb);
>>>               atomic_add_int(&cpu_done, 1);
>>>               while (cpu_done < mp_ncpus)
>>>                       atomic_thread_fence_seq_cst();
>>>               freeze_timebase(rcpm_dev, false);
>>>       }
>>> }
>>> 
>> 
>> Not for mpc85xx.  This is call is only in the AIM cpudep_ap_setup, and
>> really shouldn't be there anyway.
> 
> Sorry for having looked in the wrong source.
> 
>> The original code to just set the
>> timebase was there to set it to 0 just as a semi-sane value until the
>> core got to a stable state.  The original code was literally "mttb(0)"
>> for G4, and a check for hypervisor with mttb(0).  Had that been
>> sufficient, the platform_smp_timebase_sync() would not be a problem to
>> do the real sync as I had mentioned in my patch before.
>> 
>> The powermac patch that I had provided was derived from this
>> well-working mpc85xx patch.  However, I had neglected to check that the
>> sync wasn't used elsewhere as well.  If the cpudep_ap_setup() use is
>> removed, then this should work fine.
> 
> Unfortunately there is use from cpufreq_drv_set getting to cpu_sleep's
> platform_smp_timebase_sync use as well:
> 
> /usr/src/sys/powerpc/powerpc/mp_machdep.c:      platform_smp_timebase_sync(ap_timebase, 1);
> /usr/src/sys/powerpc/powerpc/mp_machdep.c:      platform_smp_timebase_sync(ap_timebase, 0);
> /usr/src/sys/powerpc/aim/aim_machdep.c: platform_smp_timebase_sync(timebase, 0);
> 
> that last is in cpu_sleep. Tracing towards what ultimately uses cpu_sleep . . .
> 
> void
> powermac_sleep(platform_t platform)
> {
> 
>        *(unsigned long *)0x80 = 0x100;
>        cpu_sleep();
> }
> 
> is used as platform_sleep.
> 
> pu_mu_set_speed calls platform_sleep and is called by pmufreq_set.
> pmufreq_set is used as cpufreq_drv_set.
> 
> At least on the PowerMac11,2, this seems to be in use for
> powerd if powerpd is started.
> 
> It also looks like such use does not try to make the other aps
> track the change, just ap==0 .
> 
> That may explain much of the variability across CPUs for usefdt mode!
> 
> It looks to me like the ap==0 case in the platform_smp_timebase_sync
> use in cpu_sleep would not be well behaved under the patch,
> even for that one ap:
> 
> static void
> powermac_smp_timebase_sync(platform_t plat, u_long tb, int ap)
> {
> +	static int cpus;
> +	static int unleash;
> +
> +	if (ap) {
> +		atomic_add_int(&cpus, 1);
> +		while (!atomic_load_acq_int(&unleash))
> +			;
> +	} else {
> +		atomic_add_int(&cpus, 1);
> +		while (atomic_load_int(&cpus) != mp_ncpus)
> +			;
> +		atomic_store_rel_int(&unleash, 1);
> +	}
> 
> 	mttb(tb);
> }
> 
> The else would increment cpus beyond mp_cpus and the loop
> would be stuck.

cpu_sleep getting to powermac_smp_timebase_sync seems to be based
on cpu_reset_handler doing the longjmp in:

        GET_CPUINFO(%r5)
        ld      %r3,(PC_RESTORE)(%r5)
        cmpldi  %cr0,%r3,0
        beq     %cr0,2f
        nop
        li      %r4,1
        bl      CNAME(longjmp)
        nop
2:
#ifdef SMP
        bl      CNAME(machdep_ap_bootstrap)     /* And away! */
        nop
#endif

The PC_RESTORE related test is tied to:

        PCPU_SET(restore, &resetjb);

in cpu_sleep. The usage is:

        if (setjmp(resetjb) == 0) {
                . . .
                timebase = mftb();
                . . .
                while (1)
                        mtmsr(msr);
        }
        platform_smp_timebase_sync(timebase, 0);
        ...

So it seems there is a  platform_smp_timebase_sync(timebase, 0)
for every cpu that gets cpu_reset_handler invoked, even the
cpu's that are not the bsp.

I had assumed the ", 0" meant a bsp context previously.

There is:

        static u_quad_t timebase = 0;

So timebase is not preserved for each cpu if more than
one ends up in cpu_sleep. I'm not sure the timebase value
is synchronized across cpus. (It is not volatile either.
I'm not sure that the code generation would always
store and load the value from memory.)

Overall it looks like the ", 0" is (ambiguously) intending to
indicate that the platform_smp_timebase_sync activity should
be unsynchronized with other cpus. May be a distinct, alternate
argument value for that that is explicitly handled?

So I'm still not to a stage of having a reasonable workaround
for the "sleep gets stuck" problem for usefdt mode. (Just an
inappropriate avoidance-hack.)

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233863 has
investigative patches for other usefdt tied issues, patches
that could be useful starting points for official updates.
But the hack that I've been using to avoid stuck-sleeps is
not appropriate --and so is not submitted.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)