An experiment in PowerMac G5 multi-socket/multi-core having better matching mftb() values

Mark Millard marklmi at yahoo.com
Wed May 15 21:41:20 UTC 2019


[Use on an actual 2-socket 7455-based G4 PowerMac has lead to
an update.]

On 2019-May-14, at 17:42, Mark Millard <marklmi at yahoo.com> wrote:

> [Switching to code not using 64-bit atomics.]
> 
> On 2019-May-13, at 03:23, Mark Millard <marklmi at yahoo.com> wrote:
> 
>> I've been experimenting with a alternate
>> technique of dealing with boot-time 970 family
>> PowerMac G5 tbr value synchronization across
>> sockets/cores. So far it has narrowed the
>> range significantly. I've reverted my hack for
>> tolerating the mismatches in order to see how
>> it goes.
>> 
>> . . .
>> 
>> 
>> # svnlite diff /usr/src/sys/powerpc/powermac/platform_powermac.c /usr/src/sys/powerpc/powerpc/mp_machdep.c | more
>> . . .
> 
> So far the experiment has gone well.
> 
> But the original code used 64-bit atomic types
> and so was inappropriate for multi-socket
> PowerMac G4's.
> 
> So I've been experimenting with an alternate
> coding that is not powerpc64 specific.
> 
> I present the updated code below, still only
> enabling the new technique for PowerMac's.
> 
> The example code does show my use of volatile
> for the ap_pcpu pointer value and so would not
> match the svn code in that respect:
> 
> /usr/src/sys/powerpc/aim/aim_machdep.c:extern void * volatile ap_pcpu;
> /usr/src/sys/powerpc/aim/mp_cpudep.c:void * volatile ap_pcpu;
> /usr/src/sys/powerpc/pseries/platform_chrp.c:extern void *ap_pcpu;
> /usr/src/sys/powerpc/powermac/platform_powermac.c:extern void * volatile ap_pcpu;
> 
> (booke has an example volatile for what is pointed to.
> But I've  not dealt with examples that I do not have to test
> and so, thus far, I only changed the above for the issue.)
> 
> (Whitespace details may not survive such e-mail
> based handling.)
> 
> Index: /usr/src/sys/powerpc/powermac/platform_powermac.c
> ===================================================================
> --- /usr/src/sys/powerpc/powermac/platform_powermac.c   (revision 347549)
> +++ /usr/src/sys/powerpc/powermac/platform_powermac.c   (working copy)
> @@ -55,7 +55,7 @@
> 
> #include "platform_if.h"
> 
> -extern void *ap_pcpu;
> +extern void * volatile ap_pcpu;
> 
> static int powermac_probe(platform_t);
> static int powermac_attach(platform_t);
> @@ -333,6 +333,9 @@
>        return (powermac_smp_fill_cpuref(cpuref, bsp));
> }
> 
> +// platform_powermac.c is implicitly an AIM context: no explicit AIM test.
> +extern volatile int alternate_timebase_sync_style; // 0 indicates old style; 1 indicates new style
> +
> static int
> powermac_smp_start_cpu(platform_t plat, struct pcpu *pc)
> {
> @@ -367,6 +370,13 @@
> 
>        ap_pcpu = pc;
> 
> +       // platform_powermac.c is implicitly an AIM context: no explicit AIM test.
> +       // Part of: Attempt a better-than-historical approximately
> +       //          equal timebase value for ap vs. bsp
> +       alternate_timebase_sync_style= 1; // So: new style for PowerMacs
> +
> +       powerpc_sync(); // for ap_pcpu and alternate_timebase_sync_style
> +
>        if (rstvec_virtbase == NULL)
>                rstvec_virtbase = pmap_mapdev(0x80000000, PAGE_SIZE);
> 
> 
> Index: /usr/src/sys/powerpc/powerpc/mp_machdep.c
> . . .

While things seem okay for the 32-bit powerpc build run on the
2 types of G5's that I sometimes have access to, use on a
2-socket 7455-based G4 PowerMac lead to improvements for 32-bit
support . . .

The G4 has a faster tbr increment rate than the G5s but has
notably slower processors than the G5s. This leads to code
activity showing up more in biasing timings. Controlling the
timing in the code of access the 64-bit tbr value used as
a basis for adjusting the tbr via mttb helps significantly
for such a context. The result is definitely better than
the historical technique, for the 3 types of multi-socket
and/or multi-core PowerMacs that I've been able to test.

(powerpc64 FreeBSD is not intended to have much observable
change from the updated source code.)

The replacement for sys/powerpc/powerpc/mp_machdep.c
experiment is:

Index: /usr/src/sys/powerpc/powerpc/mp_machdep.c
===================================================================
--- /usr/src/sys/powerpc/powerpc/mp_machdep.c	(revision 347549)
+++ /usr/src/sys/powerpc/powerpc/mp_machdep.c	(working copy)
@@ -70,6 +70,88 @@
 static struct mtx ap_boot_mtx;
 struct pcb stoppcbs[MAXCPU];
 
+#if defined(AIM)
+// Part of: Attempt a better-than-historical approximately
+//          equal timebase value for ap vs. bsp
+
+volatile int alternate_timebase_sync_style= 0; // 0 indicates old style; 1 indicates new style.
+volatile uint64_t bsp_timebase_sample= 0u;
+
+volatile unsigned int from_bsp_status_flag= 0u;
+// stages: 0u, 1u (bsp ready to start), 2u (bsp tbr value available to ap)
+
+volatile unsigned int from_ap_status_flag= 0u;
+// stages: 0u, 1u (ap ready for bsp tbr value to be found and sent)
+#endif
+
+static __inline uint64_t
+mftb_with_no_pointer_use(void)
+{
+#ifdef __powerpc64__
+	uint64_t tb; // not used for 32-bit powerpc
+	__asm __volatile ("mftb %0" : "=r"(tb));
+	return tb;
+#else
+	uint32_t tbu; // not pointer into tb
+	uint32_t tbl; // not pointer into tb
+
+	do {
+		tbu= mfspr(TBR_TBU);
+		tbl= mfspr(TBR_TBL);
+	} while (tbu != mfspr(TBR_TBU));
+
+	// The construction of the unint64_t value does bias the mttb some
+	// for the round-trip-start side of things.
+	//
+	// The pointers into tb technique would involve a pair of memory
+	// writes and a pair of memory reads instead, the writes being
+	// in the loop.
+	return ((uint64_t)tbu<<32) | tbl;
+#endif
+}
+
+static __inline uint64_t
+mftb_plus_delta(volatile uint64_t* bsp_tbr, int64_t ap_midpoint)
+	// The return value is used in the mttb as the argument.
+{
+#ifdef __powerpc64__
+uint64_t tb; // not used for 32-bit powerpc
+__asm __volatile ("mftb %0" : "=r"(tb));
+	// The construction of the unint64_t value does bias the mttb some:
+	// it assignes an earlier time than hoped for, given these later
+	// calculations.
+	return tb + ((int64_t)*bsp_tbr - ap_midpoint);
+#else
+	// Establishes delta_for_approx_match_to_bsp_tbr_values such that:
+	//           ap_midpoint+delta_for_approx_match_to_bsp_tbr_values==*bsp_tbr
+	int64_t  delta_for_approx_match_to_bsp_tbr_values;
+	uint32_t tbu;       // not pointer into tb
+	uint32_t tbl;       // not pointer into tb
+
+	do {
+		// The below in-loop style is for avoiding the loop
+		// vs. ap_midpoint's calculation being reversed in
+		// the code generated: volatile is is being put to
+		// use here.
+		delta_for_approx_match_to_bsp_tbr_values= (int64_t)*bsp_tbr-ap_midpoint;
+
+		tbu= mfspr(TBR_TBU);
+		tbl= mfspr(TBR_TBL);
+	} while (tbu != mfspr(TBR_TBU));
+
+	// The construction of the unint64_t value does bias the mttb some:
+	// it assignes an earlier time than hoped for, given these later
+	// calculations. Easily observable on the example 7455 based PowerMac
+	// G4. (Faster than G5 tbr increment rate but a slower processor,)
+	// But the overall process is still an improvement.
+	//
+	// The pointers into tb technique would involve a pair of memory
+	// writes and a pair of memory reads instead, the writes being
+	// in the loop. The "+ . . ." would still be involved.
+	return ( ((uint64_t)tbu<<32) | tbl ) + delta_for_approx_match_to_bsp_tbr_values;
+#endif
+}
+
 void
 machdep_ap_bootstrap(void)
 {
@@ -77,19 +159,75 @@
 	PCPU_SET(awake, 1);
 	__asm __volatile("msync; isync");
 
+#if defined(AIM)
+	powerpc_sync();
+	isync();
+	if (1==alternate_timebase_sync_style)
+	{
+		// Part of: Attempt a better-than-historical approximately
+		//          equal timebase value for ap vs. bsp
+
+		// No claim to deal with overflow/wraparound of tbr, or even
+		// of the upper bit being on.
+
+		register_t oldmsr= intr_disable();
+
+		while (1u!=from_bsp_status_flag)
+			; // spin waiting for bsp to flag that its ready to start.
+
+		//  Start to measure a round trip:: to the bsp and back.
+
+		isync(); // Be sure below mftb() result is not from earlier speculative execution.
+		uint64_t const start_round_trip_time_on_ap= mftb_with_no_pointer_use();
+		atomic_store_rel_int(&from_ap_status_flag, 1u); // bsp waits for such before its mftb().
+
+		while (2u!=from_bsp_status_flag)
+			; // spin waiting for bsp's tbr value
+
+		// Mid-point of ap round trip and the bsp timebase value should be approximately equal
+		// when the tbr's are well matched, absent interruptions on both sides.
+
+		isync(); // Be sure below mftb() result is not from earlier speculative execution.
+		uint64_t const end_round_trip_time_on_ap= mftb_with_no_pointer_use();
+
+		int64_t const approx_round_trip_tbr_delta_on_ap
+				= (int64_t)end_round_trip_time_on_ap - (int64_t)start_round_trip_time_on_ap;
+		int64_t const ap_midpoint_value
+				= (int64_t)start_round_trip_time_on_ap + approx_round_trip_tbr_delta_on_ap/2;
+
+		// The mftb_plus_delta use is for helping to the control the code order relative
+		// to tbr access. Such issues are notable for the 7455 based 2-socket PowerMacs,
+		// for example. Faster tbr increment rate than the G5's but slower processors
+		// and such. Still, overall this definitely helps such contexts compared to the
+		// historical style of timebase synchronization.
+		isync(); // Be sure below mftb() result is not from earlier speculative execution.
+		mttb(mftb_plus_delta(&bsp_timebase_sample,ap_midpoint_value));
+
+		atomic_store_rel_int(&from_bsp_status_flag, 0u); // Get ready for next ap in bsp loop
+		atomic_store_rel_int(&from_ap_status_flag, 0u);  // Flag bsp that this ap is done
+
+		mtmsr(oldmsr);
+	}
+#endif
+
 	while (ap_letgo == 0)
 		nop_prio_vlow();
 	nop_prio_medium();
 
-	/*
-	 * Set timebase as soon as possible to meet an implicit rendezvous
-	 * from cpu_mp_unleash(), which sets ap_letgo and then immediately
-	 * sets timebase.
-	 *
-	 * Note that this is instrinsically racy and is only relevant on
-	 * platforms that do not support better mechanisms.
-	 */
-	platform_smp_timebase_sync(ap_timebase, 1);
+#if defined(AIM)
+	if (0==alternate_timebase_sync_style)
+#endif
+	{
+		/*
+		 * Set timebase as soon as possible to meet an implicit rendezvous
+		 * from cpu_mp_unleash(), which sets ap_letgo and then immediately
+		 * sets timebase.
+		 *
+		 * Note that this is instrinsically racy and is only relevant on
+		 * platforms that do not support better mechanisms.
+		 */
+		platform_smp_timebase_sync(ap_timebase, 1);
+	}
 
 	/* Give platform code a chance to do anything else necessary */
 	platform_smp_ap_init();
@@ -260,6 +398,34 @@
 				    pc->pc_cpuid, (uintmax_t)pc->pc_hwref,
 				    pc->pc_awake);
 			smp_cpus++;
+
+#if defined(AIM)
+			// Part of: Attempt a better-than-historical approximately
+			//          equal timebase value for ap vs. bsp
+			powerpc_sync();
+			isync();
+			if (1==alternate_timebase_sync_style)
+			{
+				register_t oldmsr= intr_disable();
+
+				atomic_store_rel_int(&from_bsp_status_flag, 1u); // bsp ready to start.
+
+				while (1u!=from_ap_status_flag)
+					; // spin waiting for ap to flag: time to send a tbr.
+
+				isync(); // Be sure below mftb() result is not from earlier.
+				bsp_timebase_sample= mftb_with_no_pointer_use();
+				atomic_store_rel_int(&from_bsp_status_flag, 2u); // bsp tbr available.
+
+				// Most of the rest of the usage is in machdep_ap_bootstrap,
+				// other than controling alternate_timebase_sync_style value.
+
+				while (0u!=from_ap_status_flag)
+					; // spin waiting for ap to be done with the sample.
+
+				mtmsr(oldmsr);
+			}
+#endif
 		} else
 			CPU_SET(pc->pc_cpuid, &stopped_cpus);
 	}
@@ -266,14 +432,22 @@
 
 	ap_awake = 1;
 
-	/* Provide our current DEC and TB values for APs */
-	ap_timebase = mftb() + 10;
-	__asm __volatile("msync; isync");
+#if defined(AIM)
+	if (0==alternate_timebase_sync_style)
+#endif
+	{
+		/* Provide our current DEC and TB values for APs */
+		ap_timebase = mftb() + 10;
+		__asm __volatile("msync; isync");
+	}
 	
 	/* Let APs continue */
 	atomic_store_rel_int(&ap_letgo, 1);
 
-	platform_smp_timebase_sync(ap_timebase, 0);
+#if defined(AIM)
+	if (0==alternate_timebase_sync_style)
+#endif
+		platform_smp_timebase_sync(ap_timebase, 0);
 
 	while (ap_awake < smp_cpus)
 		;




===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)



More information about the freebsd-ppc mailing list