svn commit: r238755 - head/sys/x86/x86

Konstantin Belousov kostikbel at gmail.com
Wed Jul 25 13:38:00 UTC 2012


On Wed, Jul 25, 2012 at 03:29:34PM +0300, Andriy Gapon wrote:
> on 25/07/2012 13:21 Konstantin Belousov said the following:
> > On Wed, Jul 25, 2012 at 10:20:02AM +0300, Andriy Gapon wrote:
> >> on 25/07/2012 01:10 Jim Harris said the following:
> >>> Author: jimharris
> >>> Date: Tue Jul 24 22:10:11 2012
> >>> New Revision: 238755
> >>> URL: http://svn.freebsd.org/changeset/base/238755
> >>>
> >>> Log:
> >>>   Add rmb() to tsc_read_##x to enforce serialization of rdtsc captures.
> >>>   
> >>>   Intel Architecture Manual specifies that rdtsc instruction is not serialized,
> >>>   so without this change, TSC synchronization test would periodically fail,
> >>>   resulting in use of HPET timecounter instead of TSC-low.  This caused
> >>>   severe performance degradation (40-50%) when running high IO/s workloads due to
> >>>   HPET MMIO reads and GEOM stat collection.
> >>>   
> >>>   Tests on Xeon E5-2600 (Sandy Bridge) 8C systems were seeing TSC synchronization
> >>>   fail approximately 20% of the time.
> >>
> >> Should rather the synchronization test be fixed if it's the culprit?
> > Synchronization test for what ?
> 
> The synchronization test mentioned above.
> So, oops, very sorry - I missed the fact that the change was precisely in the
> test.  I confused it for another place where tsc is used.  Thank you for pointing
> this out.
> 
> >> Or is this change universally good for the real uses of TSC?
> > 
> > What I understood from the Intel SDM, and also from additional experiments
> > which Jim kindly made despite me being annoying as usual, is that 'read
> > memory barrier' AKA LFENCE there is used for its secondary implementation
> > effects, not for load/load barrier as you might assume.
> > 
> > According to SDM, LFENCE fully drains execution pipeline (but comparing
> > with MFENCE, does not drain write buffers). The result is that RDTSC is
> > not started before previous instructions are finished.
> 
> Yes, I am fully aware of this.
> 
> > For tsc test, this means that after the change RDTSC executions are not
> > reordered on the single core among themself. As I understand, CPU has
> > no dependency noted between two reads of tsc by RDTSC, which allows
> > later read to give lower value of counter. This is fixed by Intel by
> > introduction of RDTSCP instruction, which is defined to be serialization
> > point, and use of which (instead of LFENCE; RDTSC sequence) also fixes
> > test, as confirmed by Jim.
> 
> Yes.  I think that previously Intel recommended to precede rdtsc with cpuid for
> all the same reasons.  Not sure if there is any difference performance-wise
> comparing to lfence.
> Unfortunately, rdtscp is not available on all CPUs, so using it would require
> extra work.
> 
> > In fact, I now think that we should also apply the following patch.
> > Otherwise, consequtive calls to e.g. binuptime(9) could return decreased
> > time stamps. Note that libc __vdso_gettc.c already has LFENCE nearby the
> > tsc reads, which was done not for this reason, but apparently needed for
> > the reason too.
> > 
> > diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
> > index 085c339..229b351 100644
> > --- a/sys/x86/x86/tsc.c
> > +++ b/sys/x86/x86/tsc.c
> > @@ -594,6 +594,7 @@ static u_int
> >  tsc_get_timecount(struct timecounter *tc __unused)
> >  {
> >  
> > +	rmb();
> >  	return (rdtsc32());
> >  }
> >  
> 
> This makes sense to me.  We probably want correctness over performance here.
> [BTW, I originally thought that the change was here; brain malfunction]
> 
> > @@ -602,8 +603,9 @@ tsc_get_timecount_low(struct timecounter *tc)
> >  {
> >  	uint32_t rv;
> >  
> > +	rmb();
> >  	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> > -	: "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> > +	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> >  	return (rv);
> >  }
> >  
> 
> It would correct here too, but not sure if it would make any difference given that
> some lower bits are discarded anyway.  Probably depends on exact CPU.
> 
> 
> And, oh hmm, I read AMD Software Optimization Guide for AMD Family 10h
> Processors and they suggest using cpuid (with a note that it may be
> intercepted in virtualized environments) or _mfence_ in the discussed
> role (Appendix F of the document). Googling for 'rdtsc mfence lfence'
> yields some interesting results.
Yes, MFENCE for AMD.

Since I was infected with these Google results anyway, I looked at the
Linux code. Apparently, they use MFENCE on amd, and LFENCE on Intel.
They also use LFENCE on VIA, it seems. Intel documentation claims that
MFENCE does not serialize instruction execution, which is contrary to
used LFENCE behaviour.

So we definitely want to add some barrier right before rdtsc. And we do
want LFENCE for Intels. Patch below ends with the following code:

Dump of assembler code for function tsc_get_timecount_lfence:
   0xffffffff805563a0 <+0>:     push   %rbp
   0xffffffff805563a1 <+1>:     mov    %rsp,%rbp
   0xffffffff805563a4 <+4>:     lfence 
   0xffffffff805563a7 <+7>:     rdtsc  
   0xffffffff805563a9 <+9>:     leaveq 
   0xffffffff805563aa <+10>:    retq   
Dump of assembler code for function tsc_get_timecount_low_lfence:
   0xffffffff805556a0 <+0>:     push   %rbp
   0xffffffff805556a1 <+1>:     mov    %rsp,%rbp
   0xffffffff805556a4 <+4>:     lfence 
   0xffffffff805556a7 <+7>:     mov    0x30(%rdi),%rcx
   0xffffffff805556ab <+11>:    rdtsc  
   0xffffffff805556ad <+13>:    shrd   %cl,%edx,%eax
   0xffffffff805556b0 <+16>:    leaveq 
   0xffffffff805556b1 <+17>:    retq 
which I would qualify as perfect outcome.

Patch below still leaves libc only suitable for Intels and VIA, not for AMD,
but I will take care of it later.

diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
index c253a96..06cfe10 100644
--- a/sys/x86/x86/tsc.c
+++ b/sys/x86/x86/tsc.c
@@ -83,6 +83,10 @@ static void tsc_freq_changing(void *arg, const struct cf_level *level,
     int *status);
 static unsigned tsc_get_timecount(struct timecounter *tc);
 static unsigned tsc_get_timecount_low(struct timecounter *tc);
+static unsigned tsc_get_timecount_lfence(struct timecounter *tc);
+static unsigned tsc_get_timecount_low_lfence(struct timecounter *tc);
+static unsigned tsc_get_timecount_mfence(struct timecounter *tc);
+static unsigned tsc_get_timecount_low_mfence(struct timecounter *tc);
 static void tsc_levels_changed(void *arg, int unit);
 
 static struct timecounter tsc_timecounter = {
@@ -262,6 +266,10 @@ probe_tsc_freq(void)
 		    (vm_guest == VM_GUEST_NO &&
 		    CPUID_TO_FAMILY(cpu_id) >= 0x10))
 			tsc_is_invariant = 1;
+		if (cpu_feature & CPUID_SSE2) {
+			tsc_timecounter.tc_get_timecount =
+			    tsc_get_timecount_mfence;
+		}
 		break;
 	case CPU_VENDOR_INTEL:
 		if ((amd_pminfo & AMDPM_TSC_INVARIANT) != 0 ||
@@ -271,6 +279,10 @@ probe_tsc_freq(void)
 		    (CPUID_TO_FAMILY(cpu_id) == 0xf &&
 		    CPUID_TO_MODEL(cpu_id) >= 0x3))))
 			tsc_is_invariant = 1;
+		if (cpu_feature & CPUID_SSE2) {
+			tsc_timecounter.tc_get_timecount =
+			    tsc_get_timecount_lfence;
+		}
 		break;
 	case CPU_VENDOR_CENTAUR:
 		if (vm_guest == VM_GUEST_NO &&
@@ -278,6 +290,10 @@ probe_tsc_freq(void)
 		    CPUID_TO_MODEL(cpu_id) >= 0xf &&
 		    (rdmsr(0x1203) & 0x100000000ULL) == 0)
 			tsc_is_invariant = 1;
+		if (cpu_feature & CPUID_SSE2) {
+			tsc_timecounter.tc_get_timecount =
+			    tsc_get_timecount_lfence;
+		}
 		break;
 	}
 
@@ -328,7 +344,15 @@ init_TSC(void)
 
 #ifdef SMP
 
-/* rmb is required here because rdtsc is not a serializing instruction. */
+/*
+ * RDTSC is not a serializing instruction, so we need to drain
+ * instruction stream before executing it. It could be fixed by use of
+ * RDTSCP, except the instruction is not available everywhere.
+ *
+ * Insert both MFENCE for AMD CPUs, and LFENCE for others (Intel and
+ * VIA), and assume that SMP test is only performed on CPUs that have
+ * SSE2 anyway.
+ */
 #define	TSC_READ(x)			\
 static void				\
 tsc_read_##x(void *arg)			\
@@ -337,6 +361,7 @@ tsc_read_##x(void *arg)			\
 	u_int cpu = PCPU_GET(cpuid);	\
 					\
 	rmb();				\
+	mb();				\
 	tsc[cpu * 3 + x] = rdtsc32();	\
 }
 TSC_READ(0)
@@ -487,7 +512,16 @@ init:
 	for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++)
 		;
 	if (shift > 0) {
-		tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
+		if (cpu_feature & CPUID_SSE2) {
+			if (cpu_vendor_id == CPU_VENDOR_AMD) {
+				tsc_timecounter.tc_get_timecount =
+				    tsc_get_timecount_low_mfence;
+			} else {
+				tsc_timecounter.tc_get_timecount =
+				    tsc_get_timecount_low_lfence;
+			}
+		} else
+			tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
 		tsc_timecounter.tc_name = "TSC-low";
 		if (bootverbose)
 			printf("TSC timecounter discards lower %d bit(s)\n",
@@ -592,23 +626,55 @@ sysctl_machdep_tsc_freq(SYSCTL_HANDLER_ARGS)
 SYSCTL_PROC(_machdep, OID_AUTO, tsc_freq, CTLTYPE_U64 | CTLFLAG_RW,
     0, 0, sysctl_machdep_tsc_freq, "QU", "Time Stamp Counter frequency");
 
-static u_int
+static inline u_int
 tsc_get_timecount(struct timecounter *tc __unused)
 {
 
 	return (rdtsc32());
 }
 
-static u_int
+static inline u_int
 tsc_get_timecount_low(struct timecounter *tc)
 {
 	uint32_t rv;
 
 	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
-	: "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
+	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
 	return (rv);
 }
 
+static inline u_int
+tsc_get_timecount_lfence(struct timecounter *tc __unused)
+{
+
+	rmb();
+	return (rdtsc32());
+}
+
+static inline u_int
+tsc_get_timecount_low_lfence(struct timecounter *tc)
+{
+
+	rmb();
+	return (tsc_get_timecount_low(tc));
+}
+
+static inline u_int
+tsc_get_timecount_mfence(struct timecounter *tc __unused)
+{
+
+	mb();
+	return (rdtsc32());
+}
+
+static inline u_int
+tsc_get_timecount_low_mfence(struct timecounter *tc)
+{
+
+	mb();
+	return (tsc_get_timecount_low(tc));
+}
+
 uint32_t
 cpu_fill_vdso_timehands(struct vdso_timehands *vdso_th)
 {
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 196 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/svn-src-all/attachments/20120725/4cb86b97/attachment.pgp


More information about the svn-src-all mailing list