svn commit: r238755 - head/sys/x86/x86

Bruce Evans brde at optusnet.com.au
Wed Jul 25 14:16:05 UTC 2012


On Wed, 25 Jul 2012, Konstantin Belousov wrote:

> On Wed, Jul 25, 2012 at 10:20:02AM +0300, Andriy Gapon wrote:
>> on 25/07/2012 01:10 Jim Harris said the following:
>>> Author: jimharris
>>> Date: Tue Jul 24 22:10:11 2012
>>> New Revision: 238755
>>> URL: http://svn.freebsd.org/changeset/base/238755
>>>
>>> Log:
>>>   Add rmb() to tsc_read_##x to enforce serialization of rdtsc captures.
>>>
>>>   Intel Architecture Manual specifies that rdtsc instruction is not serialized,
>>>   so without this change, TSC synchronization test would periodically fail,
>>>   resulting in use of HPET timecounter instead of TSC-low.  This caused
>>>   severe performance degradation (40-50%) when running high IO/s workloads due to
>>>   HPET MMIO reads and GEOM stat collection.
>>>
>>>   Tests on Xeon E5-2600 (Sandy Bridge) 8C systems were seeing TSC synchronization
>>>   fail approximately 20% of the time.
>>
>> Should rather the synchronization test be fixed if it's the culprit?
> Synchronization test for what ?
>
>> Or is this change universally good for the real uses of TSC?

It's too slow for real uses.  But synchronization code, and some uses
that requires serialization may need it for, er, synchronization and
serialization.

It's hard to think of many uses that need serialization.  I often use
it for timing instructions.  For timng a large number of instructions,
serialization doesn't matter since errors of a few tens in a few billion
done matter.  For timing a small number of instructions, I don't want
serialization, since the serialization invalidates the timing.

Most uses in FreeBSD are for timecounters.  Timecounters deliver the
current time.  This is unrelated to whatever instructions haven't
completed when the TSC is read.  Except possibly when the time needs
to be synchronized across CPUs, and when the uncompleted instruction
is a TSC read.

> For tsc test, this means that after the change RDTSC executions are not
> reordered on the single core among themself. As I understand, CPU has
> no dependency noted between two reads of tsc by RDTSC, which allows
> later read to give lower value of counter.

Gak.  Even when they are in the same instruction sequence?  Even though
the TSC reads fixed registers and some other instructions in the sequence
between the TSC use these registers?  The CPU would have to do significant
register renaming to break this.

> This is fixed by Intel by
> introduction of RDTSCP instruction, which is defined to be serialization
> point, and use of which (instead of LFENCE; RDTSC sequence) also fixes
> test, as confirmed by Jim.

This is not a fix if it is full serialization.  It just gives slowness
using a single instruction instead of a couple.

> In fact, I now think that we should also apply the following patch.
> Otherwise, consequtive calls to e.g. binuptime(9) could return decreased
> time stamps. Note that libc __vdso_gettc.c already has LFENCE nearby the
> tsc reads, which was done not for this reason, but apparently needed for
> the reason too.
>
> diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
> index 085c339..229b351 100644
> --- a/sys/x86/x86/tsc.c
> +++ b/sys/x86/x86/tsc.c
> @@ -594,6 +594,7 @@ static u_int
> tsc_get_timecount(struct timecounter *tc __unused)
> {
>
> +	rmb();
> 	return (rdtsc32());
> }

Please don't pessimize this further.  The time for rdtsc went from 6.5
cycles on AthlonXP to 65 cycles on core2 (mainly for for
P-state-invariance hardware synchronization I think).  Pretty soon it
will be as slow as an HPET and heading towards an i8254.  Adding rmb()
only makes it 12 cycles slower on core2, but 16 cycles (almost 3 times)
slower on AthlonXP.

> @@ -602,8 +603,9 @@ tsc_get_timecount_low(struct timecounter *tc)
> {
> 	uint32_t rv;
>
> +	rmb();
> 	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> -	: "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> +	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> 	return (rv);
> }

The previous TSC-low/shrd pessimization adds only 2 cycles on AthlonXP
and core2.  I think it only "works" by backing the TSC's resolution
so low that it usually can't see its own, or at least other TSC's lack of
serialness.  The shift count is usually 7 or 8, so the resolution is
reduced from 1 cycle to 128 or 256.  Out of order times that fall in
the same block of 128 or 256 cycles would appear to be the same, but
out of order times like 129 and 127 would apear to be even more out
of order after a shift of 7 turns them into 128 and 0.

Bruce


More information about the svn-src-all mailing list