svn commit: r222866 - head/sys/x86/x86

Mon Jun 20 23:41:16 UTC 2011

On Saturday 18 June 2011 08:05 am, Bruce Evans wrote:
> Long ago, On Wed, 8 Jun 2011, Jung-uk Kim wrote:
> > On Wednesday 08 June 2011 04:55 pm, Bruce Evans wrote:
> >> On Wed, 8 Jun 2011, Jung-uk Kim wrote:
> >>> Log:
> >>>  Introduce low-resolution TSC timecounter "TSC-low".  It
> >>> replaces the normal TSC timecounter if TSC frequency is higher
> >>> than ~4.29 GHz (or 2^32-1 Hz) or
> >>
> >> It should be a separate timecounter so that the user can choose
> >> it independently, at least in the SMP case where it is very low
> >> (at most ~4.29 GHz >> 8 ~= 17 MHz).
> >
> > As I noted in the log, it is still higher than the previous
> > default ACPI-fast, which is ~3.68 MHz and I've never heard of any
> > complaint about ACPI-fast being too low. ;-)
>
> That's because it is too low to measure itself being low :-).
>
> > Nothing prevents us from making a separate timecounter, though. 
> > In fact, we can do the same for ACPI-fast/ACPI-safe.  However,
> > that'll only confuse users, IMHO.
>
> TSC/TSC-low sort of corresponds to ACPI-fast/ACPI-safe.  Users can
> switch between the latter.

How do we do that?

    if (j == 10) {
	acpi_timer_timecounter.tc_name = "ACPI-fast";
	acpi_timer_timecounter.tc_get_timecount =
	    acpi_timer_get_timecount;
	acpi_timer_timecounter.tc_quality = 900;
    } else {
	acpi_timer_timecounter.tc_name = "ACPI-safe";
	acpi_timer_timecounter.tc_get_timecount =
	    acpi_timer_get_timecount_safe;
	acpi_timer_timecounter.tc_quality = 850;
    }

We didn't have any code to influence this selection as far as I
can remember.

> What they can't do is run both concurrently, either to compare them
> or use the best one that works in the current context.  That would
> be more for developers and is not implemented mainly because it has more
> complexity (only a tiny amount of extra overhead I think, provided
> you don't try to keep the 2 times coherent -- just an extra windup
> for each active timecounter).
>
> >>> static void tsc_levels_changed(void *arg, int unit);
> >>>
> >>> static struct timecounter tsc_timecounter = {
> >>> @@ -392,11 +393,19 @@ test_smp_tsc(void)
> >>> static void
> >>> init_TSC_tc(void)
> >>
> >> This seems to only be called once at boot time.  So the lowness
> >> may be much lower than necessary if the levels are reduced
> >> significantly later.
> >
> > It'll only happen when the CPU is started at the highest
> > frequency and TSC is not invariant.  In this case, its quality
> > will be set to 800 and HPET or ACPI timecounter will be selected
> > by default.  I don't see much problem with the default choice
> > here.
>
> Can the CPU be started at a low frequency and throttled up later?

Yes, Intel mobile parts may do that.

> I agree that the non-invariant case is not very important. 

Exactly.

> >>> {
> >>> +	uint64_t max_freq;
> >>> +	int shift;
> >>>
> >>> 	if ((cpu_feature & CPUID_TSC) == 0 || tsc_disabled)
> >>> 		return;
> >>>
> >>> 	/*
> >>> +	 * Limit timecounter frequency to fit in an int and prevent it from
> >>> +	 * overflowing too fast.
> >>> +	 */
> >>> +	max_freq = UINT_MAX;
> >>> +
> >>> +	/*
> >>> 	 * We can not use the TSC if we support APM.  Precise timekeeping
> >>> 	 * on an APM'ed machine is at best a fools pursuit, since
> >>> 	 * any and all of the time spent in various SMM code can't
> >>> @@ -418,13 +427,27 @@ init_TSC_tc(void)
> >>> 	 * We can not use the TSC in SMP mode unless the TSCs on all CPUs are
> >>> 	 * synchronized.  If the user is sure that the system has synchronized
> >>> 	 * TSCs, set kern.timecounter.smp_tsc tunable to a non-zero value.
> >>> +	 * We also limit the frequency even lower to avoid "temporal anomalies"
> >>> +	 * as much as possible.
> >>> 	 */
> >>> -	if (smp_cpus > 1)
> >>> +	if (smp_cpus > 1) {
> >>> 		tsc_timecounter.tc_quality = test_smp_tsc();
> >>> +		max_freq >>= 8;
> >>> +	}
> >>
> >> This gives especially low lowness if the levels are reduced
> >> significantly. Maybe as low as 100 MHz >> 8 = ~390 KHz = lower
> >> than an i8254.
> >
> > I don't remember any SMP-capable x86 ever running at 100 MHz
> > unless it is seriously under-clocked.  Even if it existed, it
> > won't be available today. :-P
>
> Doesn't throttling give underclocking?

T-states *usually* do not change the CPU frequency itself.  Only P-states
can change the TSC frequency.  However, some broken implementations *may*
just stop incrementing the TSC in very low T-states (or C-states).  AMD
does not have this problem for invariant TSCs.  It seems Intel also
fixed it for recent processors, though I am not sure whether that was
with Nehalem or Sandy Bridge.

> Maybe not as low as 100 MHz, but quite low.  Only a possible problem
> for the non-invariant case anyway.

Agreed.

> >> OTOH, maybe the temporal anomalies scale with the TSC frequency,
> >> so you need to right shift by a few irrespective of the TSC
> >> frequency. A shift count of 8 seems too much, but if the initial
> >> TSC frequency is already < 2**32 shifted by 8, then the final
> >> shift is 0.
>
> This is my main point.  How can it be right to reduce the extra
> shift for SMP (if this shift is needed at all) just because the
> initial TSC frequency is low?  All instructions are clocked, so
> non-temporalness within a core scales with the current frequency. 
> Oops, this leads back to my previous point that the scaling should
> depend on the current frequency and not just on the initial
> frequency.  Across cores, it isn't so clear what the
> non-temporalness scales with.  The non-temporalness is FUD so its
> scaling could be anything :-).

My questions to you:

a) Why do we care about the TSC timecounter when it is not invariant,
where we *know* it is unusable and it is set to negative quality?

b) Why should we complicate the code when invariant frequency == current
frequency == initial frequency?
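
For reference, here is a minimal userland sketch of how the shift falls
out of the initial frequency.  It assumes the shift is simply the
smallest one that brings the frequency down to max_freq or below.  The
function name and the "smp" flag are made up for illustration and are
not the identifiers in tsc.c.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: smallest right shift that brings tsc_freq (Hz)
 * down to max_freq or below.  max_freq starts at UINT_MAX and is
 * reduced by a further 8 bits in the SMP case, as in the diff above.
 */
static int
tsc_low_shift(uint64_t tsc_freq, int smp)
{
	uint64_t max_freq = UINT_MAX;
	int shift;

	if (smp)
		max_freq >>= 8;
	for (shift = 0; shift < 32 && (tsc_freq >> shift) > max_freq; shift++)
		;
	return (shift);
}
```

Under that assumption, a 4 GHz SMP box ends up with a shift of 8, a
2 GHz SMP box with 7, and a 2 GHz UP box with 0, i.e., the shift is fixed
by whatever the frequency was when it was computed.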

> >> ...
> >> Perhaps the levels can also be increased significantly later. 
> >> Then the timecounter frequency may exceed 4.29 GHz despite its
> >> scaling.
> >
> > Again, it can only happen when the CPU was started at low
> > frequency and the TSC is not invariant.  For that case, TSC won't
> > be selected by default unless both HPET and ACPI timers are
> > disabled/unavailable.
>
> But users can select it, and since users can't control the scaling
> or even select between TSC/TSC-low, TSC-low must be scaled properly
> initially to have the best chance of working later.

Maybe we should not allow users to select a negative-quality timecounter
in the first place.  Or maybe we should print scary warning messages
if they try to shoot themselves in the foot.  Sigh...

> >>> @@ -520,8 +545,15 @@ SYSCTL_PROC(_machdep, OID_AUTO, tsc_freq
> >>>     0, 0, sysctl_machdep_tsc_freq, "QU", "Time Stamp Counter frequency");
> >>>
> >>> static u_int
> >>> -tsc_get_timecount(struct timecounter *tc)
> >>> +tsc_get_timecount(struct timecounter *tc __unused)
> >>> {
> >>>
> >>> 	return (rdtsc32());
> >>> }
> >>> +
> >>> +static u_int
> >>> +tsc_get_timecount_lowres(struct timecounter *tc)
> >>> +{
> >>> +
> >>> +	return (rdtsc() >> (int)(intptr_t)tc->tc_priv);
> >>
> >> This forces a slow 64-bit shift (shrdl; shrl) in all cases.
> >
> > Yes, it does, unfortunately.
> >
> > I have no clue why AMD didn't implement native 64-bit RDTSC (and
> > RDMSR/WRMSR) in the first place. :-(
>
> I didn't notice before that it still goes to a register pair on
> amd64.
>
> >> rdtsc32() with a scaled tc_counter_mask should work OK
> >> (essentially the same as the non-low timecounter except for
> >> reduced accuracy; the only loss is a decrease in the time until
> >> counter overflow to the same as for the non-low timecounter).
> >
> > I thought about that but I didn't like that idea, i.e., losing
> > resolution and accuracy at the same time.
>
> But it doesn't lose any more resolution or accuracy than any shift
> necessarily uses.  It only loses wrap time, which is of no interest
> for a small reduction.  See another reply.
>
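
If I read the suggestion correctly, it amounts to something like the
sketch below.  The helper names are hypothetical (none of them exist in
tsc.c), and the TSC value is passed in as a plain argument so that the
comparison can be done in userland.

```c
#include <stdint.h>

/* Current approach: shift the full 64-bit TSC, then truncate to 32 bits. */
static uint32_t
timecount_shift64(uint64_t tsc, int shift)
{
	return ((uint32_t)(tsc >> shift));
}

/* Suggested approach: shift only the low 32 bits (what rdtsc32() returns). */
static uint32_t
timecount_shift32(uint64_t tsc, int shift)
{
	return ((uint32_t)tsc >> shift);
}

/* Narrowed counter mask that would go with the 32-bit variant. */
static uint32_t
scaled_counter_mask(int shift)
{
	return (UINT32_MAX >> shift);
}
```

For shifts in the range 0..31, the two variants agree on every bit
covered by the narrowed mask.  The 32-bit one just has "shift" fewer
significant bits, so it wraps 2^shift times sooner.
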
> The shift of 8 for SMP still seems far too much.  clock_gettime()
> with a TSC timecounter on an old 2GHz system takes about 250 nS.  I
> hope it takes only 1/2 that on a newer system.  nanouptime() in the
> kernel takes more like 30 nS on the old system.  It should at least
> try to have enough resolution for sequential calls to it to never
> return the same time (even ACPI-fast has this property -- about
> 1000 nS per call and a resolution of about 250 nS).  rdtsc on old
> Athlons takes only 12 (9?) cycles so you could almost use it to
> time individual instructions (modulo out of order execution).  The
> invariant versions have to be much slower for synchronization :-(. 
> They take at least 42 cycles AFAIR.  A shift count of 5 would lose
> less resolution than an invariant TSC really has so it would be
> good if it is enough to hide the nontemporalness.  A shift count of
> 6 would be OK too.  But a shift count of 8 lets you execute about 4
> nanouptime()'s for every change in the time returned.  OTOH, 256
> cycles at 4 GHz is about 64 nS and clock_gettime() unfortunately
> takes longer (except on Linux? :-(), so a shift count of 8 is OK
> for it.
>
> My clock measurement program (mostly an old program by Wollman)
> shows the following histogram of times for a non-invariant TSC
> timecounter on a 2GHz UP system:
>
> % min 273, max 265102, mean 273.998217, std 79.069534
> % 1th: 273 (1727219 observations)
> % 2th: 274 (265607 observations)
> % 3th: 275 (6984 observations)
> % 4th: 280 (11 observations)
> % 5th: 290 (8 observations)
>
> The variance is small, and differences of a single nS can be seen
> clearly. With the SMP shift of 8 on a 4GHz system, the minimum
> difference would be 64 nS so it would be impossible to see the
> details of the distribution about the mean of 273.998 nS.
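
The resolution figures above are easy to reproduce.  A small sketch,
with a made-up helper name, that converts a TSC frequency and a shift
count into the length of one timecounter tick in whole nanoseconds:

```c
#include <stdint.h>

/*
 * Hypothetical helper: length of one timecounter tick, in nanoseconds
 * (truncated), for a TSC running at tsc_freq Hz and scaled down by
 * "shift" bits.  The intermediate value fits in 64 bits for shift < 32.
 */
static uint64_t
tick_ns(uint64_t tsc_freq, int shift)
{
	return ((UINT64_C(1000000000) << shift) / tsc_freq);
}
```

With a shift of 8 this gives 64 ns at 4 GHz and 128 ns at 2 GHz (roughly
four 30 ns nanouptime() calls per tick).  With a shift of 5 at 2 GHz it
gives 16 ns, which is below the ~21 ns that a 42-cycle rdtsc takes there.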

Thanks for the info,

Jung-uk Kim