TSC "skew" (was: Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed])

Bruce Evans brde at optusnet.com.au
Tue Mar 5 13:19:45 UTC 2019


On Tue, 5 Mar 2019, Bruce Evans wrote:

> On Mon, 4 Mar 2019, Konstantin Belousov wrote:

>* [... shift for bogus TSC-low timecounter]
>> I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
>> Otherwise, I think, some multi-socket machines would start showing the
>> detectable backward-counting bintime().  At frequencies of 4GHz and
>> above (Intel has 5GHz part numbers) I do not think that the stability of
>> a 100MHz crystal and on-board traces is enough to avoid that.
>
> I think it is just a kludge that reduced the problem before it was fixed
> properly using fences.
>
> Cross-socket latency is over 100 cycles according to jhb's tscskew
> benchmark: on Haswell 4x2:
>
> CPU | TSC skew (min/avg/max/stddev)
> ----+------------------------------
>  0 |     0     0     0    0.000
>  1 |    24    49    84   14.353
>  2 |   164   243   308   47.811
>  3 |   164   238   312   47.242
>  4 |   168   242   332   49.593
>  5 |   168   243   324   48.722
>  6 |   172   242   320   52.596
>  7 |   172   240   316   53.014
>
> freefall is similar.  Latency is apparently measured relative to CPU 0.
> It is much lower to CPU 1 since that is on the same core.
>
> I played with this program a lot 3 and a half years ago, but forgot
> most of what I learned :-(.  I tried different fencing in it.  This
> seems to make little difference when the program is rerun.  With the
> default TESTS = 1024, the min skew sometimes goes negative on freefall,
> but with TESTS = 1024000 that doesn't happen.  This is the opposite
> of what I would expect.  freefall has load average about 1.

I understand this program again.  First, its name is actually tscdrift.
I tested the 2015 version, and this version is still in
/usr/src/tools/tools/tscdrift/tscdrift.c, with no changes except to
the copyright (rgrimes wouldn't like this) and to $FreeBSD$.

The program doesn't actually measure either TSC drift or TSC skew, except
indirectly.  What it actually measures is the IPC (Inter-Process-
Communication) time for synchronizing the drift and skew measurements,
except that bugs or intentional sloppiness in its synchronization also make
it give an indirect measurement of similar bugs or sloppiness in normal use.

After changing TESTS from 1024 to 1024000, it shows large errors in the
negative direction, as expected from either large negative skew or program
bugs; this is on freefall:

XX CPU | TSC skew (min/avg/max/stddev)
XX ----+------------------------------
XX   0 |     0     0     0    0.000
XX   1 | -6148   108 10232   46.871
XX   2 |   114   209 95676   163.359
XX   3 |    96   202 47835   101.250
XX   4 | -2223   207 34017   117.257
XX   5 | -2349   206 33837   106.259
XX   6 | -2664   213 33579   96.048
XX   7 | -2451   212 49242   126.428

The negative "skews" occur because the server and the clients (1 client at
a time) read the TSC with uncontrolled timing after the server opens the
gate for this read (gate = 2).  The IPC time is about 200 cycles to CPUs
on different cores.  So when neither thread is preempted, the TSC on the
server is about 200 cycles in advance.  Sometimes the server is preempted,
so it reads its TSC later than the client (a maximum of about 6148 cycles
later in this test).  More often the client is preempted, since the IPC
time is much larger than the time between the server opening the gate and
the server reading its TSC.
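
The race is visible in the server's loop before the fix (diff below): the
gate is opened before the server reads its own TSC:

XX 			gate = 2;	/* open the gate first... */
XX 			barrier();
XX 
XX 			tsc = rdtsc();	/* ...then read the TSC, racing the client */

Normally the ~200 cycle IPC time means the server's read wins the race, so
the client's thread_tsc comes out about 200 cycles ahead of tsc; when the
server is preempted between the store to 'gate' and its rdtsc(), the
client's read wins instead and the reported "skew" goes negative.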

The server is also missing fencing for its TSC read, so this read may
appear to occur several cycles before opening the gate.  This gives an
error in the positive direction for the reported "skew" (the error is
actually in the positive direction for the reported IPC time).  It
would be useful to measure this error by intentionally omitting fencing,
but currently it is just a small amount of noise on top of the noise from
preemption.
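
The fix in the diff below gives the server the same barrier/lfence dance
as the client.  Packaged as a helper it would look like this (just a
sketch; the helper name is mine):

XX 	static inline uint64_t
XX 	fenced_rdtsc(void)
XX 	{
XX 		uint64_t tsc;
XX 
XX 		barrier();			/* program order for the asm */
XX 		__asm __volatile("lfence");	/* no rdtsc before the fence */
XX 		tsc = rdtsc();
XX 		barrier();			/* order the store of the result */
XX 		return (tsc);
XX 	}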

After fixing the synchronization:

XX CPU | TSC skew (min/avg/max/stddev)
XX ----+------------------------------
XX   0 |     0     0     0    0.000
XX   1 |    33    62 49161   57.652
XX   2 |   108   169 33678   73.456
XX   3 |   108   171 43053   119.256
XX   4 |   141   169 41289   114.567
XX   5 |   141   169 40035   112.755
XX   6 |   132   186 147099   269.449
XX   7 |   153   183 431526   436.689

Synchronization apparently takes a long time, especially to other cores.
The minimum and average now give the best-case IPC time very accurately.
The average is 20-30 cycles smaller than before, probably because I
fixed the fencing.  The maximum and standard deviation are garbage noise
from preemption.  Preemption should be disabled somehow.
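
Disabling preemption proper is kernel-only (critical_enter()), but running
both threads at realtime priority should remove most of this noise.  A
sketch (untested; the helper name is mine, and it needs root):

XX 	#include <sys/types.h>
XX 	#include <sys/rtprio.h>
XX 	#include <err.h>
XX 
XX 	static void
XX 	go_realtime(void)
XX 	{
XX 		struct rtprio rtp;
XX 
XX 		rtp.type = RTP_PRIO_REALTIME;
XX 		rtp.prio = 0;				/* 0 is the highest */
XX 		if (rtprio(RTP_SET, 0, &rtp) != 0)	/* pid 0 = this process */
XX 			err(1, "rtprio");
XX 	}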

Large drifts and skews would show up as nonsense values for the minimum
IPC time.  Small drifts would soon give large skews.  To measure small
skews, change the CPU of the server to measure the minimum IPC time in
the opposite direction.
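
Assuming the IPC time is symmetric in the two directions, the two minima
separate cleanly (min_ab and min_ba are hypothetical names for the minima
measured with the server on CPU A and on CPU B respectively):

XX 	/*
XX 	 * server on A, client on B:	min_ab = ipc + skew_ba
XX 	 * server on B, client on A:	min_ba = ipc - skew_ba
XX 	 * where skew_ba is B's TSC minus A's TSC at the same instant.
XX 	 */
XX 	int64_t skew_ba = ((int64_t)min_ab - (int64_t)min_ba) / 2;
XX 	int64_t ipc = ((int64_t)min_ab + (int64_t)min_ba) / 2;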

Fixes:

XX --- tscdrift.c	2015-07-10 06:22:36.505493000 +0000
XX +++ w.c	2019-03-05 11:22:22.232341000 +0000
XX @@ -32,6 +32,15 @@
XX  #include <sys/param.h>
XX  #include <sys/cpuset.h>
XX  #include <machine/atomic.h>
XX +/*
XX + * XXX: atomic.h is not used.  Instead we depend on x86 memory ordering and
XX + * do direct assignments to and comparisons of 'gate', and sometimes add
XX + * memory barriers.  The correct atomic ops would do much the same with
XX + * clearer spelling.  The 'lock' prefix is never needed and the barriers are
XX + * only to get program order so as to give acq or rel semantics for either
XX + * the loads, the stores or for buggy unfenced rdtsc's.  Fences also give
XX + * program order, so some of the explicit barriers are redundant.
XX + */
XX  #include <machine/cpu.h>
XX  #include <machine/cpufunc.h>
XX  #include <assert.h>
XX @@ -45,7 +54,7 @@
XX 
XX  #define	barrier()	__asm __volatile("" ::: "memory")
XX 
XX -#define	TESTS		1024
XX +#define	TESTS		1024000
XX 
XX  static volatile int gate;
XX  static volatile uint64_t thread_tsc;
XX @@ -74,12 +83,12 @@
XX  		gate = 1;
XX  		while (gate == 1)
XX  			cpu_spinwait();
XX -		barrier();
XX 
XX +		barrier();
XX  		__asm __volatile("lfence");
XX  		thread_tsc = rdtsc();
XX -
XX  		barrier();
XX +
XX  		gate = 3;
XX  		while (gate == 3)
XX  			cpu_spinwait();

This is the client.  The explicit barriers are confusing, and the blank
lines are in all the wrong places.  All the accesses to 'gate' need
to be in program order.  x86 memory ordering gives this automatically
at the hardware level.  'gate' being volatile gives it at the compiler
level.  Both rdtsc() and storing the result to thread_tsc need to be
in program order.  lfence() in cpufunc.h has a memory clobber which
gives the former, but we use a direct asm and need a barrier() before
it to do the same thing.  Then we need another barrier() after the
assignment to thread_tsc so that the store for this is before the store
to 'gate' (I think gate being volatile doesn't give this).  This also
keeps the rdtsc() in program order (the asm for rdtsc() doesn't have a
memory clobber; I haven't noticed care being taken about this anywhere
else).
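
With the patch applied, the client's loop body reads (comments added here):

XX 		gate = 1;			/* announce arrival */
XX 		while (gate == 1)		/* wait for the gate to open */
XX 			cpu_spinwait();
XX 
XX 		barrier();			/* program order for the asm */
XX 		__asm __volatile("lfence");	/* no early rdtsc */
XX 		thread_tsc = rdtsc();
XX 		barrier();			/* store thread_tsc before 'gate' */
XX 
XX 		gate = 3;			/* publish the sample */
XX 		while (gate == 3)
XX 			cpu_spinwait();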

Summary: only style changes in this section.

XX @@ -139,12 +148,13 @@
XX  		for (j = 0; j < TESTS; j++) {
XX  			while (gate != 1)
XX  				cpu_spinwait();
XX -			gate = 2;
XX -			barrier();

Move the opening of the gate down so that it is not opened until after
the server reads its TSC.

XX 
XX +			barrier();
XX +			__asm __volatile("lfence");

Fencing is not critical here.  Using an early TSC value just gives a larger
reported IPC time.  The barrier is important for getting program order of
rdtsc().

XX  			tsc = rdtsc();
XX -
XX  			barrier();

This barrier is still associated with the TSC read, and the blank line is
moved to reflect this.  Here rdtsc() must occur in program order, but
storing to tsc can be after storing to 'gate'.  The barrier gives ordering
for the store to tsc too.

XX +
XX +			gate = 2;
XX  			while (gate != 3)
XX  				cpu_spinwait();
XX  			gate = 4;
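
So with the patch the server reads its TSC first and only then opens the
gate:

XX 			while (gate != 1)	/* wait for the client */
XX 				cpu_spinwait();
XX 
XX 			barrier();
XX 			__asm __volatile("lfence");
XX 			tsc = rdtsc();		/* read before opening the gate */
XX 			barrier();
XX 
XX 			gate = 2;		/* now the client may read its TSC */
XX 			while (gate != 3)
XX 				cpu_spinwait();
XX 			gate = 4;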

I tried some locked atomic ops on 'gate' and mfence instead of lfence
to try to speed up the IPC.  Nothing helped.  We noticed long ago that
fence instructions tend to be even slower than locked atomic ops for
mutexes, and jhb guessed that this might be because fence instructions
don't do so much to force out stores.
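
E.g., opening the gate with a lock-prefixed instruction instead of a plain
store looks like this (a sketch of the kind of variant tried, not the exact
code; note it folds the wait and the open back together):

XX 	/* Wait for gate == 1 and set gate = 2 in one locked cmpxchg. */
XX 	while (atomic_cmpset_int((volatile u_int *)&gate, 1, 2) == 0)
XX 		cpu_spinwait();

As the comment added at the top of the file says, the 'lock' prefix is
never actually needed here, so this only changes the timing, and it didn't
help.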

Similar IPC is needed for updating timecounters.  This benchmark indicates
that after an update, the change usually won't be visible on other CPUs
for 100+ cycles.  Since updates are rare, this isn't much of a problem.

Similar IPC is needed for comparing timecounters across CPUs.  Any activity
on different CPUs is incomparable without synchronization to establish an
ordering.  Since fences give ordering relative to memory and timecounters
don't use anything except fences and memory order for the generation count
to establish their order, the synchronization for comparing timecounters
(or clock_gettime() at higher levels) must also use memory order.
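
This is visible in the timecounter read loop itself: readers take the
generation with an acquire load, read the timehands, then fence again and
retry if the generation moved under them.  Roughly (simplified from
kern_tc.c):

XX 	struct timehands *th;
XX 	struct bintime bt;
XX 	u_int gen;
XX 
XX 	do {
XX 		th = timehands;
XX 		gen = atomic_load_acq_int(&th->th_generation);
XX 		bt = th->th_offset;
XX 		bintime_addx(&bt, th->th_scale * tc_delta(th));
XX 		atomic_thread_fence_acq();
XX 	} while (gen == 0 || gen != th->th_generation);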

If the synchronization takes over 100 cycles, then smaller TSC skews don't
matter much (they never break monotonicity, and only show up as time differences
varying by 100 or so cycles depending on which CPU measures the start and
end events).  Small differences don't matter at all.  Skews may be caused
by the TSCs actually being out of sync, or hardware only syncing them on
average (hopefully with small jitter) or bugs like missing fences.  Missing
fences don't matter much provided unserialized TSC reads aren't too far
in the past.  E.g., if we had a guarantee of only 10 cycles in the past for
the TSC and 160 cycles for IPCs to other CPUs, then we could omit the fences.
But IPCs to the same core are 100 cycles faster, so the margin is too close
for omitting fences in all cases.

Similarly for imperfect hardware.  Hopefully its skew is in the +-1 cycle
range, but even +-10 isn't a problem if the IPC time is a bit larger than
10 and even +-100 if the IPC time is a bit larger than 100.  And the problem
scales nicely with the distance of the CPUs -- when they are further apart
so that hardware synchronization of their TSCs is more difficult, the IPC
time is large too.

Hmm, that is only with physical IPCs.  Since timecounters use physical
IPCs for everything, they can't work right with virtual synchronization.
Something like ntpd is needed to compare times across even small local
networks.  It does virtual synchronization by compensating for delays.

Bruce

