svn commit: r345359 - in head/sys/cddl/dev/dtrace: amd64 i386

Fri Mar 22 15:48:24 UTC 2019

On Thu, 21 Mar 2019, Mark Johnston wrote:

> On Thu, Mar 21, 2019 at 10:44:49PM +1100, Bruce Evans wrote:
>> On Wed, 20 Mar 2019, Mark Johnston wrote:
>>
>>> On Wed, Mar 20, 2019 at 08:20:18PM -0700, Rodney W. Grimes wrote:
>>>>> Author: markj
>>>>> Date: Thu Mar 21 02:52:22 2019
>>>>> New Revision: 345359
>>>>> URL: https://svnweb.freebsd.org/changeset/base/345359
>>>>>
>>>>> Log:
>>>>>   Don't attempt to measure TSC skew when running as a VM guest.
>>>>>
>>>>>   It simply doesn't work in general since VCPUs may migrate between
>>>>>   physical cores.  The approach used to measure skew also doesn't
>>>>>   make much sense in a VM.
>>
>> I think it simply doesn't work.
>> ...
>> The method is bogus (buggy) out of a VM.
>>
>> XX static void
>> XX dtrace_gethrtime_init(void *arg)
>> XX {
>> XX 	struct pcpu *pc;
>> XX 	cpuset_t map;
>> XX 	int i;
>> XX #endif
>> XX
>> XX 	/* The current CPU is the reference one. */
>> XX 	sched_pin();
>>
>> It is also necessary to disable interrupts in hardware.  Or maybe better
>> and necessary, detect when they occur and discard related samples.  It
>> is impossible to disable NMIs.
>
> Not quite, just carry out the measurement from an NMI handler. :)

:-)

I can break that by single-stepping it in ddb.  That is similar to emulation
wit random large stretching of cycles.

>> I plan to try a variation of this: all CPUs spin reading the TSC and
>> appending the result atomically to a log buffer.  Then analyze the results.
>> With no skew or interrupts, it is expected that the results are almost
>> monotonic.  Contention causes non-monotonicity only occasionally.  The
>> average non-mononicity should be 0.  If there is significant skew, then
>> the order for the sameples form each pair of CPUs will reverse on almost
>> every second reading.  The skew is the adjustment needed to stop this
>> reversal.

I tried this.

The free-running method sort of works, but it is better to serialize
everything.  The result of the serialization is essentially the same as
in the fixed tscdrift, though I used a log buffer index instead of a
gate variable to synchronize.

The method is to calculate the serialization latency in both directions,
with each direction using identically code and hopefully identical CPU
state after warming up caches.  The skew is then the difference of the
latency divided by 2.  Oops, I forgot to divide by 2.  This gives the
following output on i4790K:

XX |   CPUs  |      latij     |      latji     |       skew      |   nr   |
XX |---------+----------------+----------------+-----------------+--------|
XX  0 <->  1    72/143/460/ 7    68/143/460/ 5   -316/  0/416/ 9   524286

latij is latency from CPU i to CPU j.  nr is the number of samples (512K
rdtsc()s on each CPU in the pair, with a few discards for latencies above
999 cycles discarded.  The i4790K was run with interrupts disabled, so I
don't know why there were so many discards).  The 4 values in each of the
3 middle columns are min/avg/max/sdev.

Note: for CPUs on the same core, the latencies are about 120 cycles
smaller and the averag skews are very close to 0.  This and latencies
never being negative show that the method basically works.

XX  0 <->  2   192/263/992/25   192/258/992/16   -724/  5/748/29   524267
XX  0 <->  3   192/263/992/25   188/258/992/15   -724/  5/748/29   524266
XX  0 <->  4   188/255/964/21   188/273/964/42   -724/-18/752/46   524266
XX  0 <->  5   188/255/976/20   188/273/976/42   -724/-18/752/45   524274
XX  0 <->  6   200/281/996/20   200/270/996/14   -728/ 11/752/22   524273
XX  0 <->  7   200/281/964/20   196/270/964/13   -728/ 10/752/22   524276

There does seem to be some skew.  Parts of rest of table collected
manually and repeated here for comparison with the above:

XX  2 <->  0   188/258/968/16   192/263/968/25   -768/ -5/760/29   524273
XX  3 <->  0   188/259/988/15   188/263/988/26   -768/ -5/760/29   524282
XX  4 <->  0   188/271/996/40   188/255/996/20   -768/ 16/764/45   524275
XX  5 <->  0   192/271/952/40   188/255/952/20   -768/ 16/768/45   524280
XX  6 <->  0   196/268/992/16   200/280/992/23   -768/-12/768/27   524276
XX  7 <->  0   200/268/976/15   196/280/976/22   -768/-12/768/27   524283

The signs of the average skews are reversed perfectly when the direction is
reversed, and the values are reversed almost perfectly too.  This wasn't
so clear on freefall's Xeon with interrupts enabled.  Otherwise, the
behaviour on freefall is similar.

XX  1 <->  0    72/143/476/10    72/143/476/ 5   -728/ -0/752/11   524286
XX  1 <->  2   192/263/944/25   192/258/944/15   -728/  5/752/29   524268
XX  1 <->  3   192/263/948/25   192/258/948/15   -728/  5/752/29   524264
XX  1 <->  4   188/255/996/20   188/274/996/42   -740/-18/760/45   524264
XX  1 <->  5   188/255/996/21   192/273/996/42   -768/-18/760/45   524272
XX  1 <->  6   196/280/996/21   196/270/996/14   -768/ 10/760/23   524270
XX  1 <->  7   196/281/952/20   200/270/952/14   -768/ 11/760/23   524275
XX  2 <->  0   188/258/968/16   192/263/968/25   -768/ -5/760/29   524273
XX  2 <->  1   188/258/928/16   192/263/928/25   -768/ -5/760/29   524268
XX  2 <->  3    72/144/504/ 9    72/144/504/ 5   -768/  0/760/10   524286
XX  2 <->  4   184/251/988/17   188/269/988/40   -768/-19/760/43   524275
XX  2 <->  5   184/251/996/17   184/269/996/40   -768/-18/760/43   524271
XX  2 <->  6   196/263/904/17   196/266/904/14   -768/ -3/760/22   524278
XX  2 <->  7   192/264/944/17   200/266/944/14   -768/ -3/760/22   524275
XX  3 <->  0   188/259/988/15   188/263/988/26   -768/ -5/760/29   524282
XX  3 <->  1   192/258/996/16   192/263/996/25   -768/ -5/760/29   524270
XX  3 <->  2    84/143/480/ 8    52/142/480/ 5   -768/  1/760/ 9   524287
XX  3 <->  4   184/251/960/17   188/270/960/40   -768/-19/764/43   524273
XX  3 <->  5   184/251/980/18   188/269/980/40   -768/-19/764/43   524271
XX  3 <->  6   192/264/940/17   196/266/940/14   -768/ -3/764/21   524276
XX  3 <->  7   196/263/996/17   200/266/996/14   -768/ -3/764/21   524277
XX  4 <->  0   188/271/996/40   188/255/996/20   -768/ 16/764/45   524275
XX  4 <->  1   192/272/968/41   192/255/968/20   -768/ 17/764/45   524279
XX  4 <->  2   188/268/996/39   188/251/996/16   -768/ 17/768/41   524276
XX  4 <->  3   188/268/884/38   188/251/884/16   -768/ 16/768/41   524273
XX  4 <->  5    64/144/640/ 8    52/144/640/ 5   -768/ -1/768/10   524286
XX  4 <->  6   188/284/880/42   192/258/880/15   -768/ 26/768/44   524282
XX  4 <->  7   188/284/980/42   184/258/980/15   -768/ 26/768/44   524275

Latencies even seem to be transitive.  According to the above, CPU0 is
16-18 cycles behind CPUs 4-5 and 10-12 cycles ahead of CPUs 6-7.  26 in
the above is consistent with this (CPU 4 is 25 cycles ahead of CPUs 6-7).

XX  5 <->  0   192/271/952/40   188/255/952/20   -768/ 16/768/45   524280
XX  5 <->  1   192/271/988/40   184/255/988/20   -768/ 16/768/45   524261
XX  5 <->  2   188/268/884/39   188/251/884/16   -768/ 17/768/41   524275
XX  5 <->  3   188/269/972/39   188/251/972/16   -768/ 17/768/41   524273
XX  5 <->  4    72/143/512/ 8    52/142/512/ 5   -768/  1/768/10   524287
XX  5 <->  6   188/282/956/41   192/258/956/15   -768/ 24/768/44   524276
XX  5 <->  7   188/283/812/42   192/258/812/15   -768/ 25/768/44   524269
XX  6 <->  0   196/268/992/16   200/280/992/23   -768/-12/768/27   524276
XX  6 <->  1   200/268/984/16   200/280/984/23   -768/-12/768/27   524266
XX  6 <->  2   196/266/864/15   196/265/864/16   -768/  1/768/21   524273
XX  6 <->  3   196/266/940/15   196/266/940/16   -768/  0/768/21   524275
XX  6 <->  4   192/259/964/16   192/287/964/44   -768/-28/768/46   524277
XX  6 <->  5   188/259/992/16   196/287/992/44   -768/-28/768/47   524274
XX  6 <->  7    88/143/732/ 8    52/143/732/ 6   -768/ -0/768/10   524287
XX  7 <->  0   200/268/976/15   196/280/976/22   -768/-12/768/27   524283
XX  7 <->  1   196/268/996/16   196/279/996/22   -768/-11/768/27   524277
XX  7 <->  2   196/266/976/15   196/265/976/16   -768/  1/768/21   524272
XX  7 <->  3   196/266/988/15   196/265/988/16   -768/  0/768/21   524266
XX  7 <->  4   188/259/860/16   196/287/860/44   -768/-28/768/46   524269
XX  7 <->  5   188/259/928/16   200/287/928/44   -768/-28/768/47   524283
XX  7 <->  6    72/144/992/ 8    52/142/992/ 5   -852/  1/768/ 9   524287

> ...
> I don't think it makes much sense to measure skew on systems with an
> invariant TSC.  If the TSC is not invariant, measuring it once at load
> time is likely not enough.  Based on TSC-invariant systems I looked at,
> the measured skew is small enough that the erroneous deltas probably
> don't cause much harm.  In VMs, the deltas are large enough that they
> create noticeable problems.

I agree, but dtrace always does it.  On a Turion2 system with a non-
invariant TSC that is unusable as a timecounter, dtrace init produces:

tsc_skew[0] = 0
tsc_skew[1] = -397099 (decimal)

and the above program produces garbage (NaNs for division of 0 by 0)
due to all latencies being too large to use.  After usingall latencies,
it gives:

|   CPUs  |      latij     |      latji     |       skew      |   nr   |
|---------+----------------+----------------+-----------------+--------|
  0 <->  1   -1259586/-1259530/1262126/123   1260084/1260136/1262126/126   -2521678/-2519667/  2/130   524287
  1 <->  0   1260089/1260144/-1257494/122   -1259603/-1259538/-1257494/126   -2521678/2519683/2521496/131   524287

The large latencies and skews don't change noticeably on repeated runs.
They are about 1260k.  This is very different from the 397k seen by
dtrace init.

The above output was produced by:

XX /** pthread and cpu management parts copied from tscdrift.c. */
XX 
XX /*-
XX  * Copyright (c) 2014 Advanced Computing Technologies LLC
XX  * Written by: John H. Baldwin <jhb at FreeBSD.org>
XX  * All rights reserved.
XX  *
XX  * Redistribution and use in source and binary forms, with or without
XX  * modification, are permitted provided that the following conditions
XX  * are met:
XX  * 1. Redistributions of source code must retain the above copyright
XX  *    notice, this list of conditions and the following disclaimer.
XX  * 2. Redistributions in binary form must reproduce the above copyright
XX  *    notice, this list of conditions and the following disclaimer in the
XX  *    documentation and/or other materials provided with the distribution.
XX  *
XX  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
XX  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
XX  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
XX  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
XX  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
XX  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
XX  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
XX  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
XX  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
XX  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
XX  * SUCH DAMAGE.
XX  */
XX 
XX #include <sys/param.h>
XX #include <sys/cpuset.h>
XX 
XX #include <machine/atomic.h>
XX #include <machine/cpufunc.h>
XX 
XX #include <assert.h>
XX #include <err.h>
XX #include <fcntl.h>
XX #include <inttypes.h>
XX #include <math.h>
XX #include <pthread.h>
XX #include <stdio.h>
XX #include <stdlib.h>
XX #include <string.h>
XX 
XX #define LOGSIZE	0x100000
XX 
XX static volatile u_int nready;
XX static int ncpu;
XX static volatile uint64_t tsc_log[LOGSIZE];
XX static volatile uint tsc_loghead;
XX 
XX /* Bind the current thread to the specified CPU. */
XX static void
XX bind_cpu(int cpu)
XX {
XX 	cpuset_t set;
XX 
XX 	CPU_ZERO(&set);
XX 	CPU_SET(cpu, &set);
XX 	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1, sizeof(set),
XX 	    &set) < 0)
XX 		err(1, "cpuset_setaffinity(%d)", cpu);
XX }
XX 
XX static void *
XX action(void *arg)
XX {
XX 	int cpu, fd, head, master;
XX 
XX 	cpu = (intptr_t)arg;
XX 	master = cpu & 0x80000000;
XX 	cpu &= 0x7fffffff;
XX 	bind_cpu(cpu);
XX 
XX 	fd = open("/dev/io", O_RDONLY);
XX 
XX 	/* Attempt to fault in everything before disabling interrupts. */
XX 	memset((void *)tsc_log, 0, sizeof(tsc_log));
XX 
XX 	/* Start logging on all CPUs at almost the same time. */
XX 	atomic_add_int(&nready, 1);
XX 	while (atomic_load_int(&nready) < 2)
XX 		;
XX 
XX 	/* If possible, run throughout with interrupts disabled. */
XX 	if (fd != -1)
XX 		disable_intr();
XX 
XX 	head = master ? 0 : 1;
XX 	for (;;) {
XX 		if (head >= LOGSIZE)
XX 			break;
XX 		while (atomic_load_acq_int(&tsc_loghead) != head)
XX 			;
XX 		__asm __volatile("lfence");
XX 		tsc_log[head] = rdtsc();
XX 		atomic_store_rel_int(&tsc_loghead, head + 1);
XX 		head += 2;
XX 	}
XX 	if (fd != -1)
XX 		enable_intr();
XX 	return (NULL);
XX }
XX 
XX int
XX main(int argc, char **argv)
XX {
XX 	cpuset_t all_cpus;
XX 	pthread_t *children;
XX 	double avlatij, avlatji, avskew, sdlatij, sdlatji, sdskew;
XX 	double sumlatijsq, sumlatjisq, sumskewsq;
XX 	int *cpus;
XX 	intmax_t latij, latji, maxlatij, maxlatji, minlatij, minlatji;
XX 	intmax_t sumlatij, sumlatji;
XX 	intmax_t skew, maxskew, minskew;
XX 	int error, i, j, nr, r;
XX 
XX 	/*
XX 	 * Find all the CPUs this program is eligible to run on and use
XX 	 * this as our global set.  This means you can use cpuset to
XX 	 * restrict this program to only run on a subset of CPUs.
XX 	 */
XX 	if (cpuset_getaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
XX 	    sizeof(all_cpus), &all_cpus) < 0)
XX 		err(1, "cpuset_getaffinity");
XX 	for (ncpu = 0, i = 0; i < CPU_SETSIZE; i++)
XX 		if (CPU_ISSET(i, &all_cpus))
XX 			ncpu++;
XX 	cpus = calloc(ncpu, sizeof(*cpus));
XX 	children = calloc(ncpu, sizeof(*children));
XX 	for (i = 0, j = 0; i < CPU_SETSIZE; i++)
XX 		if (CPU_ISSET(i, &all_cpus)) {
XX 			assert(j < ncpu);
XX 			cpus[j] = i;
XX 			j++;
XX 		}
XX 
XX 	printf(
XX "|   CPUs  |      latij     |      latji     |       skew      |   nr   |\n");
XX 	printf(
XX "|---------+----------------+----------------+-----------------+--------|\n");
XX 	for (i = 0; i < ncpu; i++) {
XX 		for (j = 0; j < ncpu; j++) {
XX 			if (j == i)
XX 				continue;
XX 			atomic_store_rel_int(&nready, 0);
XX 			atomic_store_rel_int(&tsc_loghead, 0);
XX 			error = pthread_create(&children[j], NULL, action,
XX 			    (void *)(intptr_t)cpus[j]);
XX 			if (error)
XX 				errc(1, error, "pthread_create");
XX 			action((void *)(intptr_t)(0x80000000 | cpus[i]));
XX 			error = pthread_join(children[j], NULL);
XX 			if (error)
XX 				errc(1, error, "pthread_join");
XX 
XX 			nr = 0;
XX 			maxlatij = maxlatji = INTMAX_MIN;
XX 			minlatij = minlatji = INTMAX_MAX;
XX 			sumlatij = sumlatji = 0;
XX 			sumlatijsq = sumlatjisq = sumskewsq = 0;
XX 			for (r = 0; r < LOGSIZE - 2; r += 2) {
XX 				latij = tsc_log[r + 1] - tsc_log[r];
XX 				latji = tsc_log[r + 2] - tsc_log[r + 1];
XX #if 1 /* XXX discard samples which appear to have been preempted */
XX 				if (latij > 999 || latji > 999)
XX 					continue;
XX #endif
XX 				if (maxlatij < latij)
XX 					maxlatij = latij;
XX 				if (minlatij > latij)
XX 					minlatij = latij;
XX 				if (maxlatji < latji)
XX 					maxlatji = latji;
XX 				if (minlatji > latji)
XX 					minlatji = latji;
XX 				sumlatij += latij;
XX 				sumlatji += latji;
XX 				sumlatijsq += latij * latij;
XX 				sumlatjisq += latji * latji;
XX 				skew = latij - latji;
XX 				if (maxskew < skew)
XX 					maxskew = skew;
XX 				if (minskew > skew)
XX 					minskew = skew;
XX 				sumskewsq += skew * skew;
XX 				nr++;
XX 			}
XX 			avlatij = (double)sumlatij / nr;
XX 			sdlatij = sqrt(sumlatijsq / nr - avlatij * avlatij);
XX 			avlatji = (double)sumlatji / nr;
XX 			sdlatji = sqrt(sumlatjisq / nr - avlatji * avlatji);
XX 			avskew = avlatij - avlatji;
XX 			sdskew = sqrt(sumskewsq / nr - avskew * avskew);
XX 			printf(
XX "%2d <-> %2d   %3jd/%3.0f/%3jd/%2.0f   %3jd/%3.0f/%3jd/%2.0f   %4jd/%3.0f/%3jd/%2.0f   %6d\n",
XX 			    cpus[i], cpus[j],
XX 			    minlatij, avlatij, maxlatji, sdlatij,
XX 			    minlatji, avlatji, maxlatji, sdlatji,
XX 			    minskew, avskew, maxskew, sdskew, nr);
XX 		}
XX 	}
XX }

Unfortunately, the measured skews change a lot.  Some runs gave CPUs 2-3
at skew average distance -105 from CPU 0, but CPUs 4-7 at distance +-0,
And that is with 512K samples.  I expected averaging over a few hundred
samples would be enough.

Bruce