svn commit: r345359 - in head/sys/cddl/dev/dtrace: amd64 i386

Bruce Evans brde at optusnet.com.au
Thu Mar 21 11:45:02 UTC 2019


On Wed, 20 Mar 2019, Mark Johnston wrote:

> On Wed, Mar 20, 2019 at 08:20:18PM -0700, Rodney W. Grimes wrote:
>>> Author: markj
>>> Date: Thu Mar 21 02:52:22 2019
>>> New Revision: 345359
>>> URL: https://svnweb.freebsd.org/changeset/base/345359
>>>
>>> Log:
>>>   Don't attempt to measure TSC skew when running as a VM guest.
>>>
>>>   It simply doesn't work in general since VCPUs may migrate between
>>>   physical cores.  The approach used to measure skew also doesn't
>>>   make much sense in a VM.

I think it simply doesn't work.

jhb's src/tools/tools/tscdrift/tscdrift.c also simply doesn't work for
measuring drift.  It mostly measures IPC latency and jitter from
preemption in attempts to measure skew.  It might be possible to
determine the current skew using a statistical analysis of the latency
and a few more measurements (measure latency between each CPU in a
pair instead of only from CPU0 to other CPUs), but it would be better
to do more direct measurements.
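
For illustration, here is a userland sketch of such a direct pairwise
measurement (all names and the CPU numbers are mine, not tscdrift.c):
2 pinned threads ping-pong through a shared gate, and the receive-side
minus send-side TSC difference gives IPC latency plus skew for that
particular pair.  Swapping the 2 roles gives latency minus skew, so half
the difference of the 2 averages estimates the skew:

     #include <sys/param.h>
     #include <sys/cpuset.h>
     #include <pthread.h>
     #include <stdatomic.h>
     #include <stdint.h>
     #include <stdio.h>

     #define NSAMP	10000

     static _Atomic int gate, ack;
     static uint64_t send_tsc;

     static inline uint64_t
     rdtsc_lfenced(void)	/* lfence is right for Intel; AMD wants mfence */
     {
             uint32_t lo, hi;

             __asm __volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi));
             return ((uint64_t)hi << 32 | lo);
     }

     static void
     pin_to(int cpu)
     {
             cpuset_t set;

             CPU_ZERO(&set);
             CPU_SET(cpu, &set);
             (void)cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                 sizeof(set), &set);
     }

     static void *
     receiver(void *arg)		/* runs on CPU 3 */
     {
             int64_t d, min;
             double sum;
             int i;

             (void)arg;
             pin_to(3);
             min = INT64_MAX;
             sum = 0;
             for (i = 0; i < NSAMP; i++) {
                     while (atomic_load_explicit(&gate,
                         memory_order_acquire) != i + 1)
                             ;		/* spin until the sender posts */
                     d = (int64_t)(rdtsc_lfenced() - send_tsc);
                     atomic_store_explicit(&ack, i + 1, memory_order_release);
                     if (i == 0)
                             continue;	/* first sample has startup cost */
                     if (d < min)
                             min = d;
                     sum += d;
             }
             printf("min %jd avg %.1f cycles (latency + skew)\n",
                 (intmax_t)min, sum / (NSAMP - 1));
             return (NULL);
     }

     int
     main(void)
     {
             pthread_t t;
             int i;

             pin_to(2);			/* the sender runs on CPU 2 */
             pthread_create(&t, NULL, receiver, NULL);
             for (i = 0; i < NSAMP; i++) {
                     send_tsc = rdtsc_lfenced();
                     atomic_store_explicit(&gate, i + 1, memory_order_release);
                     while (atomic_load_explicit(&ack,
                         memory_order_acquire) != i + 1)
                             ;		/* wait for the receiver's ack */
             }
             pthread_join(t, NULL);
             return (0);
     }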

>> "May" is the important aspect here, and to my knowledge both
>> bhyve and Vmware provide a way to pin cpus there should be
>> a way for us to turn this back on if it is desired.  Sticking
>> it under the big knob vm_guest is probably not the most flexiable
>> solution.
>>
>> Could we please have someway to force this back on?
>
> Even with pinning the skew computation is bogus in a VM.  On an idle
> host running bhyve with pinned vCPUs it gives offsets that are several
> orders of magnitude larger than on the host.  I would prefer to see a
> specific reason for wanting the old behaviour before adding a knob.

The method is bogus (buggy) even outside of a VM.

XX static void
XX dtrace_gethrtime_init(void *arg)
XX {
XX 	struct pcpu *pc;
XX 	cpuset_t map;
XX 	int i;
XX #endif
XX 
XX 	/* The current CPU is the reference one. */
XX 	sched_pin();

It is also necessary to disable interrupts in hardware.  Or, maybe better
and necessary anyway, detect when they occur and discard the related
samples.  It is impossible to disable NMIs.  Then there are SMIs.  Not to
mention virtualization, which at best gives delays a bit like SMIs, but longer.

Since this is init code, perhaps interrupts are already masked in hardware,
but to detect drift in the skews, it is necessary to run a routine like this
quite often.
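
One hedged way to do the detection, whether in this init code or in a
periodic recalibration: bracket each sample with 2 local fenced TSC reads
and reject it when the bracket is suspiciously wide, on the assumption
that an NMI, SMI or hypervisor exit landed inside it.  A sketch, where
take_one_skew_sample() is a placeholder for whatever produces one sample,
rdtsc_lfenced() is as in the sketch above, and the threshold is a guess
that would need tuning:

     #define DISTURB_THRESHOLD	2000	/* cycles; machine-dependent guess */

     static int
     sample_undisturbed(uint64_t *sample)
     {
             uint64_t before, after;

             before = rdtsc_lfenced();
             *sample = take_one_skew_sample();	/* placeholder */
             after = rdtsc_lfenced();
             return (after - before < DISTURB_THRESHOLD);
     }

Callers would retry a bounded number of times when this returns 0.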

XX 	tsc_skew[curcpu] = 0;
XX 	CPU_FOREACH(i) {
XX 		if (i == curcpu)
XX 			continue;
XX 
XX 		pc = pcpu_find(i);
XX 		CPU_SETOF(PCPU_GET(cpuid), &map);
XX 		CPU_SET(pc->pc_cpuid, &map);
XX 
XX 		smp_rendezvous_cpus(map, NULL,
XX 		    dtrace_gethrtime_init_cpu,
XX 		    smp_no_rendezvous_barrier, (void *)(uintptr_t) i);

Some synchronization is necessary, but this is very heavyweight.  Even
the lightweight synchronization in tscdrift.c has an average latency of
about 150-200 cycles on freefall's Xeon, after fixing bugs in it:

 CPU | TSC skew (min/avg/max/stddev)
-----+------------------------------
   0 |     0     0      0     0.000
   1 |    36    65  47913   101.227    # lower latency on same core
   2 |   144   180  52977   139.862
   3 |   147   180  34077   113.543
   4 |   135   181  49335   228.104
   5 |   123   175  33300    91.703
   6 |   165   188 129825   166.999
   7 |   165   189 248010   278.321

tscdrift measures from CPU0 to CPU_n for each n.  After fixing it to
read the TSC on CPU0 before doing an IPC to allow reading the TSC on
CPU_n, this measures mainly the IPC latency.  Actually reading the TSC
takes about 24 cycles.  The skew appears to be 0 for all CPUs.  Anyway,
it is so much smaller than the IPC latency that it can be ignored, provided
it doesn't drift to become non-small, and applications don't try to be
too smart about correcting for the latency.

The unfixed tscdrift allows the other CPU to read the TSC before reading
it itself.  This is more like the above.  It still ends up measuring mainly
IPC latency and other bugs in the program, since the read of the TSC on
the same CPU occurs relatively instantly without the other bugs, and out
of order (thus possibly earlier than instantly) with the other bugs.  The
IPC delay is about the same as above, but it takes more complicated
statistics to prove this.

XX 
XX 		tsc_skew[i] = tgt_cpu_tsc - hst_cpu_tsc;

This has even more complicated ordering.  The IPC delay depends on the
internals of smp_rendezvous_cpus().

It should work better to schedule comparisons of 2 CPUs using a third CPU
(this is possible except with precisely 2 CPUs).  2 CPUs sit waiting for
permission to read their TSC.  Another CPU sets a variable to grant
permission.
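
A sketch of that, reusing rdtsc_lfenced() and pin_to() from the sketch
above (plus <unistd.h>; again the names and CPU numbers are mine): 2
waiter threads spin on a gate that a thread on a third CPU opens.  Their
fenced TSC reads then differ by the skew plus whatever asymmetry there is
in the propagation of the gate store to the 2 waiters, which should be
small and roughly symmetric for topologically similar CPUs:

     static _Atomic int go;
     static uint64_t t[2];

     struct warg {
             int	cpu;
             int	slot;
     };

     static void *
     waiter(void *arg)
     {
             struct warg *w = arg;

             pin_to(w->cpu);
             while (atomic_load_explicit(&go, memory_order_acquire) == 0)
                     ;				/* both CPUs spin here... */
             t[w->slot] = rdtsc_lfenced();	/* ...and read "simultaneously" */
             return (NULL);
     }

     int
     main(void)
     {
             struct warg a = { 2, 0 }, b = { 3, 1 };
             pthread_t ta, tb;

             pin_to(1);		/* the third CPU grants permission */
             pthread_create(&ta, NULL, waiter, &a);
             pthread_create(&tb, NULL, waiter, &b);
             usleep(10000);		/* crude: let both waiters reach the spin */
             atomic_store_explicit(&go, 1, memory_order_release);
             pthread_join(ta, NULL);
             pthread_join(tb, NULL);
             printf("cpu2 - cpu3: %jd cycles\n",
                 (intmax_t)(int64_t)(t[0] - t[1]));
             return (0);
     }

Repeating this many times and looking at the distribution of the
difference is what would separate the skew from the residual asymmetry.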

I plan to try a variation of this: all CPUs spin reading the TSC and
appending the result atomically to a log buffer.  Then analyze the results.
With no skew or interrupts, it is expected that the results are almost
monotonic.  Contention causes non-monotonicity only occasionally.  The
average non-monotonicity should be 0.  If there is significant skew, then
the order of the samples from each pair of CPUs will reverse on almost
every second reading.  The skew is the adjustment needed to stop this
reversal.
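
A sketch of the logging side of that experiment, again reusing the
helpers above; thread creation, pinning and joining are as in the
earlier sketches, and the analysis shown only counts cross-CPU
reversals rather than solving for the skews:

     #define NLOG	(1 << 20)

     static struct logent {
             uint64_t	tsc;
             int		cpu;
     } logbuf[NLOG];
     static _Atomic int logidx;

     static void *
     logger(void *arg)
     {
             int cpu = *(int *)arg;
             int idx;

             pin_to(cpu);
             for (;;) {
                     idx = atomic_fetch_add_explicit(&logidx, 1,
                         memory_order_relaxed);
                     if (idx >= NLOG)
                             break;
                     /*
                      * The slot reservation and the TSC read are not atomic
                      * together; that only adds to the occasional
                      * non-monotonicity from contention.
                      */
                     logbuf[idx].tsc = rdtsc_lfenced();
                     logbuf[idx].cpu = cpu;
             }
             return (NULL);
     }

     /* After joining all of the logger threads: */
     static void
     analyze(void)
     {
             int i, nrev = 0;

             for (i = 1; i < NLOG; i++)
                     if (logbuf[i].cpu != logbuf[i - 1].cpu &&
                         logbuf[i].tsc < logbuf[i - 1].tsc)
                             nrev++;
             printf("%d cross-CPU reversals in %d entries\n", nrev, NLOG);
     }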

Logs of the TSC for a single CPU show that the granularity is about 18
on freefall's Xeon.  I don't know yet if this is more closely related to
the clock rate or the rdtsc latency.  The TSC frequency is 3.3GHz.  I
recently learned that a common implementation uses a shared clock of
frequency only 100 MHz.  With no interpolation of this, the granularity
is 33 cycles.  rdtsc itself takes about 18 cycles, so not much is gained
from any interpolation, but some seems to be done.

On a Haswell i7-4790K at 4 GHz, IIRC rdtsc seemed to take 24 cycles.  I haven't
done the test to see if non-multiples of 24 are returned.  A 100 MHz clock
would have a granularity of 40 cycles, so there is apparently some
interpolation.

A commit log message by mav@ says that some Skylakes never return odd values
for rdtsc, and some appear to have a granularity of about 186 (?) cycles.
freefall's Xeon does return odd values for rdtsc.
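
A quick way to look at both things (the apparent granularity and the odd
values) on a given machine, with the helpers from the first sketch; note
that the minimum back-to-back delta conflates rdtsc latency with counter
granularity, which is exactly the ambiguity mentioned above:

     #define NREAD	100000

     int
     main(void)
     {
             static uint64_t v[NREAD];
             uint64_t d, mindelta;
             int i, odd;

             pin_to(0);
             for (i = 0; i < NREAD; i++)
                     v[i] = rdtsc_lfenced();
             mindelta = UINT64_MAX;
             odd = 0;
             for (i = 1; i < NREAD; i++) {
                     d = v[i] - v[i - 1];
                     if (d != 0 && d < mindelta)
                             mindelta = d;
                     odd |= (int)(v[i] & 1);
             }
             printf("min nonzero delta %ju cycles, odd values seen: %s\n",
                 (uintmax_t)mindelta, odd ? "yes" : "no");
             return (0);
     }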

XX 	}
XX 	sched_unpin();
XX }
XX ...
XX /*
XX  * DTrace needs a high resolution time function which can
XX  * be called from a probe context and guaranteed not to have
XX  * instrumented with probes itself.
XX  *
XX  * Returns nanoseconds since boot.
XX  */

dtrace does too much hand-rolling here, but it has to do something
to handle the 2 problems mentioned in the comment.  Other subsystems
handle this worse (*).

But it shouldn't try harder than the TSC timecounters to handle skew.
The TSC timecounter ignores the problem of skew, but does other things
more carefully.

XX uint64_t
XX dtrace_gethrtime(void)
XX {
XX 	uint64_t tsc;
XX 	uint32_t lo, hi;
XX 	register_t rflags;
XX 
XX 	/*
XX 	 * We split TSC value into lower and higher 32-bit halves and separately
XX 	 * scale them with nsec_scale, then we scale them down by 2^28
XX 	 * (see nsec_scale calculations) taking into account 32-bit shift of
XX 	 * the higher half and finally add.
XX 	 */
XX 	rflags = intr_disable();
XX 	tsc = rdtsc() - tsc_skew[curcpu];
XX 	intr_restore(rflags);

This is missing the fencing needed to serialize rdtsc().  Unfixed tscdrift
is missing fencing in one place.  Not serializing rdtsc() gives errors of
up to about 80 cycles in variations of tscdrift.  These errors are larger
in the place where the fencing is _not_ missing, due to the details of
the synchronization:

While spinning waiting for the "gate" to open, without fencing rdtsc() is
executed out of order up to about 80 cycles early.  But the "master" usually
doesn't spin for so long, so its unfenced rdtsc (in the unfixed version) is
usually not the full 80 cycles early.  The "slave" usually spins for more
than 80 cycles, but its rdtsc is fenced so it is never early.  This
gives a bias towards increasing the measured difference slave_tsc -
master_tsc.  The average difference is over 200 instead of about 180.

Disabling interrupts in the above seems to be to prevent curcpu changing.
The comment says nothing about this, or about anything in the block of
code that it is attached to.

XX 
XX 	lo = tsc;
XX 	hi = tsc >> 32;
XX 	return (((lo * nsec_scale) >> SCALE_SHIFT) +
XX 	    ((hi * nsec_scale) << (32 - SCALE_SHIFT)));

The comment applies to this block of code, from which it is separated.

XX }

I think subtracting the skew in the above is worse than useless.  It
works OK only when the skew is 0.  Suppose that the actual skew is 0.
It must be 0, else more important things like the TSC timecounter are
broken.  It is surprising that the TSC timecounter works so well.  Since
it does seem to work, the skew must be 0 initially and remain 0 on
most systems, except probably for jitter in the 1/(100 MHz) range from
the common clock and from distributing this clock over large distances.

But by attempting to calculate the skew, we risk calculating only IPC
delays.  The rendezvous function wasn't designed for getting the
precise timing needed to minimize these delays or even to know what they
are.  I looked at it a bit.  Indeed, it has a master-slave relationship
that cannot work as needed:

XX 	/* Pass rendezvous parameters via global variables. */
XX 	smp_rv_ncpus = ncpus;
XX 	smp_rv_setup_func = setup_func;
XX 	smp_rv_action_func = action_func;
XX 	smp_rv_teardown_func = teardown_func;
XX 	smp_rv_func_arg = arg;
XX 	smp_rv_waiters[1] = 0;
XX 	smp_rv_waiters[2] = 0;
XX 	smp_rv_waiters[3] = 0;
XX 	atomic_store_rel_int(&smp_rv_waiters[0], 0);

The slaves are going to have a lot of latency from just parsing this.

XX 
XX 	/*
XX 	 * Signal other processors, which will enter the IPI with
XX 	 * interrupts off.
XX 	 */
XX 	curcpumap = CPU_ISSET(curcpu, &map);
XX 	CPU_CLR(curcpu, &map);
XX 	ipi_selected(map, IPI_RENDEZVOUS);

IPIs give large, variable latency.

XX 	/* Check if the current CPU is in the map */
XX 	if (curcpumap != 0)
XX 		smp_rendezvous_action();

We just sent IPIs to the other CPUs.  They may have executed the
action function already, but more likely they have barely seen the
IPI yet (it is not even an NMI).  Here we execute the action function
for curcpu.  It just reads the TSC for the current CPU.

At least the execution is complicated enough that the rdtsc in the
action function is hard to execute far out of order, so it doesn't
really need fencing.

XX [...]

No problem waiting for the other CPUs.  The problem is getting them
to call the action function with tiny or at least known delays.

I think this ends up calculating IPC delays much like tscdrift, but
larger due to the very heavyweight IPI method (errors in microseconds
instead of only 180/3.3 ~ 55 nanoseconds?).  Other CPUs on the same
core are closer, so the errors from the IPC delays for them are smaller,
but probably the IPI delays dominate the errors.

So attempting to calculate skews seems to turn skews of 0 into large
errors:
- no error for the reference CPU
- an error of approx. -(IPI_delay + memory_IPC_delay + {1 or 2} *
   rdtsc_latency + rdtsc_nonserialization_errors),
where memory_IPC_delay is much larger for CPUs not on the same core
as the "master".

Such enormous errors might not be too bad.  Times read on different
CPUs are just incomparable without establishing an order based on
something like reference events involving a locked variable.  IPCs
for establishing such an order take about as long as the memory_IPC_delay
that went into miscalculating the skews.

(*) Other subsystems with worse timestamps:
- KTR uses KTR_TIME which defaults to get_cyclecount().  get_cyclecount()
   is by definition unusable for timestamps.  It was intended to work like
   an x86 TSC, back when the x86 TSC actually was a cycle count of CPU cycles,
   with possibly-variable frequency and the hardware not trying very hard to
   make it comparable across CPUs.  Since then the TSC has been decoupled from
   CPU cycles, so it actually counts time cycles like its name indicates.
   get_cyclecount() is still documented as counting CPU cycles and not
   guaranteeing anything except per-CPU monotonicity.

   On x86, get_cyclecount() usually reduces to rdtsc() (with no fencing but
   a layer or 2 of function calls).  But it can reduce to a timecounter, so
   is unusable in KTR for another reason -- that KTR needs to be able to
   timestamp low-level things like timecounters.

   Since get_cyclecount() can be anything and might have a variable frequency,
   its current average frequency is unknown.  KTR provides no conversion
   functions to help the user interpret KTR logs with timestamps.

- kern_tslog.c:tslog() uses get_cyclecount() without even a macro for
   replacing it by a usable counter, and spells its variable 'tsc'.

- get_cyclecount() sometimes uses a timecounter

- get_cyclecount() is also misused in:
   - dev/ocs_fc/ocs_utils.c (like kern_tslog.c)
   - dev/hwpmc/hwpmc_mod.c (fallback from rdtsc*() for non-x86; also uses
     rdtscp() if available, and is missing fencing if only rdtsc() is
     available)
   - dev/de/if_devar.h (it wants rdtsc() but that is even less portable.
     nanouptime() is adequate for NIC drivers).
   - dev/random/* (this was the reason for the get_cyclecount() mistake.
     Randomness wants the opposite of a monotonic counter, but there is
     some noise in the low bits of x86 TSCs)
   - kern/subr_bus.c (like dev/random but more justified)
   - netinet/sctp* (like dev/de)

- DELAY() is used at low levels where it is assumed to be safe.  But sometimes
   it uses a timecounter.

- cpu_ticks() is sometimes used instead of misusing get_cyclecount().  This
   is better.  On x86, cpu_ticks() is the same as rdtsc() if the CPU supports
   rdtsc(), down to not fencing it, but with only amd64 inlining
   get_cyclecount() == rdtsc() and neither x86 arch inlining cpu_ticks() as
   rdtsc().

   cpu_ticks() has frequency support.  The main problems with it are that the
   ticks can have a variable rate and no subsystem including the cpu_ticks()
   subsystem "integrates" this rate to get consistent times in seconds, and
   cpu_ticks() may also be a timecounter so it may be unsafe to use in low-
   level contexts.

   The following subsystems use cpu_ticks() instead of a timecounter or a
   hack for timestamps:
   - contrib/ena-com
   - dev/acpi/acpi_cpu.c.  An important use
   - i386/include/cpu.h.  This implements get_cyclecount() as cpu_ticks(),
     since this is more portable than using rdtsc().
   - schedulers use cpu_ticks() since timecounters were too slow when they
     were the i8254.  Now even an HPET is still too slow, and only newer
     x86 can depend on having timecounters that are not too slow.  Schedulers
     still use stathz ticks at a low frequency for actual scheduling since
     all hardware timers were too slow and many still are.

Other complications for reading TSCs:

- rdtscp is not very useful since it is only slightly better than
   lfence; rdtsc on Intel CPUs and mfence; rdtsc on AMD CPUs.  In theory,
   it could be significantly better since it doesn't use global fences.
   It also reads a CPU id atomically with reading the TSC.  This could
   be used for avoiding the interrupt masking in the above skew
   adjustment (see the sketch after this list).

- rdtscp and fences before rdtsc only give "release semantics".
   Another fence may be needed after rdtsc* to give "acquire"
   semantics -- to prevent moving stores before rdtsc*.  I think
   x86 strong memory ordering works just as well here.  E.g.:

     __asm __volatile("lfence");
     volatile uint64_t tsc = rdtsc();
#if 0
     /*
      * Not needed since tsc is volatile to force the memory order of
      * storing to tsc before storing to gate.  Of course, rdtsc() must
      * be ordered before storing its result, so this prevents moving
      * the store to gate before rdtsc().
      */
     __asm __volatile("sfence");
#endif
     atomic_store_int(&gate, OPEN);
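
Returning to the first point, a sketch of using rdtscp that way.  It
assumes the kernel has loaded each CPU's id into IA32_TSC_AUX, which is
what makes the %ecx result a CPU id; the helper name and the usage
fragment are mine, not existing code:

     static inline uint64_t
     rdtscp_cpu(uint32_t *cpup)
     {
             uint32_t lo, hi, aux;

             /* rdtscp returns IA32_TSC_AUX in %ecx atomically with the TSC. */
             __asm __volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
             *cpup = aux;
             return ((uint64_t)hi << 32 | lo);
     }

     /*
      * The skew subtraction then doesn't need interrupts masked: even if
      * the thread migrates right after the read, the tsc value and the
      * cpu id used to index tsc_skew[] still belong together.
      */
     uint32_t cpu;
     uint64_t tsc;

     tsc = rdtscp_cpu(&cpu) - tsc_skew[cpu];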

For fixing simplistic timing code of the form:

     start = rdtsc();
     do_work();
     finish = rdtsc();
     diff = finish - start;

old references say to use cpuid() to serialize.  cpuid() is slow.  cpuid()
before the first rdtsc() doesn't mess up the result for a simple test like
this, but for general use for timestamps it is just slower so we should
use a fence.  However for timestamps it often doesn't matter if the start
timestamp is early so we might be able to avoid the fence too.

Then we need serialization something like the above to prevent the second
rdtsc() being moved into do_work().  A complete old method using cpuid()
is:

     cpuid();
     start = rdtsc();
     sfence();			/* maybe the wrong fence type */
     do_work();
     finish = rdtscp();
     sfence();			/* maybe the wrong fence type */
     cpuid();			/* just to be safe */
     diff = finish - start;

and I think this can be improved to:

     lfence();			/* mfence() on AMD */
     start = rdtsc();		/* volatile start for ordering */
     do_work();
     lfence();			/* perhaps more portable than rdtscp() */
     finish = rdtsc();		/* volatile finish for ordering */
     diff = finish - start;

and this should also make the acquire and release ordering more explicit.
It assumes no TSC skew.

However, if we only want an upper bound for the time, then we can optimize
this by omitting some fences.  Omitting the first fence only gives an
earlier start and thus a larger diff.  The first volatile, or a fence after
the first rdtsc() seems to be necessary to prevent the first rdtsc() being
moved after loads and stores in do_work().  After do_work(), the reverse
applies -- omitting the second volatile only gives a later finish, while
the second fence is needed to prevent the second rdtsc() being moved too
early.
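
So the upper-bound variant would be something like:

     start = rdtsc();		/* no fence before: at worst early, growing diff */
     lfence();			/* keep: rdtsc() must not sink into do_work() */
     do_work();
     lfence();			/* keep: the next rdtsc() must not move earlier */
     finish = rdtsc();		/* no ordering after: at worst late, growing diff */
     diff = finish - start;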

For free-running timestamps, ordering like the above is neither necessary
nor sufficient.  The simplistic timing code is now:

     clock_gettime(CLOCK_MONOTONIC, &start);
     do_work();
     clock_gettime(CLOCK_MONOTONIC, &finish);
     diff = finish - start;	/* actually timespecsub() */

and this needs to work without the user writing any ordering directives.
"to work" doesn't seem to mean much in POSIX.  Despite clock_gettime()
being invented for POSIX.1b realtime extensions and these not becoming
very common before 2001, the specifiers apparently didn't dream of
relativistic causality problems for times.  I think all that is required
is for CLOCK_MONOTONIC to give monotonic times.  A useful implementation
should also guarantee program order for code like the above despite it
not having any explicit ordering.  This is not easy when the CPUs running
the code are separated by many light-seconds and the clock can measure
light-seconds.

The above is in a single thread.  Suppose this thread runs on only 1
CPU.  Then skews don't matter, and even fences aren't needed to get
monotonicity.  Sprinkling fences (both load and store) for each
rdtsc() also gives ordering for do_work().

Suppose the critical rdtsc()s in this thread run on different CPUs.
Then skews matter.  Uncorrected or garbage skews larger than the
do_work() plus context switch time give negative or more than
doubled differences.  Since this problem is rarely noticed, the
skews must be usually not that large.  I think the garbage skews
in dtrace are smaller than normal context switch time since IPIs
are faster than switches (except for VM cases).

Errors from not using fences are probably small compared with context
switch times, thus never cause negative differences.  It is necessary
to have fences on both sides of rdtsc() to ensure ordering across CPUs.
Otherwise, in theory the rdtsc() for 'start' might be so late and the
rdtsc() for 'finish' might be so early that finish < start.  For 3
timestamps instead of 2, the one in the middle must have fences on
both sides; thus for a general API, all cases must have fences on
both sides.  In practice, context switches won't be done without a lot
of stores which will order everything without explicit fences.

Similarly for comparing times across threads, except it is skews larger
than the IPC time that cause negative differences.

Bruce

