Use fences for kernel tsc reads.

Sun Jul 29 04:58:41 UTC 2012

On Sat, 28 Jul 2012, Konstantin Belousov wrote:

> This was discussed on somewhat unwieldly thread on svn-src@ as a followup
> to the commit r238755 which uncovered the problem in the first place.
>
> Below is the commit candidate. It changes the fix in r238755 to use CPUID
> instead of rmb(), since rmb() translates to locked operation on i386,
> and experimentation shown that Nehalem's RDTSC can pass LOCK.

At least remove TSC-low instead of building another layer of pessimizations
and source bloat on top of it.

I might not mind even more source code bloat for TSC-low and fences if
it were runtime tests to avoid using them unless necessary.  At least
sysctls to avoid using them.  When the kernel selects TSC-low, ordinary
TSC becomes unavailable.

> ...
> Handling of usermode will follow later.

I hesitate to mention that this doesn't pessimize all uses of rdtsc:
- x86/isa/clock.c uses raw rdtsc32() for DELAY()
- x86/x86/mca.c uses raw rdtsc() for mca_check_status()
- x86/x86/tsc.c uses raw rdtsc() for:
   - calibrating the TSC frequency.  The calibration is otherwise sloppy
     and would only be slightly affected by additional sloppiness or its
     removal.
   - the CPU ticker.  A function pointer is used and the address of the
     static inline rdtsc() is taken.  It becomes not so raw -- a normal
     static function.  I don't like the CPU ticker.  It gives huge
     inaccuracies in some cases.  Most uses of it go through calcru(),
     and it ameliorates some of the bugs.  It enforces monotonicity
     of rusage times for other reasons.  When the inaccuracies or overflow
     causes the time to go backwards, calcru() complains about some cases.
     This is about the only place in the kernel that checks for the time
     going backwars.

     Most uses of the CPU ticker are for setting switchtime on context
     switches.  These uses are far apart, and near heavy locking, and
     on a single CPU, so the TSC as read by them can probably never
     go backwards.  (The same magic that allows switching to be per-CPU
     should allow the TSC to be per-CPU.)
- {amd64,i386}/include/cpu.h uses the CPU ticker for get_cyclecount().
   I don't like get_cyclecount().  It is a bad API resulting from
   previous mistakes in this area.  It is basically the TSC spelled
   differently so that the TSC could be used when the TSC timecounter
   was not properly exported or the hardware doesn't have it.  x86
   hardware doesn't have it, and non-x86 might name it differently even
   if it has it.  When the hardware doesn't have it, a weak replacement
   like the current time counter is used.  But this makes it too slow
   to use, since all places that use it use it because they don't want
   slowness (otherwise they would just use a timecounter directly).
   The selection used to be ifdefed in i386/include/cpu.h.  Now, it
   just uses the CPU ticker on 386 (it still uses a raw rdtsc() on
   amd64).  It is actually documented in section 9, but the much more
   important CPU ticker isn't.  Its comment in at least the i386
   machine/cpu.h still says that it returns a "bogo-time" for random
   harvesting purposes, but the CPU ticker time is not so bogus (it
   must be good enough for thread accounting, using deltas on the same
   CPU), and of course this mistake is now used for more than random
   harvesting.

   On other arches, the implementation of get_cyclecount() is differently
   bogus:
   - on amd64, it is an inline function that calls the inline rdtsc()
   - on arm, it is an inline function that always uses binuptime() and
     never goes through the CPU ticker
   - on ia64, it is #defined as a raw counter API spelled without `()'
     so it looks like a variable, but I guess it is a function
   - on i386, it is an inline function that calls the non-inline
     cpu_ticks(), so it is slower than the old ifdefed version.
     cpu_ticks() goes through at least a function pointer to reach
     the non-inline rdtsc().
   - on mips, it is #defined as a raw counter API spelled with `()'
   - on powerpc, it is a dedicated inline function with inline asm
   - on sparc64, it is an inline function that calls a raw counter API
   That is a lot of differently spelled bloat to support an API that
   should never have existed.  All arches could have just the same
   slowness as i386, with no MD code, by translating it to
   cpu_ticks() as on i386.  Or callers of it could abuse cpu_ticks()
   directly.

   get_cyclecount() has escaped from random harvesting to the following
   places:
   - dev/de/if_devar.h: used for timestamps under a performance-testing
     option that seems to not be the default (so hardly anyone except
     its developer ever used it).  Old (10 or 100 Mbps here?) NIC drivers
     don't need very efficient timestamps, and it would be nice if they
     weren't in unscaled bogotime.  I use microtime() in not-so-old
     (1Gbps) NIC drivers.  This with a TSC works well for timestamping
     events a few usec apart.  It wouldn't work so well with an APIC
     timecounter taking about 1 usec to read.  But the APIC timecounter
     should be fast enough for this use at 100 Mbps.  The i8254 timecounter
     is 5 times slower again, but fast enough at 10 Mbps (original de?).
   - dev/random/harvest.c: the original use.  A cycle counter seems like
     an especially bad choice for randomness.  It just has a bit or two
     of randomness in its low bits.  The serialization bug makes it better
     for this.
   - dev/random/randomdev_soft.c: another original use.
   - kern/init_main.c: seeding for randomdev.
   - kern/kern_ktr.c: for timestamps that should at least be monotonic
     and unique even if they are in unscaled bogotime.  get_cyclecount()
     is highly unsuitable for this, especially when it uses a timecounter
     (don't put any ktr calls in timecounter code, else you might
     get endless recursion).  Apparently, get_cyclecount() was used
     because it was the only portable API to reach the hardware name.
     For this use, get_cyclecount() should not be faked.  Serializing
     might be needed here.  TSC-low lossage is unwanted here -- it might
     break uniqueness.
   - netinet/sctp_os_bsd.h: for tracing.  Like ktr, except it doesn't
     need to worry about recursion.  Perhaps can be more like NIC
     drivers should be and just use a timecounter.

   Anyway, the unserialized TSC serves for providing bogo-time for
   get_cyclecount() even better than it serves for providing time
   for the CPU ticker.

- i386/i386/machdep.c uses raw rdtsc() for recalibrating the TSC frequency.
   Similar sloppyness to initial calibration.
- i386/i386/perfmon.c: TSC ticks are just an an especially important event.
   Serializing would break performance testing by changing the performance
   significantly.  Since this is only for a user API, the breakage from
   the increased overhead would be dominated by existing overheads, as
   for the clock_getttime() syscall.
- include/xen/xen-os.h: home made version of rdtsc() with different
   spelling.  To change the spelling, it should use the standard inline
   and not repeat asm.
- isa/prof_machdep.c.  Like perfmon, but now rdtsc() is called on every
   function call and return for high resolution profiling, and there
   is no slow syscall to already slow this down.  Function call lengths
   down to about 10 nsec can be measured reliably on average.  The
   overhead is already much more than this, but is compensated for, so
   lengths of 10 nsec might remain measureable, but the overhead is
   already too much and adding to it wouldn't help, and might disturb
   the performance too much.  This use really wants serialization, and
   out-of-order timestamps might show up as negative function call times,
   but they never have.

   High resolution kernel profiling never worked for SMP (except in my
   version, it works except for excessive lock contention) and was
   broken by gcc-4 changing the default compiler options, so this problem
   can be ignored for now.

Please trim lots of the above if you reply.

> diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
> index c253a96..101cbb3 100644
> --- a/sys/x86/x86/tsc.c
> +++ b/sys/x86/x86/tsc.c
> ...
> @@ -328,15 +344,26 @@ init_TSC(void)
>
> #ifdef SMP
>
> -/* rmb is required here because rdtsc is not a serializing instruction. */
> +/*
> + * RDTSC is not a serializing instruction, so we need to drain
> + * instruction stream before executing it. It could be fixed by use of
> + * RDTSCP, except the instruction is not available everywhere.

Sentence breaks are 2 spaces in KNF.  There was a recent thread about this.

I used to police style in this file more, and it mostly uses 2 spaces.

> + *
> + * Use CPUID for draining. The timecounters use MFENCE for AMD CPUs,
> + * and LFENCE for others (Intel and VIA) when SSE2 is present, and
> + * nothing on older machines which also do not issue RDTSC
> + * prematurely. There, testing for SSE2 and vendor is too cumbersome,
> + * and we learn about TSC presence from CPUID.
> + */

RDTSC apparently doesn't need full serialization (else fences wouldn't
be enough, and the necessary serialization would be slower).  The
necessary serialization is very complex and not fully described above,
but the above is not a good place to describe many details.  It already
describes too much.  It is obvious that we shouldn't use large code or
ifdefs to be portable in places where we don't care at all about
efficiency.  Only delicate timing requirements would prevent us using
the simple CPUID in such code.  For example, in code very much like
this, we might need to _not_ slow things down, since we might want
to see small differences between the TSCs on different CPUs.

> @@ -592,23 +628,55 @@ sysctl_machdep_tsc_freq(SYSCTL_HANDLER_ARGS)
> SYSCTL_PROC(_machdep, OID_AUTO, tsc_freq, CTLTYPE_U64 | CTLFLAG_RW,
>     0, 0, sysctl_machdep_tsc_freq, "QU", "Time Stamp Counter frequency");
>
> -static u_int
> +static inline u_int
> tsc_get_timecount(struct timecounter *tc __unused)
> {
>
> 	return (rdtsc32());
> }

Adding `inline' gives up to 5 (!) style bugs at different levels:
- `inline' is usually spelled `__inline'.  The former is now standard,
   but sys/cdefs.h never caught up with this and has better magic for
   the latter.  And `inline' doesn't match most existing spellings.
- the function is still forward-declared without `inline'.  The differenc
   is confusing.
- forward declaration of inline functions is nonsense.  It sort of asks
   for early uses to be non-inline and late uses to be inline.  But
   gcc now doesn't care much about the order, since -funit-at-a-time is
   the default.
- this function is never actually inline.  You added calls to the others
   but not this.
- if this function were inline, then this might happen automatically due
   to -finline-functions-called-once.  See below.

>
> -static u_int
> +static inline u_int
> tsc_get_timecount_low(struct timecounter *tc)
> {
> 	uint32_t rv;
>
> 	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> -	: "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> +	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> 	return (rv);
> }

- as above, except now the function can actually be inline.  It is still
   derefenced, so it is not actually inline in all cases.  You added 2
   calls to it, so it is inlined in 2 cases.  gcc now inlines functions
   called once, especially when you don't want this.  Here this is wanted.
   I don't know if gcc considers a use to be "once" if there is a
   single inlineable use and also a non-inlineable use.  I miscounted the
   number of uses at first.  There are now 2 inlineable cases, and
   although inlining would be good, it is no longer "once" so gcc shouldn't
   do it without an explicit __inline.

>
> +static inline u_int
> +tsc_get_timecount_lfence(struct timecounter *tc __unused)
> +{
> +
> +	lfence();
> +	return (rdtsc32());
> +}

Here and for mfence, you call the raw function.  That's why
tsc_get_timecount() is never inline.  I first thought that the patch
was missing support for plain TSC.

I also don't like the rdtsc32() API :-).  Since it is inline, compilers
should be able to ignore the high bits in rdtsc().  But we're micro-
optimizing this to get nice looking code and save a whole cycle or so.
I'm normally happy to sacrifice a cycle for cleaner source code but
not for 4 cycles :-).  If the extra code were before the fence, then
its overhead would be lost in the stalling for the fence.  Perhaps
similatly without fences.  Since it is after, it is more likely to
have a full cost, with a latency of many more than 1 cycle now not
hidden by out-of-order execution.

> +
> +static inline u_int
> +tsc_get_timecount_low_lfence(struct timecounter *tc)
> +{
> +
> +	lfence();
> +	return (tsc_get_timecount_low(tc));
> +}
> +
> +static inline u_int
> +tsc_get_timecount_mfence(struct timecounter *tc __unused)
> +{
> +
> +	mfence();
> +	return (rdtsc32());
> +}
> +
> +static inline u_int
> +tsc_get_timecount_low_mfence(struct timecounter *tc)
> +{
> +
> +	mfence();
> +	return (tsc_get_timecount_low(tc));
> +}
> +
> uint32_t
> cpu_fill_vdso_timehands(struct vdso_timehands *vdso_th)
> {
>

I don't like the indirections for using these functions, but we already
have them.  Linux would modify the instruction stream according to a
runtime probe reduce it to an inline rdtsc in binuptime() etc. if
possible (otherwise modify to add fences as necessary).

Not inlining rdtsc like we do probably reduces serialization problems.

Bruce