svn commit: r285664 - in head/sys: kern sys

Bruce Evans brde at optusnet.com.au
Sat Jul 18 10:55:18 UTC 2015


On Sat, 18 Jul 2015, Mark Johnston wrote:

> Log:
>  Pass the lock object to lockstat_nsecs() and return immediately if
>  LO_NOPROFILE is set. Some timecounter handlers acquire a spin mutex, and
>  we don't want to recurse if lockstat probes are enabled.

It is an error to call timecounter code from a low-level place like the
mutex implementation.  This workaround depends on all locks in timecounter
handlers being LO_NOPROFILE, and that breaks profiling of these locks.
The breakage is largest if the locks are used for more than timecounters.
E.g., the one for i386's obsolescent i8254 timer has to be used for all
accesses to the i8254.  This lock is configured as MTX_SPIN | MTX_NOPROFILE
and is perhaps the main one fixed by this commit.

KTR handles this problem badly in a different way.  It abuses
get_cyclecount().  The abuse is parametrized by defining KTR_TIME as
get_cyclecount() and using KTR_TIME.  The definition of KTR_TIME is
ifdefed but not in a usable way.  This is of course undocumented, and
KTR_TIME is not a normal kernel option so it is hard to define in
config files.

KTR needs to be about as careful about this as lockstat, since it is
called from low-level mutex code for debugging.  I think there is nothing
like LO_NOPROFILE to turn it off.  Of course you can turn off the mutex
code's calls to it using MTX_QUIET or MTX_NOWITNESS, but this breaks much
more than lock profiling.  (I rarely use any lock debugging, but when
I did I found it useful to run without WITNESS_SKIPSPIN and with
MTX_QUIET and MTX_NOWITNESS turned off for the locks being debugged.
Some console drivers use spinlocks with MTX_QUIET or MTX_NOWITNESS to
avoid various problems, and this breaks witnessing of them even for
non-console uses.  Unbreaking their witnessing uncovered further bugs.)

Abusing get_cyclecount() in KTR doesn't even fix the problem with
timecounters, since some implementations of get_cyclecount() (some
arm, some i386) use the timecounter.

Other bugs with using get_cyclecount():
- even if it returns cycles counted by a clock, the frequency of this
   clock is unknown/unobtainable, so there is no way to convert to
   human-readable units
- some arches return a shifted binuptime() that wraps very quickly
   (256 seconds for arm with ARM_ARCH < 6).  Such counters cannot be
   monotonic.
- some arches (arm again?) used to use binuptime() with swizzling more
   complicated than shifting.  Their counters were further from being
   monotonic.
- arm is now the only arch that uses binuptime() at a low level.
   arm64 uses a stub that returns 1.  ARM_ARCH < 6 does the above.  i386
   returns cpu_ticks().  This is the correct method.  cpu_ticks() uses
   the TSC if possible, else the raw value of the current hardware
   timecounter, adjusted to avoid wrap.
     (This is not actually a correct method, since its implementation is
     no good.  The wrap adjustment is done with no locking.  Also,
     switching the timecounter hardware breaks the get_cyclecount()
     use even more than it breaks the thread resource usage use.  The
     latter has incomplete recalibration.)
- KTR of course has no support for converting to human-readable units.
   ktrdump prints raw timestamps and their differences in %16ju format.
   kern_ktr.c seems to print them only in a ddb command.  This prints
   the raw value in the much worse format %10.10lld.  This uses the
   long long abomination and has a sign error (ktr_timestamp has type
   uint64_t).
- ktr's "timestamps" are thus at best a cookie that increases strictly
   monotonically
- get_cyclecount() is always rdtsc() on modern x86.  rdtsc() is not
   a serializing instruction.  I recently learned how unserial it can
   be -- hundreds of cycles on freefall.  The hundred-cycle case probably
   doesn't happen much in KTR's use.  It happens in the test program
   when the program spins waiting for something and executes rdtsc()
   after that, or so it thinks.  The CPU actually executes rdtsc()
   speculatively hundreds of cycles earlier while spinning.  Back to
   back rdtsc()s seem to be executed in order on freefall, and the
   speculation seems to extend to only the first one.  So there is
   a good chance that get_cyclecount() called on the same CPU gives
   a strictly increasing value.  But across CPUs, hundred-cycle
   differences are the usual case if there is no synchronization of
   the threads (even when the TSCs on the different CPUs are perfectly
   synchronized at the hardware level).
- get_cyclecount() doesn't bother with synchronization if it uses the
   TSC (or equivalent).  This is a feature.  It is supposed to be
   efficient even if this requires it to return garbage.  But this makes
   it unusable for almost everything.  Certainly for timestamps.
- get_cyclecount()'s documentation has rotted to match its misuse.  It
   is now documented as returning a value that is monotonically
   increasing inside each CPU.  But it never obviously did that on x86
   (rdtsc isn't a serializing instruction, and the implementation doesn't
   even have a compiler barrier), and it doesn't do that on arm < 6
   (wrap occurs after 256 seconds).  Its documentation does a good job
   of describing its non-monotonicity across CPUs.  cpu_ticks() also
   doesn't require any synchronization across CPUs.  This is less clear
   since it has null documentation.

   KTR shouldn't try to synchronize threads to get timestamps that are
   monotonic across CPUs relative to some shared (non-relativistic) clock,
   although this would be useful for debugging synchronization: the
   synchronization would be slow (100-200 cycles per sync on freefall)
   and would break debugging of synchronization bugs by adding
   synchronization of its own.

   Timecounters give timestamps that are almost comparable across CPUs,
   even when they use the TSC timecounter.  They do this by reading a
   shared hardware timecounter, perhaps even with a lock that serializes
   everything, or by reading something like a shared TSC that is not
   fully serialized.  For the TSC, low-level code gives serialization
   for each CPU.  The TSCs are hopefully synced at the hardware level,
   so that when you read them at almost the same physical time on separate
   CPUs you get almost the same value.  The physical time might be affected
   relativistically, but CPUs are not yet quite fast enough for that to
   be a large effect.  The largest effect is probably from pipeline
   delays.

Bruce
