svn commit: r238755 - head/sys/x86/x86

Bruce Evans brde at optusnet.com.au
Thu Jul 26 06:25:05 UTC 2012


On Wed, 25 Jul 2012, Konstantin Belousov wrote:

> On Wed, Jul 25, 2012 at 11:00:41AM -0700, Jim Harris wrote:
>> On Wed, Jul 25, 2012 at 10:32 AM, Konstantin Belousov
>> <kostikbel at gmail.com> wrote:
>>> I also asked Jim to test whether the cause of the TSC sync test failure
>>> is the lack of synchronization between gathering data and testing it,
>>> but it appeared that the reason is a genuine timecounter value going
>>> backward.
>>
>> I wonder if instead of timecounter going backward, that TSC test
>> fails because CPU speculatively performs rdtsc instruction in relation
>> to waiter checks in smp_rendezvous_action.  Or maybe we are saying
>> the same thing.
>
> Ok, the definition of the 'timecounter goes back', as I understand it:
>
> you have two events A and B in two threads, provably ordered, say, A is
> a lock release and B is the same lock acquisition. Assume that you take
> rdtsc values tA and tB under the scope of the lock right before A and
> right after B. Then it should be impossible to have tA > tB.

For the threaded case, there has to be something for the accesses to be
provably ordered.  It is hard to see how the something can be strong
enough unless it serializes all thread state in A and B.  The rdtsc
state is not part of the thread state as known to APIs, but it is hard
to see how threads can serialize themselves without also serializing
the TSC.
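
For concreteness, that definition amounts to something like this in
userland terms (the pthread framing and the names here are made up,
only to pin down what is being claimed):

#include <pthread.h>
#include <stdint.h>
#include <x86intrin.h>		/* __rdtsc() */

static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;
static uint64_t tA, tB;

static void *
thr_a(void *arg)
{
	pthread_mutex_lock(&lk);
	tA = __rdtsc();			/* sampled right before A */
	pthread_mutex_unlock(&lk);	/* event A: the release */
	return (NULL);
}

static void *
thr_b(void *arg)
{
	pthread_mutex_lock(&lk);	/* event B: an acquisition after A */
	tB = __rdtsc();			/* sampled right after B */
	pthread_mutex_unlock(&lk);
	/* If B is ordered after A, tA > tB should be impossible. */
	return (NULL);
}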

For most uses, the scope of the serialization and locking also needs
to extend across multiple timer reads.  Otherwise you can have situations
like:

 	read the time
 		interrupt or context switch
 			read later time in other intr handler/thread
 			save late time
 		back to previous context
 	save earlier time

It is unclear how to even prevent such situations.  You (at least, I)
don't want heavyweight locking/synchronization to prevent the context
switches.  And the kernel rarely if ever does such synchronization.
binuptime() has none internally.  It just spins if necessary until the
read becomes stable.  Most callers of binuptime() just call it.
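
The spin amounts to roughly this (paraphrased from kern_tc.c; field
names from memory):

	struct timehands *th;
	struct bintime bt;
	u_int gen;

	do {
		th = timehands;			/* current generation */
		gen = th->th_generation;
		bt = th->th_offset;
		bintime_addx(&bt, th->th_scale * tc_delta(th));
	} while (gen == 0 || gen != th->th_generation);
	/* bt is now a coherent snapshot; no locks taken, no writes. */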

> I do not think that we can ever observe tA > tB if both threads are
> executing on the same CPU.

I thought that that was the problem, with a single thread and no context
switches seeing the TSC go backwards.  Even then, it would take
non-useful behaviour (except for calibration and benchmarks) like
spinning executing rdtsc to see it going backwards.  Normally there
are many instructions between rdtsc's and the non-serialization isn't
as deep as that.  Using syscalls, you just can't read the timecounter
without about 1000 cycles between reads.  When there is a context switch,
there is usually accidental serialization from locking.
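
That non-useful behaviour amounts to a loop like the one below (a
throwaway userland sketch, not anything in the tree; pin it to one CPU
if the single-CPU claim is what is being tested):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>		/* __rdtsc() */

int
main(void)
{
	uint64_t prev, cur;

	prev = __rdtsc();
	for (;;) {
		cur = __rdtsc();	/* nothing between the reads */
		if (cur < prev)
			printf("tsc went backwards by %ju\n",
			    (uintmax_t)(prev - cur));
		prev = cur;
	}
}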

I care about timestamps being ordered more than most people, and tried
to kill the get*time() APIs because they are weakly ordered relative
to the non-get variants (they return times in the past, and there is
no way to round down to get consistent times).  I tried to fix them
by adding locking and updating them to the latest time whenever a
non-get variant gives a later time (by being used).  This was too slow,
and breaks the design criterion that timecounter calls should not use
any explicit locking.  However, if you want slowness, then you can get
it similarly by fixing the monotonicity of rdtsc in software.  I think
I just figured out how to do this with the same slowness as serialization,
if a locked instruction serializes; maybe less otherwise:

static u_int prev_tsc;		/* last value handed out (32 bits) */

static u_int
monotonic_tsc(void)		/* wrapper and name made up for the sketch */
{
	u_int ptsc, tsc;

spin:
	ptsc = prev_tsc;	/* memory -> local (intentionally !atomic) */
	tsc = rdtsc();		/* only 32 bits for timecounters */
	if (tsc <= ptsc) {	/* I forgot about wrap at first -- see below */
		/*
		 * It went backwards, or stopped.  Could handle more
		 * completely, starting with panic() to see if this
		 * happens at all.
		 */
		return (ptsc);	/* stopped is better than backwards */
	}
	/* Usual case; update (32 bits). */
	if (atomic_cmpset_int(&prev_tsc, ptsc, tsc))
		return (tsc);
	goto spin;
}

The 32-bitness of timecounters is important for the algorithm, and for
efficiency on i386.  We assume that the !atomic read gives coherent
bits.  The value may be in the past.  When tsc <= ptsc, the value is
in the future, so the value must be up to date, unless there is massive
non-seriality with another CPU having just written a value more up
to date than this CPU read.  We don't care about this, since losing
this race is no different from being preempted after we read.  When
tsc > ptsc, we want to write it as 32 bits to avoid the cmpxchg8b
slowness/unportability.  Again, the value may be out of date when
we try to update it, because we were preempted.  We don't care about
this either, as above, but we do detect some cases as a side effect of
checking that ptsc is up to date.  Normally ptsc was up to date when
it was read, but it could easily be out of date when it was checked
by the cmpset, iff another CPU updated it.

This would enforce monotonicity across multiple CPUs even if their TSCs
are very unsynchronized, but you wouldn't want it much then.  But if they
are synchronized to within the serialization time or the 1<<shift time,
then this synchronizes them quite well.

Oops, better add a wraparound check in the comparison.  With only 32
bits, tsc often wraps.  Use 64 bits for the test and depend on TSCs
starting at 0, or any number of bits and a heuristic.  This now gets
complicated.  Elsewhere we depend on timecounters being called often
enough to prevent them wrapping without us noticing, and use dummy
calls to warm them up.  The heuristic would depend on being called
often enough.  Warming up wouldn't be as automatic as reading the
counter -- it would need to reset the heuristic.
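
The cheapest heuristic is probably to do the comparison on the 32-bit
difference interpreted as signed, which works as long as less than 2^31
ticks pass between consecutive calls (about 1 second at 2 GHz):

	/* Instead of "if (tsc <= ptsc)" above: */
	if ((int32_t)(tsc - ptsc) <= 0) {
		/*
		 * Backwards, stopped, or not called for more than 2^31
		 * ticks, in which case the heuristic itself is fooled.
		 */
		return (ptsc);
	}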

Timecounters can go backwards when the scaling of the count for them
overflows, or when the counter wraps.  This can be detected as a bug,
by saving the previous time as above (but now needing more bits).  This
was intentionally not done because the locking for it is slow.  If
backwardness is detected, not much can be done except return the most
forward time read, as above.  Maybe add epsilon each time.
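
If someone wanted the check despite the slowness, it would look
something like this (an untested sketch; prev_bt would need locking, or
a cmpset treatment like the above, to be exact):

	static struct bintime prev_bt;	/* previous time; more bits than 32 */
	struct bintime bt;

	binuptime(&bt);
	if (bintime_cmp(&bt, &prev_bt, <))
		bt = prev_bt;		/* return the most forward time read */
	else
		prev_bt = bt;
	/* Maybe bintime_addx(&prev_bt, 1) here to add an epsilon each time. */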

Bruce

