svn commit: r238755 - head/sys/x86/x86

Bruce Evans brde at optusnet.com.au
Thu Jul 26 12:31:00 UTC 2012


On Thu, 26 Jul 2012, Konstantin Belousov wrote:

> On Thu, Jul 26, 2012 at 05:35:23PM +1000, Bruce Evans wrote:
>> In fact, there is always a full documented serialization instruction
>> for syscalls, except maybe in FreeBSD-1 compat code on i386, at
>> least on Athlon64.  i386 syscalls use int 0x80 (except in FreeBSD-1
>> compat code, where they use lcalls, and the iret necessary to return
>> from those is serializing on at least Athlon64).  amd64 syscalls use
>> syscall/sysret.  sysret isn't serializing (like far returns), at least
>> on Athlon64, but at least in FreeBSD, the syscall implementation uses
>> at least 2 swapgs's (one on entry and one just before the sysret), and
>> swapgs is serializing, at least on Athlon64.
> Yes, SWAPGS is not documented as serializing on Intels. I reviewed

Isn't that too incompatible?

> the whole syscall sequence for e.g. gettimeofday(2), and there is no
> serialization point for fast path. E.g. ast would add locking and thus
> serialization, as well as return by IRET, but fast path on amd64 has
> no such things.

>>> This function was moved around from time to time and now it sits here:
>>>
>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob_plain;f=arch/x86/vdso/vclock_gettime.c
>>>
>>> It still carries one barrier before rdtsc.  Please see the comments.
>>
>> For safety, you probably need to use the slowest (cpuid) method.  Linux
>> seems to be just using fences that are observed to work.
> No, there is explicit mention of the recommended barriers in the vendor
> documentation, which is LFENCE for Intels, and MFENCE for AMDs. My patch
> just follows what is suggested in documentation.

But you say later that CPUID is needed (instead of just a lock?).  The
original Athlon64 manual doesn't seem to mention MFENCE for RDTSC.
Maybe later manuals clarify that MFENCE works on old CPUs too.

> [Replying to other mail in-place, the thread goes wild]

Too much quoting :-).

> On Thu, Jul 26, 2012 at 04:25:01PM +1000, Bruce Evans wrote:
>> ...
>> For the threaded case, there has to be something for the accesses to be
>> provably ordered.  It is hard to see how the something can be strong
>> enough unless it serializes all thread state in A and B.  The rdtsc
>> state is not part of the thread state as known to APIs, but it is hard
>> to see how threads can serialize themselves without also serializing
>> the TSC.
> TSC timer read is not synchronized, and I found the Linux test for the
> thing I described above. An adapted version is available at
> http://people.freebsd.org/~kib/misc/time-warp-test.c.
> It should be compiled in 32-bit mode only.

My point is that it will normally be synchronized by whatever the threads
do to provide synchronization for themselves.  Only the case of a single
thread doing sequential timer reads should expect the reads to be
monotonic without any explicit synchronization.  I hope this case doesn't
require stalling everything in low-level code.

> On my Nehalem workstation, I get an enormous number of wraps reported
> for RDTSC without CPUID. Adding CPUID back fixes the issue. So at least on
> Nehalems (and probably Westmere, I will test later today) RDTSC can even
> pass LOCKed instructions.

Oh, you mean with the test program, that it needs CPUID because it only
has locks and no fences and its CPUID is commented out.

> Curiously enough, SandyBridge is sane and reports zero wraps, it seems
> Intel fixed the bug.

The original Athlon64 manual doesn't seem to mention locks being sufficient
any more than it mentions fences.

>> I care about timestamps being ordered more than most people, and tried
>> to kill the get*time() APIs because they are weakly ordered relative
>> to the non-get variants (they return times in the past, and there is
>> no way to round down to get consistent times).  I tried to fix them
>> by adding locking and updating them to the latest time whenever a
>> non-get variant gives a later time (by being used).  This was too slow,
>> and breaks the design criteria that timecounter calls should not use
>> any explicit locking.  However, if you want slowness, then you can get
>> it similarly by fixing the monotonicity of rdtsc in software.  I think
>> I just figured out how to do this with the same slowness as serialization,
>> if a locked instruction serializes; maybe less otherwise:
>>
>> spin:
>> 	ptsc = prev_tsc;	/* memory -> local (intentionally !atomic) */
>> 	tsc = rdtsc();		/* only 32 bits for timecounters */
>> 	if (tsc <= ptsc) {	/* I forgot about wrap at first -- see below */
>> 		/*
>> 		 * It went backwards, or stopped.  Could handle more
>> 		 * completely, starting with panic() to see if this
>> 		 * happens at all.
>> 		 */
>> 		return (ptsc);	/* stopped is better than backwards */
>> 	}
>> 	/* Usual case; update (32 bits). */
>> 	if (atomic_cmpset_int(&prev_tsc, ptsc, tsc))
>> 		return (tsc);
>> 	goto spin;
> I do not understand this. The algorithm is clear, but what you propose is
> very heavy-weight compared with just adding LFENCE or MFENCE before rdtsc.
> First, the cache-line for prev_tsc becomes heavy-contended. Second, CAS
> is expensive. LFENCE is fully local to the core it executes on.

I expect the contention to be rare, but then optimization isn't important
either.

But if the problem is fully local, as it apparently is for fences to
fix it, then prev_tsc can be per-CPU with a non-atomic cmpset to access
it.  We don't care if rdtsc gives an old value due to some delay in
copying the result to EDX:EAX any more than we care about an old value
due to being interrupted.  The case where we are interrupted, and
context-switched, and come back on a different CPU, is especially
interesting.  Then we may have an old tsc value from another CPU.
Sometimes we detect that it is old for the new CPU, sometimes not.
There is a problem in theory but I think none in practice.  The switch
could set a flag to tell us to loop (set prev_tsc to a sentinel value),
and it accidentally already does, except with serious out-of-orderness:
switches happen to always read the TSC if the TSC is usable, to maintain
the pcpu switchtime variable.  The read for this will give a tsc value
for the new CPU that is in advance of the old one, provided all the
TSCs are in sync and there is sufficient monotonicity across CPUs.
Then the loop will see that its tsc is old, and repeat.

I am only half serious in proposing the above.  If you want to be slow
then you can do useful work like ensuring monotonicity across all CPUs
in much the same time that is wasted by stalling the hardware until
everything is serialized.

Bruce

