svn commit: r219679 - head/sys/i386/include

Bruce Evans brde at optusnet.com.au
Thu Mar 17 12:42:30 UTC 2011


On Wed, 16 Mar 2011, Jung-uk Kim wrote:

> On Wednesday 16 March 2011 01:45 pm, Roman Divacky wrote:
>> On Wed, Mar 16, 2011 at 12:32:56PM -0400, Jung-uk Kim wrote:
>>> On Tuesday 15 March 2011 08:45 pm, Maxim Dounin wrote:
>>>> This isn't really different as long as GENERIC kernel used, as
>>>> GENERIC defines I486_CPU.
>>>
>>> Fixed in r219698, sorry.
>>>
>>> Actually, I think we should remove i486 from GENERIC at some
>>> point. It has too many limitations.  For example, I really love
>>> to implement atomic 64-bit mem read/write using cmpxchg8b (no
>>> 0xf00f joke, please) but I cannot do that cleanly without
>>> removing I486 support or checking cpu_class at run-time. :-(

I'm not sure how f00f applies to cmpxchg8b, but we still maintain
the f00f workaround (perhaps not well enough to keep it actually
working) though there might be more i486's still running FreeBSD
than there are i586's with the f00f bug, since i486's might be
reimplemented into embedded systems but i586's with the f00f bug
might have been replaced and reimplementations of i586's wouldn't
implement the f00f bug.

>> if we drop i486 I think it makes sense to require something that
>> has at least SSE2, thus we can have the same expectations as on
>> amd64.
>>
>> and we can use sse2 unconditionally (str*, mem* etc.)
>
> This is a proof-of-concept patch for sys/x86/isa/clock.c:
>
> http://people.freebsd.org/~jkim/clock.diff

Please include patches in mail.  It is hard to refer to lines in urls.

> You see the complexity, just because I wanted to load 64-bit value
> atomically... :-(

I still think it should use a generation count.  (It can't use a mutex
due to deadlock and other problems.)

% Index: sys/x86/isa/clock.c
% ===================================================================
% --- sys/x86/isa/clock.c	(revision 219697)
% +++ sys/x86/isa/clock.c	(working copy)
% @@ -62,6 +62,7 @@ __FBSDID("$FreeBSD$");
% 
%  #include <machine/clock.h>
%  #include <machine/cpu.h>
% +#include <machine/cputypes.h>
%  #include <machine/intr_machdep.h>
%  #include <machine/ppireg.h>
%  #include <machine/timerreg.h>
% @@ -245,40 +246,85 @@ getit(void)
%  	return ((high << 8) | low);
%  }
% 
% -static __inline void
% -delay_tsc(int n)
% +#ifdef __i386__
% +static __inline uint64_t
% +atomic_fetch_quad_i386(uint64_t *p)
%  {
% -	uint64_t start, end, now;
% +	uint64_t v;
% 
% -	sched_pin();
% -	start = rdtsc();
% -	end = start + (tsc_freq * n) / 1000000;

This reminds me to point out that DELAY() still has the old i386 code
to manually optimize the division corresponding to this, for the i8254
case with n < 256.  Plain i386 is no longer supported, and even i486's
aren't so slow, and newer CPUs have much faster divisions, and anyway,
gcc has been doing similar optimizations automatically for more than
15 (?) years -- it will turn 32-bit division into a multiplication and
shift if possible.  It only does this if this if it is possible to do
exactly, while the manual code doesn't need to be very precise so may
be able to do it when gcc can't.

However, the general case is a 64 by 32 bit division, and gcc still
generates very slow code for this (a __udivdi3 call).  This takes about
65 cycles on an Athlon64.  Since this is below the resolution of DELAY()
on an Athlon64 (even underclocked to not less than 33 MHz), it doesn't
matter.  It might be just enough to worry about on an i486.  But the
optimization belongs in gcc and __udivdi3.  The i8254 code uses the
general case for n >= 256 since even i386/16's can execute __udivdi3
in 256 uS.

% -	do {
% -		cpu_spinwait();
% -		now = rdtsc();
% -	} while (now < end || (now > start && end < start));
% -	sched_unpin();
% +	/* i486 does not support SMP. */
% +	__asm __volatile(
% +	"	pushfl ;			"
% +	"	cli ;				"
% +	"	movl (%1), %%eax ;		"
% +	"	movl 4(%1), %%edx ;		"
% +	"	popfl"
% +	: "=&A" (v) : "r" (p));
% +	return (v);
%  }

Asms in *.c are style bugs.  intr_disable() and intr_restore() exist as
asms in cpufunc.h so that no asm in *.c is needed for cli/sti.

% 
% -static __inline void
% -delay_timecounter(struct timecounter *tc, int n)
% +static __inline uint64_t
% +atomic_fetch_quad_i586(uint64_t *p)
%  {
% -	uint64_t end, now;
% +	uint64_t v;
% +
% +	__asm __volatile(
% +	"	movl %%ebx,%%eax ;		"
% +	"	movl %%ecx,%%edx ;		"
% +	"	" MPLOCKED " cmpxchg8b (%1)"
% +	: "=&A" (v) : "r" (p) : "cc");
% +	return (v);
% +}

This should be in atomic.h, with the runtime feature test there.  atomic.h
still has the compile-time feature test CPU_DISABLE_CMPXCHG which gives
a version of cmpset32 using cli/sti, though it no longer has the i386
ifdef for this.

% +
% +static __inline uint64_t
% +_fetch_frequency(uint64_t *p)
% +{
% +
% +	if (cpu_class == CPUCLASS_486)
% +		return (atomic_fetch_quad_i386(p));
% +	return (atomic_fetch_quad_i586(p));
% +}

cpu_class should never be used.  This should test the cpu feature for
cmpxchg8b.  Athlon64 docs say that cmpxchg8b can give "Invalid opcode",
if its bit is not set in cpu feature, so it is not clear if cmpxchg8b
is available even on all Athlon64's.  You have a problem if any CPU
that supports SMP doesn't support cmpxchg8b.

% +#else
% +#define	_fetch_frequency(p)	(*(p))
% +#endif
% +
% +static __inline int
% +_delay(int n)
% +{
% +	struct timecounter *tc;
% +	uint64_t end, freq, now;
%  	u_int last, mask, u;
% +	int use_tsc;
% 
% -	mask = tc->tc_counter_mask;
% -	last = tc->tc_get_timecount(tc) & mask;
% -	end = tc->tc_frequency * n / 1000000;
% +	tc = timecounter;
% +	freq = _fetch_frequency(&tsc_freq);
% +	use_tsc = tsc_is_invariant && freq != 0;
% +	if (use_tsc) {
% +		mask = ~0u;
% +		sched_pin();
% +		last = rdtsc();
% +	} else {
% +		if (tc->tc_quality <= 0)
% +			return (0);
% +		freq = _fetch_frequency(&tc->tc_frequency);
% +		mask = tc->tc_counter_mask;
% +		last = tc->tc_get_timecount(tc);
% +	}
% +	last &= mask;
% +	end = freq * n / 1000000;
%  	now = 0;
%  	do {
%  		cpu_spinwait();
% -		u = tc->tc_get_timecount(tc) & mask;
% +		u = (use_tsc ? rdtsc() : tc->tc_get_timecount(tc)) & mask;
%  		if (u < last)
%  			now += mask - last + u + 1;
%  		else
%  			now += u - last;
%  		last = u;
%  	} while (now < end);
% +	if (use_tsc)
% +		sched_unpin();
% +	return (1);
%  }
% 
%  /*

This does seem simpler than a generation count, since it doesn't require
modifying places that set the code, but I don't see how it can work
without modifying other places.  There seems to be nothing to prevent
atomic_fetch_quad_i586 reading 2 identical half-initialized values.
Apart from minor races involving the modification running slow than
the cmpxchg8b can see any changes, the modification may be preempted
while half done, giving a very large race window.  When the kernel
was not preemptable, the race was limited to interrupt handlers.  Now
it extends to all threads that don't block on the lock being used by
the sysctl.

But!, when the kernel was not preemptable, there was no CPU running
faster than 4GHz, and tsc_freq was only 32 bits, so there was no
problem with updating it atomically enough (we don't care about using
stale values).  There is probably still no 386 running faster than 4GHz,
and only a few overclocked newer ones running fast than 4HGz.  Thus
the race is mostly in the future, and we don't need these complications
yet for the non-i486 case or probably ever for the i486 case.

The problem can be pushed to much higher than 4 GHz by scaling tsc_freq
a little; make tsc_freq the actual frequency divided by 4, and use
rdtsc() / 4 instead of rdtsc() to reach 16 GHz with a 32-bit tsc_freq.
Only urandom would notice the lost 2 bits, and it wouldn't divide by 4.

% @@ -289,7 +335,6 @@ getit(void)
%  void
%  DELAY(int n)
%  {
% -	struct timecounter *tc;
%  	int delta, prev_tick, tick, ticks_left;
% 
%  #ifdef DELAYDEBUG
% @@ -298,15 +343,8 @@ DELAY(int n)
%  	static int state = 0;
%  #endif
% 
% -	if (tsc_freq != 0) {
% -		delay_tsc(n);
% +	if (_delay(n))
%  		return;
% -	}
% -	tc = timecounter;
% -	if (tc->tc_quality > 0) {
% -		delay_timecounter(tc, n);
% -		return;
% -	}
%  #ifdef DELAYDEBUG
%  	if (state == 0) {
%  		state = 1;

Still have the delay before code to debug it.

Bruce


More information about the svn-src-head mailing list