svn commit: r223211 - head/sys/x86/x86
Jung-uk Kim
jkim at FreeBSD.org
Mon Jun 20 23:41:09 UTC 2011
On Saturday 18 June 2011 07:08 am, Bruce Evans wrote:
> On Fri, 17 Jun 2011, Jung-uk Kim wrote:
> > Log:
> > Teach the compiler how to shift the TSC value efficiently. As noted
> > in r220631, the compiler sometimes inserts redundant instructions to
> > preserve the unused upper 32 bits even when the value is cast to a
> > 32-bit value. Unfortunately, the problem seems to become more
> > serious when the value is shifted, especially on amd64.
>
> Er, I tried to point out how to optimize this code before (but
> didn't reply to your reply), and it's not by using more asm.
>
> > Modified: head/sys/x86/x86/tsc.c
> > ==============================================================================
> > --- head/sys/x86/x86/tsc.c	Fri Jun 17 21:31:13 2011	(r223210)
> > +++ head/sys/x86/x86/tsc.c	Fri Jun 17 21:41:06 2011	(r223211)
> > @@ -461,7 +461,7 @@ init_TSC_tc(void)
> >  	tsc_timecounter.tc_quality = 1000;
> >
> >  init:
> > -	for (shift = 0; shift < 32 && (tsc_freq >> shift) > max_freq; shift++)
> > +	for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++)
> >  		;
> >  	if (shift > 0) {
> >  		tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
>
> shift == 32 (or even shift == 31) is unreachable. A shift of 31
> will shift 2GHz down to 1 Hz, or support physically impossible
> frequencies like 2**33 GHz. OTOH, shifts of up to 63 are supported
> by the slow gcc code.
It is unreachable but it is a safety belt. Please see below.
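For context, that loop just picks the smallest shift that brings tsc_freq
under max_freq. A standalone sketch of the same calculation, with made-up
values rather than the kernel's:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t tsc_freq = 2800000000ULL;	/* assume a 2.8 GHz TSC */
	uint64_t max_freq = 0xffffffffULL;	/* assumed cap, for illustration only */
	int shift;

	/* Smallest shift that brings tsc_freq at or below max_freq. */
	for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++)
		;
	printf("shift = %d\n", shift);	/* 0 here; 1 for a hypothetical 8 GHz part */
	return (0);
}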
> > @@ -579,6 +579,9 @@ tsc_get_timecount(struct timecounter *tc
> >  static u_int
> >  tsc_get_timecount_low(struct timecounter *tc)
> >  {
> > +	uint32_t rv;
> >
> > -	return (rdtsc() >> (int)(intptr_t)tc->tc_priv);
> > +	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> > +	: "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>
> Lexical style bug (indentation of second line of the asm).
I followed the sys/amd64/include/atomic.h style. OTOH,
sys/amd64/include/cpufunc.h uses a slightly different style. I wasn't
sure which is correct because style(9) does not say anything specific
about inline assembly. :-(
> > +	return (rv);
> >  }
>
> Just return the shift of the low 32 bits (and change
> tc_counter_mask to match) like I said. This loses only the
> accidental ability for the timecounter to work for more than a few
> seconds when interrupts are stopped by something like ddb, since
> any shift count that loses too many of the low 32 bits will not
> work for other reasons. For example, suppose that the TSC
> frequency is 8G-1 Hz, which is unavailable except possibly in
> research labs. This must be shifted by 1 to fit in 32 bits. If we
> use only the low 32 bits, then we end up with only 31 significant
> bits and tsc_get_timecount_low() wraps after ~2 seconds instead of
> after the best possible for this shift of ~4 seconds. If we shift
> by 7 more, as we do in the SMP case, then if we start with 32 bits
> then we end up with 24 bits, but the wrap still takes 2 seconds; if
> we start with 64 bits then we end up with 32 bits and the wrap
> takes 4*2**7 = 512 seconds. But wrap times longer than 1/HZ times
> a few are not needed. 2 seconds is already at least 100 or 1000
> times longer than needed, depending on HZ. The case where the
> unscaled frequency is 4G-1Hz and !SMP gives a shift count of 0 and
> a wrap time of ~4 seconds. Whatever is done to make that case work
> (say, not allowing a fully tickless kernel with HZ = 0), works
> almost as well up to an unscaled frequency of 8GHz which is still
> far off.
>
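If I follow, the suggested variant would be roughly this (a sketch of the
proposal, not committed code):

/*
 * Sketch: shift only the low 32 bits of the TSC and narrow
 * tc_counter_mask so the timecounter only claims the bits that
 * survive the shift.
 */
static u_int
tsc_get_timecount_low32(struct timecounter *tc)
{

	return ((u_int)rdtsc() >> (int)(intptr_t)tc->tc_priv);
}

/*
 * At init time, after computing `shift', the mask would narrow to match:
 *
 *	tsc_timecounter.tc_counter_mask = 0xffffffffu >> shift;
 */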
> No one will notice these micro-optimizations, but I agree that the
> redundant instructions are ugly. I get the following on i386 for
> the original version with an old source tree:
>
> % #APP
> % rdtsc
> % #NO_APP
> % movl 8(%ebp), %ecx
> % movl 28(%ecx), %ecx
> % shrdl %edx, %eax
> % shrl %cl, %edx
> % testb $32, %cl
> % je .L3
> % movl %edx, %eax
> % xorl %edx, %edx
> % .L3:
>
> The last 4 instructions are not redundant, but are needed to
> support shifts of up to 63 (maybe 64).
They are not redundant, but I wanted to get rid of the check entirely.
That's why the 32 -> 31 change was made in the first place.
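To spell that out, shrd with 32-bit operands masks its count to 5 bits, so
the count has to stay below 32 anyway; in C terms the committed asm computes
roughly the following (a sketch for illustration, not the committed code):

/*
 * C equivalent of "rdtsc; shrd %cl, %edx, %eax" (sketch only).
 * lo/hi are the %eax/%edx halves returned by rdtsc.  The CPU masks
 * the shrd count to 5 bits, which is what the 31 bound guarantees
 * statically.
 */
static u_int
shrd_equivalent(uint32_t lo, uint32_t hi, int shift)
{

	return ((u_int)((((uint64_t)hi << 32) | lo) >> (shift & 0x1f)));
}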
FYI, on amd64, we get this from the old code:
rdtsc
movzbl 48(%rdi), %ecx
salq $32, %rdx
mov %eax, %eax
orq %rax, %rdx
shrq %cl, %rdx
movl %edx, %eax
ret
This is slightly better than the i386 version, i.e., there is no
conditional jump, but it is still ugly.
> I tried masking the shift count with 0x1f so that the shift count
> is known to be < 32, but this just gave an extra instruction for
> the masking.
Yup.
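Presumably that experiment was something along these lines (my guess, not
the actual change):

static u_int
tsc_get_timecount_low_masked(struct timecounter *tc)
{

	/* Mask the count so the compiler can assume it is < 32. */
	return (rdtsc() >> ((int)(intptr_t)tc->tc_priv & 0x1f));
}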
> It's even worse with rdtsc() converted to u_int first like I want:
>
> % movl %ebx, (%esp)
> % movl %esi, 4(%esp)
> % #APP
> % rdtsc
> % #NO_APP
> % movl %eax, %ebx
> % movl 8(%ebp), %eax
> % movl 4(%esp), %esi
> % movl 28(%eax), %ecx
> % movl %ebx, %eax
> % movl (%esp), %ebx
> % # some frame pointer epilogue reordered here
> % shrl %cl, %eax
>
> The second case may be what you already fixed on amd64 (only?) --
> use rdtsc32() instead of (u_int)rdtsc().
>
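For reference, rdtsc32() asks for %eax alone, so the 64-bit value is never
formed. From memory it is roughly the following; see cpufunc.h for the real
definition:

static __inline uint32_t
rdtsc32(void)
{
	uint32_t rv;

	__asm __volatile("rdtsc" : "=a" (rv) : : "edx");
	return (rv);
}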
> I've always thought that the dynamic shift is overengineered, and
> now I like it even less. The following is efficient and works well
> enough in all currently physically possible cases:
>
> % /*
> %  * Don't really need a separate one for `low', but now it costs less
> %  * (1 shift instruction at runtime and some space).  Must change
> %  * tc_counter_mask to match.
> %  */
> % u_int
> % tsc_get_timecount_low(struct timecounter *tc)
> % {
> % #ifdef SMP
> % 	/*
> % 	 * Works up to 1024 GHz, assuming that nontemporalness scales with
> % 	 * freq.  I think 8 is too many.  But now do extra for SMP indep.
> % 	 * of freq.
> % 	 */
> % 	return (((u_int)rdtsc()) >> 8);	/* gens rdtsc; shrl $8,%eax */
> % #else
> % 	/* Works up to 8 GHz. */
> % 	return (((u_int)rdtsc()) >> 1);	/* gens rdtsc; shrl %eax */
> % #endif
> % }
The dynamic shift may be overengineered, but it is useful. For example,
r223211 gives us a fairly good optimization on amd64:
movq 48(%rdi), %rcx
rdtsc
shrd %cl, %edx, %eax
ret
i386 isn't too impressive, though:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
movl 28(%eax), %ecx
rdtsc
shrd %cl, %edx, %eax
popl %ebp
ret
I just couldn't convince GCC to generate code like this without inline
asm, unfortunately. :-(
Jung-uk Kim