svn commit: r223211 - head/sys/x86/x86
Jung-uk Kim
jkim at FreeBSD.org
Mon Jun 20 23:41:09 UTC 2011
On Saturday 18 June 2011 07:08 am, Bruce Evans wrote:
> On Fri, 17 Jun 2011, Jung-uk Kim wrote:
> > Log:
> > Teach the compiler how to shift the TSC value efficiently. As noted
> > in r220631, the compiler sometimes inserts redundant instructions to
> > preserve the unused upper 32 bits even when the value is cast to a
> > 32-bit value. Unfortunately, the problem seems to become more
> > serious when the value is shifted, especially on amd64.
>
> Er, I tried to point out how to optimize this code before (but
> didn't reply to your reply), and it's not by using more asm.
>
> > Modified: head/sys/x86/x86/tsc.c
> > ==============================================================================
> > --- head/sys/x86/x86/tsc.c	Fri Jun 17 21:31:13 2011	(r223210)
> > +++ head/sys/x86/x86/tsc.c	Fri Jun 17 21:41:06 2011	(r223211)
> > @@ -461,7 +461,7 @@ init_TSC_tc(void)
> >  	tsc_timecounter.tc_quality = 1000;
> >
> >  init:
> > -	for (shift = 0; shift < 32 && (tsc_freq >> shift) > max_freq; shift++)
> > +	for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++)
> >  		;
> >  	if (shift > 0) {
> >  		tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
>
> shift == 32 (or even shift == 31) is unreachable. A shift of 31
> will shift 2GHz down to 1 Hz, or support physically impossible
> frequencies like 2**33 GHz. OTOH, shifts of up to 63 are supported
> by the slow gcc code.
It is unreachable but it is a safety belt. Please see below.
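For context, that loop just picks the smallest shift that brings tsc_freq
under max_freq. A standalone sketch of the same calculation, with made-up
values rather than the kernel's:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t tsc_freq = 2800000000ULL;	/* assume a 2.8 GHz TSC */
	uint64_t max_freq = 0xffffffffULL;	/* assumed cap, for illustration only */
	int shift;

	/* Smallest shift that brings tsc_freq at or below max_freq. */
	for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++)
		;
	printf("shift = %d\n", shift);	/* 0 here; 1 for a hypothetical 8 GHz part */
	return (0);
}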
> > @@ -579,6 +579,9 @@ tsc_get_timecount(struct timecounter *tc
> >  static u_int
> >  tsc_get_timecount_low(struct timecounter *tc)
> >  {
> > +	uint32_t rv;
> >
> > -	return (rdtsc() >> (int)(intptr_t)tc->tc_priv);
> > +	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> > +	: "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>
> Lexical style bug (indentation of second line of the asm).
I followed the sys/amd64/include/atomic.h style. OTOH,
sys/amd64/include/cpufunc.h uses a slightly different style. I wasn't
sure which is correct because style(9) does not say anything specific
about inline assembly. :-(
> > +	return (rv);
> >  }
>
> Just return the shift of the low 32 bits (and change
> tc_counter_mask to match) like I said. This loses only the
> accidental ability for the timecounter to work for more than a few
> seconds when interrupts are stopped by something like ddb, since
> any shift count that loses too many of the low 32 bits will not
> work for other reasons. For example, suppose that the TSC
> frequency is 8G-1 Hz, which is unavailable except possibly in
> research labs. This must be shifted by 1 to fit in 32 bits. If we
> use only the low 32 bits, then we end up with only 31 significant
> bits and tsc_get_timecount_low() wraps after ~2 seconds instead of
> after the best possible for this shift of ~4 seconds. If we shift
> by 7 more, as we do in the SMP case, then if we start with 32 bits
> then we end up with 24 bits, but the wrap still takes 2 seconds; if
> we start with 64 bits then we end up with 32 bits and the wrap
> takes 4*2**7 = 512 seconds. But wrap times longer than 1/HZ times
> a few are not needed. 2 seconds is already at least 100 or 1000
> times longer than needed, depending on HZ. The case where the
> unscaled frequency is 4G-1Hz and !SMP gives a shift count of 0 and
> a wrap time of ~4 seconds. Whatever is done to make that case work
> (say, not allowing a fully tickless kernel with HZ = 0), works
> almost as well up to an unscaled frequency of 8GHz which is still
> far off.
>
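If I follow, the suggested variant would be roughly this (a sketch of the
proposal, not committed code):

/*
 * Sketch: shift only the low 32 bits of the TSC and narrow
 * tc_counter_mask so the timecounter only claims the bits that
 * survive the shift.
 */
static u_int
tsc_get_timecount_low32(struct timecounter *tc)
{

	return ((u_int)rdtsc() >> (int)(intptr_t)tc->tc_priv);
}

/*
 * At init time, after computing `shift', the mask would narrow to match:
 *
 *	tsc_timecounter.tc_counter_mask = 0xffffffffu >> shift;
 */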
> No one will notice these micro-optimizations, but I agree that the
> redundant instructions are ugly. I get the following on i386 for
> the original version with an old source tree:
>
> % #APP
> % rdtsc
> % #NO_APP
> % movl 8(%ebp), %ecx
> % movl 28(%ecx), %ecx
> % shrdl %edx, %eax
> % shrl %cl, %edx
> % testb $32, %cl
> % je .L3
> % movl %edx, %eax
> % xorl %edx, %edx
> % .L3:
>
> The last 4 instructions are not redundant, but are needed to
> support shifts of up to 63 (maybe 64).
They are not redundant, but I wanted to get rid of the check entirely.
That's why the 32 -> 31 change was made in the first place.
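To spell that out, shrd with 32-bit operands masks its count to 5 bits, so
the count has to stay below 32 anyway; in C terms the committed asm computes
roughly the following (a sketch for illustration, not the committed code):

/*
 * C equivalent of "rdtsc; shrd %cl, %edx, %eax" (sketch only).
 * lo/hi are the %eax/%edx halves returned by rdtsc.  The CPU masks
 * the shrd count to 5 bits, which is what the 31 bound guarantees
 * statically.
 */
static u_int
shrd_equivalent(uint32_t lo, uint32_t hi, int shift)
{

	return ((u_int)((((uint64_t)hi << 32) | lo) >> (shift & 0x1f)));
}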
FYI, on amd64, we get this from the old code:
rdtsc
movzbl 48(%rdi), %ecx
salq $32, %rdx
mov %eax, %eax
orq %rax, %rdx
shrq %cl, %rdx
movl %edx, %eax
ret
This is slightly better than the i386 version, i.e., there is no
conditional jump, but it is still ugly.
> I tried masking the shift count with 0x1f so that the shift count
> is known to be < 32, but this just gave an extra instruction for
> the masking.
Yup.
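Presumably that experiment was something along these lines (my guess, not
the actual change):

static u_int
tsc_get_timecount_low_masked(struct timecounter *tc)
{

	/* Mask the count so the compiler can assume it is < 32. */
	return (rdtsc() >> ((int)(intptr_t)tc->tc_priv & 0x1f));
}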
> It's even worse with rdtsc() converted to u_int first like I want:
>
> % movl %ebx, (%esp)
> % movl %esi, 4(%esp)
> % #APP
> % rdtsc
> % #NO_APP
> % movl %eax, %ebx
> % movl 8(%ebp), %eax
> % movl 4(%esp), %esi
> % movl 28(%eax), %ecx
> % movl %ebx, %eax
> % movl (%esp), %ebx
> % # some frame pointer epilogue reordered here
> % shrl %cl, %eax
>
> The second case may be what you already fixed on amd64 (only?) --
> use rdtsc32() instead of (u_int)rdtsc().
>
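For reference, rdtsc32() asks for %eax alone, so the 64-bit value is never
formed. From memory it is roughly the following; see cpufunc.h for the real
definition:

static __inline uint32_t
rdtsc32(void)
{
	uint32_t rv;

	__asm __volatile("rdtsc" : "=a" (rv) : : "edx");
	return (rv);
}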
> I've always thought that the dynamic shift is overengineered, and
> now I like it even less. The following is efficient and works well
> enough in all currently physically possible cases:
>
> % /*
> %  * Don't really need a separate one for `low', but now it costs less
> %  * (1 shift instruction at runtime and some space).  Must change
> %  * tc_counter_mask to match.
> %  */
> % u_int
> % tsc_get_timecount_low(struct timecounter *tc)
> % {
> % #ifdef SMP
> % 	/*
> % 	 * Works up to 1024 GHz, assuming that nontemporalness scales with
> % 	 * freq.  I think 8 is too many.  But now do extra for SMP indep.
> % 	 * of freq.
> % 	 */
> % 	return (((u_int)rdtsc()) >> 8);	/* gens rdtsc; shrl $8,%eax */
> % #else
> % 	/* Works up to 8 GHz. */
> % 	return (((u_int)rdtsc()) >> 1);	/* gens rdtsc; shrl %eax */
> % #endif
> % }
The dynamic shift may be overengineered, but it is useful. For example,
r223211 gives us a fairly good optimization on amd64:
movq 48(%rdi), %rcx
rdtsc
shrd %cl, %edx, %eax
ret
i386 isn't too impressive, though:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
movl 28(%eax), %ecx
rdtsc
shrd %cl, %edx, %eax
popl %ebp
ret
I just couldn't convince GCC to generate code like this without inline
asm, unfortunately. :-(
Jung-uk Kim