svn commit: r242835 - head/contrib/llvm/lib/Target/X86
Bruce Evans
brde at optusnet.com.au
Mon Nov 12 11:05:05 UTC 2012
On Mon, 12 Nov 2012, Bruce Evans wrote:
> On Sun, 11 Nov 2012, Dimitry Andric wrote:
>> It works just fine now with clang. For the first example, I get:
>>
>> 	pushl %ebp
>> 	movl %esp, %ebp
>> 	andl $-32, %esp
>>
>> as prolog, and for the second:
>>
>> 	pushl %ebp
>> 	movl %esp, %ebp
>> 	andl $-16, %esp
>
> Good.
>
> The andl executes very fast. Perhaps not as fast as subl on %esp,
> because subl is a normal ALU instruction and so more likely to be
> optimized (they nominally have the same speed, but %esp is magic).
> Unfortunately, it seems to be impossible to both align the stack and
> reserve some space on it in 1 instruction -- the andl might not
> reserve any.
I lost kib's reply to this. He said something agreeing about %esp
being magic on Intel CPUs starting with the PentiumPro.
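So a compiler that wants to both align the stack and reserve space for
locals has to emit a 2-instruction sequence, something like this (a
sketch; the reservation size of 32 here is arbitrary):

	pushl %ebp
	movl %esp,%ebp
	andl $-16,%esp		# align; reserves anywhere from 0 to 15 bytes
	subl $32,%esp		# reserve the locals separately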
The following quick test shows no problems on a Xeon X5650 (freefall)
or an Athlon64:
asm(" \n\
.globl main \n\
main: \n\
	movl $266681734,%eax \n\
#	movl $201017002,%eax \n\
1: \n\
	call foo1 \n\
	decl %eax \n\
	jne 1b \n\
	ret \n\
\n\
foo1: \n\
	pushl %ebp \n\
	movl %esp,%ebp \n\
	andl $-16,%esp \n\
	call foo2 \n\
	movl %ebp,%esp \n\
	popl %ebp \n\
	ret \n\
\n\
foo2: \n\
	pushl %ebp \n\
	movl %esp,%ebp \n\
	andl $-16,%esp \n\
	call foo3 \n\
	movl %ebp,%esp \n\
	popl %ebp \n\
	ret \n\
\n\
foo3: \n\
	pushl %ebp \n\
	movl %esp,%ebp \n\
	andl $-16,%esp \n\
	call foo4 \n\
	movl %ebp,%esp \n\
	popl %ebp \n\
	ret \n\
\n\
foo4: \n\
	pushl %ebp \n\
	movl %esp,%ebp \n\
	andl $-16,%esp \n\
	call foo5 \n\
	movl %ebp,%esp \n\
	popl %ebp \n\
	ret \n\
\n\
foo5: \n\
	pushl %ebp \n\
	movl %esp,%ebp \n\
	andl $-16,%esp \n\
	call foo6 \n\
	movl %ebp,%esp \n\
	popl %ebp \n\
	ret \n\
\n\
foo6: \n\
	pushl %ebp \n\
	movl %esp,%ebp \n\
	andl $-16,%esp \n\
	call foo7 \n\
	movl %ebp,%esp \n\
	popl %ebp \n\
	ret \n\
\n\
foo7: \n\
	pushl %ebp \n\
	movl %esp,%ebp \n\
	andl $-16,%esp \n\
	call foo8 \n\
	movl %ebp,%esp \n\
	popl %ebp \n\
	ret \n\
\n\
foo8: \n\
	pushl %ebp \n\
	movl %esp,%ebp \n\
	andl $-16,%esp \n\
#	call foo9 \n\
	movl %ebp,%esp \n\
	popl %ebp \n\
	ret \n\
");
Build this on an i386 system so that it is compiled in 32-bit mode.
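To get cycles/iteration, put the asm() in a C file and time it with
something like this (the file name is arbitrary):

% cc -O -o stacktest stacktest.c
% time ./stacktest

The loop counts appear to be the CPU frequency in Hz divided by 10
(~2.67 GHz for the X5650, ~2.01 GHz for the Athlon64) -- that is a
guess about their intent, but if so, cycles/iteration is just 10 times
the reported user time in seconds; in general it is
user_seconds * cpu_Hz / loop_count.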
This takes 56-57 cycles/iteration on the Athlon64 and 50-51
cycles/iteration on the X5650. Changing the andls to subls of 16
doesn't change this. Removing all the andls and subls doesn't change
this on the Athlon64, but on the X5650 it is 4-5 cycles faster. This
shows that the gcc pessimization is largest on the X5650 :-). Adding
"pushl %eax; popl %eax" before the calls to foo[2-8] adds 35-36
cycles/iteration on the Athlon64 but only 6-7 on the X5650. I know
that some Athlons don't optimize pushl/popl well (maybe when they are
close together or near a stack pointer change, as here); apparently
the Athlon64 is one such.
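For concreteness, the modified functions in that last test look like
this (a sketch, showing foo1; foo[2-7] change the same way):

foo1:
	pushl %ebp
	movl %esp,%ebp
	andl $-16,%esp
	pushl %eax		# the extra pair
	popl %eax
	call foo2
	movl %ebp,%esp
	popl %ebp
	ret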
Bruce