svn commit: r242835 - head/contrib/llvm/lib/Target/X86

Bruce Evans brde at optusnet.com.au
Mon Nov 12 11:05:05 UTC 2012


On Mon, 12 Nov 2012, Bruce Evans wrote:

> On Sun, 11 Nov 2012, Dimitry Andric wrote:

>> It works just fine now with clang.  For the first example, I get:
>> 
>>        pushl   %ebp
>>        movl    %esp, %ebp
>>        andl    $-32, %esp
>> 
>> as prolog, and for the second:
>> 
>>        pushl   %ebp
>>        movl    %esp, %ebp
>>        andl    $-16, %esp
>
> Good.
>
> The andl executes very fast.  Perhaps not as fast as subl on %esp,
> because subl is normal so more likely to be optimized (they nominally
> have the same speeds, but %esp is magic).  Unfortunately, it seems to
> be impossible to both align the stack and reserve some space on it in
> 1 instruction -- the andl might not reserve any.

I lost kib's reply to this.  He said something agreeing about %esp
being magic on Intel CPUs starting with the Pentium Pro.
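
(Not from the original mail, just a sketch for concreteness: a function
like

	/* hypothetical example; clang is assumed to emit an andl prolog */
	void f(void)
	{
		char buf[64] __attribute__((aligned(32)));
		asm volatile("" : : "r" (buf) : "memory");	/* keep buf alive */
	}

should get the first prolog quoted above, and because the andl only
aligns, reserving the space for buf still takes a separate instruction,
e.g. something like:

	pushl	%ebp
	movl	%esp,%ebp
	andl	$-32,%esp	# align %esp; reserves 0-31 bytes of slack
	subl	$64,%esp	# reserve known space; a multiple of 32
				# preserves the alignment

so aligning and reserving really does cost two instructions.)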

The following quick test shows no problems on a Xeon X5650 (freefall) or
Athlon64:

@ asm("					\n\
@ .globl main				\n\
@ main:					\n\
@ 	movl	$266681734,%eax		# loop count		\n\
@ 	# movl	$201017002,%eax		# alternate loop count	\n\
@ 1:					\n\
@ 	call	foo1			\n\
@ 	decl	%eax			\n\
@ 	jne	1b			\n\
@ 	ret				\n\
@ 					\n\
@ foo1:					# foo2-foo7 repeat this pattern	\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		# standard frame setup	\n\
@ 	andl	$-16,%esp		# align stack to 16 bytes	\n\
@ 	call	foo2			\n\
@ 	movl	%ebp,%esp		# undo the alignment	\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo2:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo3			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo3:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo4			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo4:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo5			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo5:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo6			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo6:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo7			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo7:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo8			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo8:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	# call	foo9			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ ");

Build this on an i386 system so that it is compiled in 32-bit mode.
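
(A hedged recipe, not in the original mail: save the block as, say,
align.c and build and time it with something like

	cc -o align align.c	# on amd64, cc -m32 -o align align.c
	time ./align

The loop counts look like they were chosen as CPU clock / 10
(266681734 * 10 ~= 2.67 GHz, the X5650's clock; the commented-out
201017002 would fit a ~2.0 GHz Athlon64), so if that guess is right,
user seconds * 10 gives cycles/iteration directly.)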

This takes 56-57 cycles/iteration on Athlon64 and 50-51 cycles/iteration
on the X5650.  Changing the andls to subls of 16 doesn't change this.
Removing all the andls and subls doesn't change it on Athlon64, but on
the X5650 it is 4-5 cycles faster.  This shows that the gcc pessimization
is largest on the X5650 :-).  Adding "pushl %eax; popl %eax" before the
calls to foo[2-8] adds 35-36 cycles/iteration on Athlon64, but only 6-7
on the X5650.  I know some Athlons don't optimize pushl/popl well (maybe
when they are close together or near a stack pointer change, as here).
Apparently Athlon64 is one such.
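
(Sketches of the two modifications described above, not from the
original mail, shown for foo1's body:

	# variant 1: replace the alignment with a plain reservation
	subl	$16,%esp	# instead of andl $-16,%esp

	# variant 2: keep the andl, add a dummy push/pop before the call
	andl	$-16,%esp
	pushl	%eax
	popl	%eax
	call	foo2

with the same change repeated before each of the calls to foo[2-8].)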

Bruce
