svn commit: r274489 - in head/sys/amd64: amd64 include

Bruce Evans brde at optusnet.com.au
Mon Nov 24 05:34:58 UTC 2014


On Sun, 23 Nov 2014, David Chisnall wrote:

> On 21 Nov 2014, at 23:26, Scott Long <scott4long at yahoo.com> wrote:
>
>> That’s a good question to look further into.  I didn’t see any measurable differences with this change.  I think that the cost of the function call itself masks the cost of a few extra instructions, but I didn’t test with switching it on/off for the entire kernel
> 
> [ Note: The following is not specific to the kernel ]
>
> The overhead for preserving / omitting the frame pointer is decidedly nonlinear.  On a modern superscalar processor, it will usually be effectively zero, right up until the point that it pushes something out of the instruction cache on a hot path, at which point it jumps to 20-50%, depending on the workload.

It seems to work much the same as padding with nops for that.  I get the
following times for:

X int x;
X 
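X /*
X  * The 13 nops stand in for the body of a small function; the commented-out
X  * instructions mark where the frame pointer prologue/epilogue goes in the
X  * variations timed below.
X  */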
X asm("			\n\
X .p2align 6		\n\
X test:			\n\
X 	# pushl %ebp	\n\
X 	# movl %esp,%ebp	\n\
X 	nop		\n\
X 	nop		\n\
X 	nop		\n\
X 	nop		\n\
X 	nop		\n\
X 	nop		\n\
X 	nop		\n\
X 	nop		\n\
X 	nop		\n\
X 	nop		\n\
X 	nop		\n\
X 	nop		\n\
X 	nop		\n\
X 	# popl %ebp	\n\
X 	ret		\n\
X ");
X 
X void test(void);	/* the function defined in asm above */
X 
X main()
X {
X 	int i;
X 
X 	for (i = 0; i < 201000000; i++)
X 		test();
X }

on an old A64 at 2.01GHz in 32-bit mode:

- above code   (13 bytes of instruction prefetch needed in <test>): 7 cycles
- change 2 nops to pushl %ebp; popl %ebp (same ifetch size):        7 cycles
- change 4 nops to pushl/movl/popl (same ifetch size):              7 cycles
- change 4 nops to 2 * pushl/popl (same ifetch size):               8 cycles
- add 1-2 nops (14-15 bytes...):                                    8 cycles
- add 3 nops (16 bytes...):                                        10 cycles
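
(A note on the numbers: with 201000000 calls at 2.01GHz, each cycle of
per-call cost adds exactly 0.1 seconds of run time (201000000 / 2.01e9 =
0.1), which is presumably why that iteration count was chosen; cycle counts
can then be read straight off the output of time(1).)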

So the cost is indeed 20-50% (actually 3/7 = 43%) in some cases, but only
in weird cases.  You just have to pack about 3 useful instructions (3 ~=
number of independent pipelines) together with the frame pointer instructions
in the first 13 bytes of every function, or maybe move the frame pointer
instructions later (only traps, including NMIs, would notice if they are not
done as soon as possible, provided they are done before function calls).
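
For example, a hypothetical prologue like the following (byte counts are
for i386) keeps the frame pointer instructions plus 3 useful ones well
inside the first 13 bytes:

 	pushl	%ebp		# 1 byte
 	movl	%esp,%ebp	# 2 bytes
 	movl	8(%ebp),%eax	# 3 bytes: load first arg
 	movl	12(%ebp),%edx	# 3 bytes: load second arg
 	xorl	%ecx,%ecx	# 2 bytes: initialize a local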

OTOH, not using a frame pointer costs 1 byte per stack access.  This
might bust the icache anywhere in the function, but probably doesn't.
Busting is more likely at the beginning of the function, where it does
a bunch of loads of args or a bunch of initializations.  At least gcc
likes to generate code like:

 	movl	$0,N(%esp)	# 7 bytes
 	movl	$0,N+4(%esp)
 	...

for initializations, even when the initializations are not at the start
of the function in the source code.  7 bytes is a large x86 instruction,
and just 3 of them may bust the ifetch.
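
The extra byte is easy to see in the encodings: using %esp as a base
register forces a SIB byte, while %ebp-relative addressing doesn't:

 	movl	8(%ebp),%eax	# 8b 45 08    (3 bytes)
 	movl	8(%esp),%eax	# 8b 44 24 08 (4 bytes; SIB byte for %esp)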

> The performance difference was more pronounced on i386, where having an extra GPR for the register allocator to use could make a 10-20% performance difference on some fairly common code (the two big performance wins for x86-64 over IA32 were the increase in number of GPRs and an FPU ISA that wasn't batshit insane).

No, these only make small differences on modern superscalar processors.
More than the 8 explicit GPRs or FPU registers are very rarely needed.  The
FPU ISA is not bad (it is just an enhanced stack ISA), and the ISA makes
little difference anyway.  Any inefficiencies in the ISA are hidden in
pipelines provided the ifetcher can keep up and register renaming doesn't
break down.  Managing the FPU stack is painful in asm but easy for compilers.
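
For example, evaluating (a + b) * c needs no explicit register allocation
at all (a minimal sketch; a, b, c and result are floats in memory):

 	flds	a		# push a onto the FP stack
 	fadds	b		# st(0) += b
 	fmuls	c		# st(0) *= c
 	fstps	result		# store st(0) to result and pop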

> For ISAs with more GPRs, that's less of an issue, although after inlining being able to use %rbp as a GPR can sometimes make a noticeable difference in performance.  In particular, as %rbp is callee-save, it's very useful to be able to use it in non-leaf functions.

The limited number of registers works a bit like the limited ifetch at the 
beginning of a function.  Very occasionally, having 1 extra register or being
1 byte shorter makes a significant difference.

ifetch seems to be easier to optimize than a limited number of registers, so
my results above probably only apply to the CPU tested.  Even an A64 can
execute 2 instances of a function concurrently when the function is called
sequentially.  Fetching from a branch target in advance is a special case
of branch prediction.  The amount of prefetch may be limited, but it is
always possible to cache the results of prefetching better.

Bruce

