amd64 slower than i386 on identical AMD 64 system? / How is hyperthreading handled on amd64?

Thu Mar 16 21:30:29 UTC 2006

On Thu, 16 Mar 2006, Peter Wemm wrote:

> There are a number of weaknesses in the amd64 port too.  In particular,
> the math library does not yet use the generally superior SSE2
> instructions.  This is a real setback because the ABI uses SSE2
> floating point parameter passing.  The effect is that some random libm
> function is given a SSE2 register, which we convert to an x87 fp stack
> register, do the x87 operation, then convert the x87 stack register
> back to a SSE2 register then return the SSE2 result.  This is
> especially unfortunate when the native SSE2 instruction that would
> operate on the SSE2 registers directly is faster.  But, I don't know
> SSE2 nor x87 fpu assembler code very well, so I've done "just enough"
> to get things to work.

Actually, the math library just uses SSE2 (except for long doubles,
where SSE2 can't be used), and anyway SSE2 is only slightly faster than
the FPU for code with scalar interfaces like the math library.  The
"just uses" part is due to gcc.  It just uses SSE2 instructions by
default on amd64.  SSE2 is only slightly faster because most scalar
floating point operations have the same execution latency and throughput
as for the FPU.  SSE2's advantage on scalar code comes mainly from
having more directly accessible registers: 16 xmm registers on amd64,
versus 8 FPU registers of which sometimes only the one at the top of
the stack is directly accessible.  This advantage is often small because
the extra moves needed to access FPU registers can be done in parallel
with other operations.  Note that this parallelism often occurs
automatically due to out-of-order instruction scheduling in the CPU.
Execution latency is very large (e.g., 4 cycles for each of add and mul)
compared with execution throughput (e.g., 1 cycle for an add and a mul),
so there are usually plenty of spare pipeline slots for executing the
moves in parallel.
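
To illustrate the latency/throughput point, here is a minimal sketch
(not code from libm; the function names are made up for this example).
The first loop is one long dependency chain, so each add has to wait
for the previous add's result.  The second splits the work into two
independent chains, which gives the CPU something to overlap.  Spare
slots of this kind are what can absorb extra register moves.  Note that
the two functions may round differently on general inputs, since the
order of additions differs.

```c
#include <assert.h>
#include <stddef.h>

/*
 * One dependency chain: every add needs the result of the previous
 * add, so the loop tends to be limited by add latency.
 */
double
sum_one_chain(const double *x, size_t n)
{
	double s = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		s += x[i];
	return (s);
}

/*
 * Two independent chains: an add into s1 does not depend on the
 * pending add into s0, so the two can overlap in the pipeline.
 * For simplicity this sketch requires an even element count.
 */
double
sum_two_chains(const double *x, size_t n)
{
	double s0 = 0.0, s1 = 0.0;
	size_t i;

	assert(n % 2 == 0);
	for (i = 0; i < n; i += 2) {
		s0 += x[i];
		s1 += x[i + 1];
	}
	return (s0 + s1);
}
```

How much the second version gains, if anything, depends on the CPU and
on what the compiler does with the loops, so it needs to be measured
rather than assumed.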

My benchmarks in libm indicate that 64-bitness + SSE2 end up being a
tiny improvement for single precision and a significant improvement for
double and long double precision (even for long double, where SSE2
cannot be used!), but this is only for versions that don't use the
FPU for transcendental functions, and I think it is mainly from foot
shooting in the 32-bit versions.  The improvement in double precision
is needed to be competitive with the hardware transcendental functions,
and the foot shooting is from heavy use of the GET/SET macros -- these
macros force things to memory and thus tend to cause pipeline stalls.
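
For readers who haven't looked at this kind of code, here is a rough
sketch of the access pattern that GET/SET-style macros stand for.  It
is not the libm source: the function names are hypothetical, and it
assumes that double is IEEE 754 binary64.  The point is only that the
bits of a floating point value are reinterpreted as integers, which
for a value held in an x87 register means going through memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: double is a 64-bit IEEE 754 value. */
static_assert(sizeof(double) == sizeof(uint64_t),
    "sketch assumes 64-bit double");

/* Return the upper 32 bits (sign, exponent, top of the mantissa). */
uint32_t
high_word(double d)
{
	uint64_t bits;

	memcpy(&bits, &d, sizeof(bits));	/* reinterpret, not convert */
	return ((uint32_t)(bits >> 32));
}

/* Return d with its upper 32 bits replaced by hi. */
double
with_high_word(double d, uint32_t hi)
{
	uint64_t bits;

	memcpy(&bits, &d, sizeof(bits));
	bits = ((uint64_t)hi << 32) | (bits & 0xffffffffu);
	memcpy(&d, &bits, sizeof(d));
	return (d);
}

/* Example use: clear the sign bit to get the absolute value. */
double
fabs_bits(double d)
{
	return (with_high_word(d, high_word(d) & 0x7fffffffu));
}
```

Each get followed by a set is a round trip between the floating point
and integer sides.  Whether that round trip actually touches memory
depends on the target and the compiler.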

Bruce