amd64 slower than i386 on identical AMD 64 system? / How is hyperthreading handled on amd64?

Sat Mar 18 00:58:06 UTC 2006

On Fri, 17 Mar 2006, JoaoBR wrote:

> On Thursday 16 March 2006 18:30, Bruce Evans wrote:
>> On Thu, 16 Mar 2006, Peter Wemm wrote:
>>> There are a number of weaknesses in the amd64 port too.  In particular,
>>> the math library does not yet use the generally superior SSE2
>>> instructions.  This is a real setback because the ABI uses SSE2
>>> floating point parameter passing.  The effect is that some random libm
>>> function is given a SSE2 register, which we convert to an x87 fp stack
>>> register, do the x87 operation, then convert the x87 stack register
>>> back to a SSE2 register then return the SSE2 result.  This is
>>> especially unfortunate when the native SSE2 instruction that would
>>> operate on the SSE2 registers directly is faster.  But, I don't know
>>> SSE2 nor x87 fpu assembler code very well, so I've done "just enough"
>>> to get things to work.
>>

[The part that I wrote saying that this is not true was clipped.]

> does SSE influence "normal" operations such as disk I/O, memory access and networking?

Not at all on amd64 systems.  Nontemporal memory accesses can be faster
(or slower), but can and should be done, if at all, without using SSE on
amd64 systems.  (On 32-bit i386 systems with SSE1, they are only available
using MMX registers; on 32-bit i386 systems with SSE2, they are available
using MMX, XMM and 32-bit integer registers but the integer registers aren't
wide enough for the accesses to be as efficient as possible.)
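
A minimal sketch of a nontemporal store done through a 64-bit integer
register rather than an XMM register, assuming a compiler that provides
the _mm_stream_si64() intrinsic (it maps to MOVNTI; the intrinsic is
declared in the SSE2 header, but no SSE register is involved).  The
helper name copy_words_nt is made up for illustration, and whether this
is faster or slower than ordinary stores depends on the workload:

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  /* declares _mm_stream_si64() and _mm_sfence() */
#define HAVE_NT_STORE 1
#else
#define HAVE_NT_STORE 0
#endif

/*
 * Hypothetical helper: copy n 64-bit words from src to dst.
 * On amd64 each word is written with a nontemporal store from a
 * general-purpose register; elsewhere it falls back to plain stores.
 */
void
copy_words_nt(long long *dst, const long long *src, size_t n)
{
	size_t i;

	assert(dst != NULL && src != NULL);
	for (i = 0; i < n; i++) {
#if HAVE_NT_STORE
		_mm_stream_si64(&dst[i], src[i]);
#else
		dst[i] = src[i];
#endif
	}
#if HAVE_NT_STORE
	/* Order the nontemporal stores before any later stores. */
	_mm_sfence();
#endif
}
```
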

>> My benchmarks in libm indicate that 64-bitness + SSE2 end up being a
>> tiny improvement for single precision and a significant improvement for
>> double and long double precision (even for long double where SSE2
>> cannot be used!), but this is only for versions that don't use the
>> FPU for transcendental functions, and I think it is mainly from foot
>> shooting in the 32-bit versions.  The improvement in double precision
>> is needed to be competitive with the hardware transcendental functions,
>> and the foot shooting is from heavy use of the GET/SET macros -- these
>> macros force things to memory and thus tend to cause pipeline stalls.
>>
>
> sorry, would you mind saying what you mean by "foot shooting" here?

"Foot shooting" here means using a method that would be hard to unimprove
on by intentionally choosing the worst method from a range of not obviously
wrong methods.
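
As a rough illustration of the GET/SET style of access mentioned above
(a sketch only, not the actual libm macros): the helper names
get_high_word, set_high_word and sketch_fabs are made up here, and IEEE
754 64-bit doubles are assumed.  The point is that the value is
conceptually moved out of a floating point register, through memory,
into an integer register and back again just to touch a few bits:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the upper 32 bits of a double's encoding. */
uint32_t
get_high_word(double d)
{
	uint64_t bits;

	assert(sizeof(d) == sizeof(bits));
	memcpy(&bits, &d, sizeof(bits));	/* FP value -> integer bits */
	return ((uint32_t)(bits >> 32));
}

/* Hypothetical helper: return d with its upper 32 bits replaced by hi. */
double
set_high_word(double d, uint32_t hi)
{
	uint64_t bits;

	memcpy(&bits, &d, sizeof(bits));
	bits = ((uint64_t)hi << 32) | (bits & 0xffffffffu);
	memcpy(&d, &bits, sizeof(d));		/* integer bits -> FP value */
	return (d);
}

/* Clear the sign bit by going through the integer view of the value. */
double
sketch_fabs(double x)
{
	return (set_high_word(x, get_high_word(x) & 0x7fffffffu));
}
```

Doing this once is harmless; doing it several times in the middle of a
short floating point computation is where the stalls described above
would be expected to add up.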

Bruce