svn commit: r213281 - head/lib/libc/amd64/gen

Bruce Evans brde at optusnet.com.au
Thu Sep 30 17:33:29 UTC 2010


On Thu, 30 Sep 2010, Dimitry Andric wrote:

> On 2010-09-30 05:46, Bruce Evans wrote:
> ...
>> This file probably shouldn't exist, especially on amd64.  There are 4 or 5
>> versions of ldexp(), and this file implements what seems to be the worst
>> one, even without the bug.
>> ...
>
> The version in libc/gen/ldexp.c is just a copy of msun/src/s_scalbn.c,
> with some things like copysign() directly pasted in.  It even has:
>
> /* @(#)fdlibm.h 5.1 93/09/24 */
>
> at the top.

Bah, I missed this sixth version :-).
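
For what it's worth, the guts of the fdlibm version are just integer
manipulation of the exponent field, roughly like the following sketch
(simplified and illustrative only; the real s_scalbn.c also handles
zero, denormals, Inf/NaN and over/underflow):

#include <stdint.h>
#include <string.h>

/*
 * Simplified sketch of the fdlibm-style scalbn()/ldexp(): adjust the
 * biased exponent bits directly instead of going through fscale.
 * Illustrative only; the cases punted on here are handled in the
 * real msun/src/s_scalbn.c.
 */
static double
scalbn_sketch(double x, int n)
{
	uint64_t bits;
	int k;

	memcpy(&bits, &x, sizeof(bits));
	k = (int)((bits >> 52) & 0x7ff);	/* biased exponent */
	if (k == 0 || k == 0x7ff)
		return (x);			/* punt: 0, denormal, Inf, NaN */
	k += n;
	if (k <= 0 || k >= 0x7ff)
		return (x);			/* punt: would under/overflow */
	bits = (bits & ~(0x7ffULL << 52)) | ((uint64_t)k << 52);
	memcpy(&x, &bits, sizeof(x));
	return (x);
}

No i387 instructions and no xmm<->i387 moves, which is presumably why
it wins in the timings quoted below.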

>> Testing indicates that the fdlibm C version is 2.5 times faster than the
>> asm versions on amd64 on a core2 (ref9), while on i386 the C version is
>> only 1.5 times faster.  The C code is a bit larger so benefits more from
>> being called from a loop.  The asm code uses a slow i387 instruction, and
>> on amd64 it has to do expensive moves from xmm registers to i387 ones and
>> back.
>> 
>> Times for 100 million calls:
>> 
>>       amd64 libc ldexp:      3.18 seconds
>>       amd64 libm asm scalbn: 2.96
>>       amd64 libm C scalbn:   1.30
>>       i386  libc ldexp:      3.13
>>       i386  libm asm scalbn: 2.86
>>       i386  libm C scalbn:   2.11
>
> Seeing these results, I propose to just delete
> lib/libc/amd64/gen/ldexp.c and lib/libc/i386/gen/ldexp.c, which will
> cause the amd64 and i386 builds to automatically pick up
> lib/libc/gen/ldexp.c instead, which effectively is the fdlibm
> implementation.  (And no more clang workarounds needed. :)

I like this idea.

Does anyone have ideas for better testing?  The loop also favors
machines with multiple pipelines and/or out-of-order execution.
Especially with the latter I think it is possible for several iterations
to be in progress at once (looks like an average of about 1.5 for
AthlonXP and later in other similar loop benchmarks).  In other
benchmarks I use a volatile variable to be more sure of defeating
unwanted compiler optimizations, but I don't want to enforce serialization
since non-benchmarks don't do that.  In libm functions, the largest
optimizations come from avoiding internal serialization as much as
possible.  Using the i387 functions tends to defeat this since there is
only 1 ALU for them (unlike for i387 addition, etc.; there are 2 ALUs
for that on AthlonXP and later).  Perhaps the i387 functions will be
relatively faster again someday when there are more ALUs for them and
better microcode in them, but x86 architects apparently consider this
a low priority and/or the microcode is too hard to make better than
ordinary instructions.
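
For concreteness, the timings above came from roughly this kind of
loop (a sketch only, not the exact harness): 100 million calls, with a
volatile sink so the compiler can't discard the calls, but no
serialization between iterations.

#include <math.h>
#include <stdio.h>
#include <time.h>

/* Volatile sink so the calls can't be optimized away. */
static volatile double sink;

int
main(void)
{
	clock_t t0, t1;
	int i;

	t0 = clock();
	for (i = 0; i < 100000000; i++)
		sink = ldexp(1.5, i & 0x3f);	/* vary the exponent a little */
	t1 = clock();
	printf("%.2f seconds\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
	return (0);
}

A fully serialized variant would feed each result back into the next
call's argument, but as above I don't want to enforce that, since
non-benchmark code doesn't.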

I think big functions using ordinary instructions are OK if they are
slightly faster than i387 functions, since if they aren't called much
then it doesn't matter and if they are called much then they will stay
cached.  But in the latter case, they will push other code out of caches;
I don't know how to quantify this.

Bruce

