Gcc46 and 128 Bit Floating Point

Wed Feb 29 10:16:07 UTC 2012

On Tue, 28 Feb 2012, Thomas D. Dean wrote:

> On 02/17/12 13:03, Thomas D. Dean wrote:
> I have been reading the Core-i7 developers manual and looking at libm. I have 
> been trying to shoe horn some calculations between the sizes of fpu 
> instructions and libgmp.
>
> I think there is little support for 128-bit floating point in the Core-i7 
> 3930K CPU.

That is true.  libm doesn't try to support it at all, except on sparc64,
though most of it would work (as for sparc64) with correct headers.
gcc46's libraries might work, but I would expect problems outside of
libm, starting with printf.

But why would you want it?  It is essentially unusable on sparc64,
since it is several thousand times slower than 80-bit floating point
on i386.  At equal CPU clock speeds, it is only about 1000 times slower.
Most of the factors of 10 are due to the fundamental slowness of
multi-word arithmetic in software and the soft-float implementations
not being very good (I only tested with the old NetBSD/4.4BSD-derived one.
This has been replaced by the Hauser one, which has good chances for
being worse due to its greater generality and correctness, but the old
one has a lot of slop to improve).  A modern x86 is much faster than
an old sparc64, giving about another factor of 10.  64-bit operations
are only about 10 times slower (or more like 3 times slower at
equal CPU clock speeds) on an old sparc64 than on a not-so-modern core2
x86.  The gnu libraries might be better.  So you could hope for only
a factor of 100 slowdown on scalar code.  But modern x86's can also
do vector code, and thus be up to 8 times faster for 32-bit floating
point with AVX.  Really good multi-word libraries might be able to
exploit some vector operations, but I think multi-word operations are
too serial in nature to get much parallelism with them.
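
To illustrate the serial nature, here is a minimal sketch of
"double-double" addition, which keeps a value as the unevaluated sum of
two doubles.  This is not how libquadmath or the soft-float libraries
implement 128-bit types; it is a simpler technique (Knuth's TwoSum)
picked for illustration, and the names dd_t, two_sum and dd_add are made
up here.  It assumes strict IEEE double arithmetic (e.g., SSE2 on amd64,
no -ffast-math).  Almost every line depends on the result of the line
before it, so there is little for a vector unit to do:

```c
/* A value stored as the unevaluated sum hi + lo (hypothetical type). */
typedef struct {
	double hi;
	double lo;
} dd_t;

/*
 * Knuth's TwoSum: hi = fl(a + b) and lo = the rounding error of that
 * addition, so that a + b == hi + lo exactly.  Assumes round-to-nearest
 * IEEE double arithmetic with no extra intermediate precision.
 */
dd_t
two_sum(double a, double b)
{
	dd_t r;
	double bb;

	r.hi = a + b;
	bb = r.hi - a;
	r.lo = (a - (r.hi - bb)) + (b - bb);
	return (r);
}

/*
 * Simple ("sloppy") double-double addition.  One add costs 14 double
 * additions/subtractions, most of them in a single dependency chain.
 */
dd_t
dd_add(dd_t x, dd_t y)
{
	dd_t s;

	s = two_sum(x.hi, y.hi);	/* 6 operations */
	s.lo += x.lo + y.lo;		/* 2 operations */
	return (two_sum(s.hi, s.lo));	/* 6 operations, renormalize */
}
```

For example, adding 1.0, 1e-30 and -1.0 in that order with dd_add()
keeps the 1e-30, while plain double arithmetic loses it in the first
addition.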

> The code which uses __float128 implements functions in software and use the 
> 80-bit fpu instructions to assist.
>
> I believe there is some speed improvement with the 128-bit registers. But, I 
> can find no floating point instructions that operate on 128-bit floating 
> point, like there is for 80-bit.

AVX and below have none for 128 bits.  They only have 32-bit and 64-bit
ones done in parallel (4 32-bit ones or 2 64-bit ones with SSE, or twice
that with AVX).  Emulating 128-bit ones in software then takes 10-1000
times as long as the hardware 64-bit or 80-bit ones.  (80-bit ones on
x86 generally have identical speeds to 64-bit and 32-bit ones, but are
not so parallelizable).

> The bottom line seems to be little gain in floating point operations with the 
> core-i7 CPU.

Expect a loss in speed of up to 1000 times for 128 bits.

Modern x86 wins mainly by better parallelism and scheduling.  Other things
haven't changed much since Athlon-XP in 2001:
- the clock speed got stuck at 2-4GHz
- instructions issued per cycle got stuck at about 3 (2 FP adds or muls,
   plus a useful integer operation and/or load/store).  Maybe slightly
   more with i7.  But parallelism has increased by up to a factor of 4 --
   these instructions can now be 4 64-bit ones in a vector every cycle
   instead of 2 64-bit ones in a vector every 2 cycles
- latency for add/mul decreased from 4 cycles to 3 or maybe 2.
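
The factor of 4 in the list above is just lanes per instruction divided
by cycles between issues.  A small sketch of that arithmetic, using a
made-up helper name (elems_per_cycle) and the register widths mentioned
in this mail (128 bits for SSE, 256 bits for AVX); it counts a single
stream of vector instructions and ignores latency:

```c
/*
 * Peak elements completed per cycle by one stream of vector
 * instructions: lanes per register divided by the number of cycles
 * between successive issues.  Hypothetical helper for illustration.
 */
double
elems_per_cycle(unsigned reg_bits, unsigned elem_bits, unsigned issue_cycles)
{
	unsigned lanes = reg_bits / elem_bits;

	return ((double)lanes / issue_cycles);
}
```

elems_per_cycle(128, 64, 2) gives 1 and elems_per_cycle(256, 64, 1)
gives 4, hence the factor of 4.  With 32-bit elements a 256-bit register
holds 8 lanes.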

> #include <quadmath.h>
> #include <stdio.h>
> int main() {
>  char buf[128];
>  __float128 x = sqrtq(2.0Q);
>  quadmath_snprintf(buf, sizeof buf, "%.45Qf",x);
>  printf("sin(%s) = ",buf);
>  quadmath_snprintf(buf, sizeof buf, "%.45Qf",sinq(x));
>  printf("%s\n",buf);
>  return 0;
> }
>
> gcc46 math.c -o math /usr/local/lib/gcc46/libquadmath.a /usr/lib/libm.a

I don't know the gcc library.  The above has a chance of working, but
it's painful to write when you can't use ordinary printf() directly.

> Looking at the output of objdump -d math shows software implementation of 
> sqrtq() and sinq().  gcc46 does use the fsqrt instruction but not fsin.

It doesn't use fsqrt according to Steve Kargl.  Neither fsqrt nor fsin
would work and neither should be used ever, since they are old, slow
80-bit i387 instructions which are apparently emulated in slow microcode
on all modern x86.  Software can beat them by a little for speed up
to double precision and by a lot for accuracy in all precisions.  Software
has a harder time being fast for 80 and 128 bits, even if the
basic operations are fast.  But 80-bit hardware versions of them are no
help for the 128-bit software versions.

Bruce