extra-precision bugs in clang on i386 even with __SSE*_MATH__

Wed May 22 05:03:04 UTC 2013

clang with certain march= in CFLAGS uses SSE for double and/or float
operations even on i386 although the ABI doesn't really allow this,
and sets __SSE2_MATH__ and/or __SSE_MATH__ to indicate this.  It is
well known that this breaks the definitions of float_t, double_t and
FLT_EVAL_METHOD, because FreeBSD headers haven't been updated to support
clang; in particular they know nothing of __SSE*_MATH__.

I recently noticed a related bug that is not due to missing support
in FreeBSD.  Like gcc, clang doesn't discard extra precision on casts or
assignments, as is fuzzily specified in C90 and explicitly specified in
C99).  Using SSE breaks extra precision by preventing it from
occurring in most cases, but it can still occur if a double or float
is somehow allocated in an i387 register.  This occurs mainly for
function calls, because the ABI requires FP functions to return results
in extra-precision registers, and it is normal and normally good to
actually return extra-precision results.  clang then fails to clip the
extra precision even when explicitly requested to do so using a cast
or assignment.  E.g.

@ #include <math.h>
@ #include <stdio.h>
@ 
@ static double tiny = 0x1p-31;
@ 
@ int
@ main(void)
@ {
@ 	printf("tiny = %a\n", tiny);
@ 	printf("cos(tiny) may have extra precision, and is %La\n",
@ 	    (long double)cos(tiny));
@ 	printf("(double)cos(tiny) should be 1, but is %La\n",
@ 	    (long double)(double)cos(tiny));
@ 	return (0);
@ }

This works correctly on amd64 (amd64 happens not to use i386 fcos, but
even if it did then the extra precision in the fcos result would be
clipped by the ABI returning the result in an xmm regiser), but on i386
it gives:

@ tiny = 0x1p-31
@ cos(tiny) may have extra precision, and is 0x1.fffffffffffffffcp-1
@ (double)cos(tiny) should be 1, but is 0x1.fffffffffffffffcp-1

Both results are just (1 - (long double)tiny * tiny / 2) rounded to
long double precision.

It is normally good for cos(tiny) to have extra precision, so that you
can use it in double precision expressions that produce extra-precision
results according to FLT_EVAL_METHOD.  Without this, you can't write
even the squaring operation x*x as a function that returns x*x, without
the function being unnecessarily less accurate and giving different
results than the expression x*x or a function-like macro that gives
x*x.  Similarly for all functions that return double or float.

However, sometimes you need to clip the extra precision, and compiler
bugs make this difficult to do in a standard way.  Normally the compiler
bugs are features because the clipping operation is slow, and naive
code that assigns intermediate results to double and float variables
will lose precision and speed unnecessarily (non-naive code must use
float_t and double_t for almost everything and clip these only when
necessary.  This works well except across function calls).  FreeBSD
libm uses the STRICT_ASSIGN() macro to do clipping.  It only does
explicit clipping if FLT_EVAL_METHOD != 0 (or !__GNUC__).  The present
bug prevents fixing FLT_EVAL_METHOD to be 0 with clang using SSE*,
since that would break STRICT_ASSIGN() in the case where the rvalue
has extra precision because it is allocated in a register.  It is
difficult to make STRICT_ASSIGN() depend on the register allocation
even in a nonstandard way.

C11 breaks this area even more.  It specifies that extra precision is
always clipped on return, at least in C functions.  This breaks
intentionally returning extra precision for accuracy, and more seriously
it breaks efficiency by requiring the slow clipping operation on every
return (on x86, the clipping operation takes about as long as 2
serially-dependent addition operations and stalls pipelines due to
its serial dependencies).  Buggy compilers should ignore this buggy
specification like they already ignore the buggy specification to
clip precision on casts and assignments.  Clipping on every assignment
is even more inefficient because assignments are more common than returns.
However, there needs to be a standard way to clip (just casting?).

I don't know if this applies to library functions implemented in another
language.  If it does, then all libraries that use raw hardware functions
like i387 fcos are non-conformant, and must be weakened for conformance.

C99's bugs in this area give the bizarre requirement that extra precision
must be clipped on return if and only if it was gained intentionally
(literally, if and only if it is explicit, but implementations that
generate it intentionally will probably have it explicitly).  The C11
regression is a apparently a reaction to this bizarreness.

Examples:
   1. double sq(double x) { return (x * x); }
   2. double sq(double x) { return ((long double)x * x); }
   3. double sq(double x) { return ((double_t)x * x); }

(1) may evaluate x*x in implicit extra precision according FLT_EVAL_METHOD.
If it did, then it should return the extra precision.  C99 permits this,
but C11 doesn't.

(2) evaluates x*x in explicit higher precision (which is strictly higher
iff long double is strictly larger).  Neither C99 nor C11 permit returning
the extra precision, if any.

(3) evaluates x*x in explicit minimally-for-efficiency extra precision.
Neither C99 nor C11 permit returning the extra precision, if any.  So
the most careful code that uses double_t to keep as much extra precision
as possible (subject to efficiency) is required to waste time to kill this
care at function call boundaries.

I use a fake null cast of double_t to double (using null asm) to avoid
the losses in (3).  This works with gcc but not with clang.  clang
seemed to implement part of the C11 breakage and added clipping code
even for converting a type to the same type.
     In numeric terms, for a certain version of logf on certain hardware,
     the clipping costs were approximately:
     - gcc version on i386 with no clipping: ~35 cycles/call
     - clang version on i386 with clipping: ~35+8 cycles/call
     - gcc and clang versions on amd64: ~35+8 cycles/call
     amd64 was slower because the ABI enforces clipping.  The cycle counts
     were not precisely the same for the clipping versions on the different
     arches, but the clipping cost is 8 cycles on both and this dominates
     all the other differences (using SSE and a cleaner ABI on amd64 makes
     remarkbly little difference).
Investigating the clang difference now showed futher bizarreness:
- neither clang nor gcc do the clipping required by C99 for (2) or (3).
   This is apparently because thet know that upcasting x doesn't change
   it but don't know that upcasting x may change the result of the
   expression.
- however, both gcc and clang do the clipping for:
   (3'). double prod(double x, long double y) { return (x * y); }

Here is my asm hack that doesn't work for clang:

@ /*
@  * Convert a float_t to a float without losing extra precision and/or
@  * wasting time, if possible.  This is a sort of inverse for
@  * STRICT_ASSIGN().  It is typically used to avoid C's bizarre return
@  * semantics, which require that extra precision intentionally obtained by
@  * using float_t within the function must be lost on return, while extra
@  * precision accidentally obtained by using float within the function may
@  * be kept on return.
@  */
@ static inline float
@ hackfloat_t(__float_t ft)
@ {
@ #if defined(__i386__) && !defined(__SSE_MATH__)
@ 	float f;
@ 
@ 	__asm("" : "=t" (f) : "0" (ft));
@ 	return (f);
@ #else
@ 	return (ft);
@ #endif
@ }
@ 
@ /* Similarly for double_t. */
@ static inline double
@ hackdouble_t(__double_t dt)
@ {
@ #if defined(__i386__) && !defined(__SSE2_MATH__)
@ 	double d;
@ 
@ 	__asm("" : "=t" (d) : "0" (dt));
@ 	return (d);
@ #else
@ 	return (dt);
@ #endif
@ }
@

Changing (x * y) to hackdouble_t(x * y) fixes the lost accuracy and speed
only for gcc.

Bruce