Use of C99 extra long double math functions after r236148

Bruce Evans brde at optusnet.com.au
Sun Aug 12 23:00:29 UTC 2012


On Sat, 21 Jul 2012, Peter Jeremy wrote:

> Hi Bruce or das@ or Steve,
>
> I have a question on the following code from s_ccosh.c:
> %        /*
> %         * cosh(NaN + I NaN)  = d(NaN) + I d(NaN).
> %         *
> %         * cosh(NaN +- I Inf) = d(NaN) + I d(NaN).
> %         * Optionally raises the invalid floating-point exception.
> %         * Choice = raise.
> %         *
> %         * cosh(NaN + I y)    = d(NaN) + I d(NaN).
> %         * Optionally raises the invalid floating-point exception for finite
> %         * nonzero y.  Choice = don't raise (except for signaling NaNs).
> %         */
> %        return (cpack((x * x) * (y - y), (x + x) * (y - y)));
>
> x is always NaN so the real part presumably just needs to be quietened
> before returning - ie (x + x) would seem to be sufficient.  Why does
> the code use ((x * x) * (y - y))?

The more complex expression handles more cases with the same code,
mainly so as to avoid extra tests and branches for the classification.
(For exceptional cases, we don't care much whether the expressions in
the return statements are slower because they are more complex so as
to be general, or simpler and faster but larger because the
classification is finer.)  We could try to avoid extra tests and
branches for the usual cases by first doing a coarse classification
and then a fine classification of the exceptional sub-cases.  This
would probably be efficient (if the CPU has reasonable branch
prediction), but it would give larger code.

Such expressions are complicated mainly to get the signs and NaN mixing
right in all cases.

Quoting this line again:
> %        return (cpack((x * x) * (y - y), (x + x) * (y - y)));

> y has no restriction on its value so an arithmetic operation with x is
> a good way to convert it to a NaN.  Wouldn't (y + x) be sufficient?

Remember that cos[h]() is even and sin[h]() is odd.  (x * x) is to get
an even sign for cos[h]() and (x + x) is to get an odd sign for sin[h]().
After undoing this magic, the expression becomes:

> %        return (cpack(cosh(x) * (y - y), sinh(x) * (y - y)));

and it remains to explain the (y - y) terms.
   (It would probably work to use this expression.  cosh() and sinh()
   should get the signs and NaNs right.  We optimize away these calls
   because we can.  This depends on knowing that x is Inf or NaN.

   Oops, here x is only NaN, so there is more to explain:
   - in other return statements, x can be either Inf or NaN.  We copy
     the magic expressions for simplicity.
   - signs are not very important for NaNs, but I like to make them
     consistent so that my test programs don't spew errors for unimportant
     inconsistencies (it is easier to make the inconsistencies not occur than
     to automatically decide if they are related to bugs.  das@ doesn't
     like spending any time on this detail).
   - using the magic expressions gives the same behaviour for NaNs that
     calling the functions would.  For example, cosh(x) naturally wants
     to square x.  It may or may not do this explicitly for NaNs, but it
     should for the same reasons that we should (more so), and in fact it
     does: it returns x*x for both NaNs and Infs.  This gets the sign
     right for Infs, and quietens NaNs, and gives a sign for the NaN
     result in the most natural way (whatever the hardware does for x*x).
   - Here we have the option of quietening NaNs using some other expression,
     perhaps x+y (since we have another arg), or x+x (since we don't care
     about the sign of NaNs).  But this might give gratuitously different
     results, depending on what the hardware does.
   - x86 hardware mostly doesn't touch the sign of NaNs when they are
     operated on.  The main exceptions are that fabs() clears the sign
     bit and negation may toggle the sign bit, depending on how negation
     is implemented by the compiler (i387 fchs toggles the sign, but
     compilers may emit a subtraction instead, which doesn't).  SSE gives
     annoying differences for NaNs, but IIRC these don't involve the sign
     bit for non-mixing operations.
   - the quiet bit can give annoying differences.  NaNs could be mixed
     using the expression x+y (if you prefer signs preserved) or x*y
     (if you prefer signs multiplied).  The result should be some function
     of the inputs.  IIRC, IEEE7xx specifies this, but doesn't specify
     the details.  Most hardware does something like looking at the bits
     of the 2 NaNs and returning the bits of the "largest" one.  The
     main differences are whether the sign and the quiet bits are to be
     included in the decision, and if so, how they affect the ordering.
     There are annoying differences because these bits sometimes affect
     the ordering, depending on the hardware and the operand size.  The
     above operates first purely on each of x and y, in order to quieten
     them before mixing:
       x*y          might use the quiet bit in the ordering decision
       (x*x)*(y-y)  kills the quiet bit before mixing, so that it cannot
                    affect the ordering decision
     Again, the second expression is what happens naturally if you
     write cosh(x)*sinh(y).
   )

Back to explaining the (y - y) terms:

> %        return (cpack(cosh(x) * (y - y), sinh(x) * (y - y)));

These are to optimize away the calls to sin(y) and cos(y).  sin() is odd and cos() is
even, but parity considerations no longer apply for y = +-Inf because
the value is now NaN, so its sign is meaningless.  The sign of this
NaN is also unspecified, and we take advantage of this by letting
it be whatever the hardware gives for (y - y).  We also want to be
compatible with what the function call would do, and fdlibm's
sin() and cos() in fact do exactly this, for the usual technical
reasons.  Now y can be NaN, +-Inf or finite, and (y - y) handles
all cases.  In sin() and cos(), this expression handles the NaN
and +-Inf cases, so we are consistent.

> My understanding is that:
> - Addition is generally faster than multiplication

Apart from the technical (parity) reasons, we don't really care about
efficiency in these exceptional cases, though we are doing perhaps
excessive optimizations to avoid the function calls.

Another organization of the function might start by evaluating the
4 sub-functions in all cases, so that the results can be used in all
cases later.  IIRC, this is not done mainly because it isn't clear
that it doesn't give spurious exceptions.  It also gives efficiency
in a natural way for special but non-exceptional cases like pure
real and pure imaginary.

> - Signs are irrelevant for NaN so merging the sign of x doesn't matter.
> - NaN + NaN returns the (quietened?) left-hand NaN

See above.

> - Inf + NaN returns the (quietened?) right-hand NaN

I forget if IEEE7xx requires this, but some C functions near this one
are required to treat a NaN as an indeterminate value which always combines
with Inf to produce Inf.  The sign of a NaN could reasonably matter
here:

   +Inf + +NaN = +Inf   since the indeterminate value of +NaN cannot be -Inf
   +Inf + -NaN = NaN    since the indeterminate value of -NaN can    be -Inf

> - finite + NaN returns the (quietened?) right-hand NaN
>
> Also, whilst things like ((x + x) * (y - y)) are reasonably efficient

Another point is that using consistent expressions allows common code
for the common parts (like x*x) if that is a good (space) optimization.

> on x86, few (if any) RISC architectures support exceptional conditions
> in hardware.  My understanding is that SPARC would trap back into the
> userland handler (lib/libc/sparc64/fpu) on each operation unless both
> arguments and the result are normalised numbers.  Explicitly fiddling
> with the FPU state would seem faster than multiple traps.

Here it is more the size of the magic expressions that gives slowness.
Almost all of them either start with a NaN or create one.

Even on x86, there is annoying slowness for overflow and underflow in
a MD way.  Exceptional cases for exp*() run much slower on i386(core2)
than on i386(athlon-any(?)) or amd64(core2) or amd64(athlon), because
instructions (if not their results) that overflow and underflow are
handled slowly on core2(i387) but at full speed on all the other
combinations including core2(SSE).

Bruce

