libm functions sensitive to current rounding mode

Fri Mar 13 11:14:05 UTC 2015

On Thu, 12 Mar 2015, enh via freebsd-numerics wrote:

> various SoC vendors have been sending us (Android) assembler
> implementations of various libm functions, and we noticed when looking
> at one change that it was slower on certain inputs than the FreeBSD C
> implementation we were already using. the response was that the cost
> was coping with other rounding modes, and they gave this example:

This seems to be due to bad fixes.  Supporting rounding in the current
rounding mode is only very useful if the rounding is perfect, but perfect
rounding is difficult for complicated functions.  Switching the rounding
mode is incredibly slow on x86 (it takes as long or longer than sin()),
and doing it usually gives worse results.  The result _should_ depend on
the rounding mode.  It is specified to do so for sqrt* and probably for
many other simple functions.  This has

FreeBSD has only been tested to do perfect rounding in all rounding modes
for the following functions:

     ceil*, fabs*, floor*, logb*, rint*, round*, significan*, sqrt*, trunc*
     ceil*, fabs*, floor*, logb*, rint*, round*, significan*, sqrt*, trunc*

Implementing this for the software sqrt* makes sqrt* unusably slow, but
most arches use hardware sqrt*.

> #include <math.h>
> #include <stdio.h>
> #include <fenv.h>
>
> int main(){
>    double x = -0xb.6365e22ee46fp4;
>    fesetround(FE_UPWARD);
>    printf("sin(%f) = %.16e\n", x, sin(x));
>    fesetround(FE_DOWNWARD);
>    printf("sin(%f) = %.16e\n", x, sin(x));
>
>    x = 0x1.921fb54442d16p0;
>    fesetround(FE_UPWARD);
>    printf("sin(%f) = %.16e\n", x, sin(x));
>    fesetround(FE_DOWNWARD);
>    printf("sin(%f) = %.16e\n", x, sin(x));
>    return 0;
> }
>
> see https://android-review.googlesource.com/112700 for the original
> change and related conversation.

I couldn't read this with my old browsers.

> glibc 2.19:
>
> sin(-182.212373) = -2.4759225463534308e-18
> sin(-182.212374) = -2.4759225463534309e-18
> sin(1.570797) = 1.0000000000000000e+00
> sin(1.570796) = 1.0000000000000000e+00

sin() is a good-bad example.  If you "optimize" it by writing it in asm,
then you get enormous errors (multiple tera-ulps) even near small even
multiples of Pi/2.  The 1-ulp difference made by the rounding mode is
in the noise.

Actual results for i387 sin() on i386, with a middle result for FE_TONEAREST
added:

    sin(-182.212373) = -2.7105054312137606e-18
    sin(-182.212374) = -2.7105054312137611e-18
    sin(-182.212374) = -2.7105054312137611e-18
    sin(1.570797) = 1.0000000000000000e+00
    sin(1.570796) = 1.0000000000000000e+00
    sin(1.570796) = 9.9999999999999988e-01

Results should be printed in %a format so that the bits can be seen.

i387 values:

    sin(-182.212373) = -0x1.8ffffffffffffp-59
    sin(-182.212374) = -0x1.9p-59
    sin(-182.212374) = -0x1.9p-59
    sin(1.570797) = 0x1p+0
    sin(1.570796) = 0x1p+0
    sin(1.570796) = 0x1.fffffffffffffp-1

We see that the rounding mode indeed makes a 1-bit difference.

software values on i386 (almost the same as below):

    sin(-182.212373) = -2.4759225463534304e-18
    sin(-182.212374) = -2.4759225463534308e-18
    sin(-182.212374) = 2.2575413760606011e-11
    sin(1.570797) = 1.0000000000000000e+00
    sin(1.570796) = 1.0000000000000000e+00
    sin(1.570796) = 1.0000000002522273e+00
    sin(-182.212373) = -0x1.6d61b58c99c42p-59
    sin(-182.212374) = -0x1.6d61b58c99c43p-59
    sin(-182.212374) = 0x1.8d26ap-36
    sin(1.570797) = 0x1p+0
    sin(1.570796) = 0x1p+0
    sin(1.570796) = 0x1.000000011553bp+0

We see much more from this:
- somehow TONEAREST gives almost the same results as UPWARDS.  It appears
   to work correctly at -182.212373 in the FreeBSD software case only
- enormous errors at -182.212373 from the i387, as predicted (the software
   result is believed to be correct, while the i387 result shows evidence
   of canceling all except the top 5 bits).  All bits except the top bit
   are wrong, so the error is about 2**52 = 4 Pulps (1 Pulp = 1 peta-ulp).
   DOWNWARDS makes a 1 ulp adjustment to this.
- glibc apparently uses software, since it is correct in all cases, except
   it apparently forces a common rounding for UPWARDS and NEAREST and thus
   is certain to be off by 1 ulp for 1 of them.
- the 1.57079x are was obviously chosen to be as close as possible to Pi/2
   (it should be printed in %a format too, so that we can see its bits
   (these must be more accurate than shown by the decimal format) and see
   that the bits don't depend on the rounding mode).  Since sin() has a
   maximum at this value, and the infinitely-precise result at exactly Pi/2
   is exactly 1, the approximate internal result must be slightly less than
   1, and UPWARDS and TONEAREST should round it to 1, while DOWNWARDS should
   round it to 1 ulp less than 1.  Hardware sin does this.  glibc sin
   apparently does extra work to round in the wrong direction...
- ... but FreeBSD software sin has an enormous error for DOWNWARDS on
   1.570796 -- it rounds in the wrong direction by 0x11553 ulps :-(.
   About 1 Mulp (1 Mulp = 1 mega-ulp).

> (glibc's fixed bugs here in the past. see
> https://sourceware.org/bugzilla/show_bug.cgi?id=3976 for example.)

This is interesting.  Vincent L certainly knows what he is doing.  As
stated in the thread, the the library can do anything it wants in
non-default rounding modes provided it documents this.  It apparently
uses the bad fix of changing the rounding mode to FE_TONEAREST.  This
is slow and guarantees an error of 1 ulp in most cases, so to avoid
much larger errors in hopefully most cases.

Long ago, I all of the most common functions for such errors,
exhaustively for float functions, and didn't find any of more than an
ulp or 2.  I might have only done this on i386.  i386 protects against
such errors in float precision by evaluating intermediate results in
higher precision.  In non-FreeBSD, the default rounding precision is
not double, so i386 should also protect agains such errors in double
precision.  Special cases like multiples of Pi/2 are too hard to find
without special tests except in float precision.

> the FreeBSD libm:
>
> sin(-182.212374) = -2.4759225463534304e-18
> sin(-182.212374) = 2.2575413760606011e-11
> sin(1.570796) = 1.0000000000000000e+00
> sin(1.570796) = 1.0000000002522276e+00

Verified above.

> it looks like e_sqrtl.c, s_fmal.c, s_fma.c, and s_fmaf.c save/restore
> the rounding mode but nothing else does.

This makes these functions incredibly slow.

> let me know if you'd like me to send trivial patches for any of the
> affected functions. (do all the libm functions need this?)

This cannot be fixed using trivial patches.  Almost all transcendental
functions are probably affected.

Actually, it is possible to fix many long double functions without
making them more than twice as slow in the usual case where the rounding
mode is not the default.  The ENTERI() macros does a conditional mode
switch to change the rounding precision from 53 bits to 64 bits, so that
long doubles work properly.  It avoids the switch when it would be null.
   (This reminds me that modern x86 does the same in hardware for at least
    switching the rounding mode, so unconditional switching of the rounding
    mode might not be so slow when the switch is null.

    Also, fpset*() is pessimized in space and time to avoid traps, while
    feset*() never bothered with this.  To switch the rounding mode, you
    should use fesetround(), but fesetprec() doesn't exist so ENTERI() has
    to use fpsetprec().)
You could try adding unconditional rounding mode switches to ENTERI() to
test the long double case.  However, testing long doubles is harder.

exp() is mentioned in the glibc thread.  It has some details of interest
in FreeBSD.  On i386 only, it written in "optimized" asm that is broken
as follows:

X 	/*
X 	 * Extended precision is needed to reduce the maximum error from
X 	 * hundreds of ulps to less than 1 ulp.  Switch to it if necessary.
X 	 * We may as well set the rounding mode to to-nearest and mask traps
X 	 * if we switch.
X 	 */

I thought that this rounding mode switch was a good idea when I wrote this
20+ years ago.  Now I know better, and haven't used this version for 3 years.
The software version never did this.  I now use the software version on
i386 as well as on amd64, after optimizing it a bit so that it is a little
more accurate than this asm (hardware i387) version.  The software version
has been faster for many years on x86.

The precision mode switch here is part of what makes it slow.  glibc
probably assumes that the default rounding precision is 64 bits, so that
it doesn't need this mode switch.  It turns out that 64 bits is just
enough to reduce the error in double precision to below 1 ulp.  expl()
cannot be written like this in asm, since there is no mode switch to
get more than long double precision.  Not switching would give the same
large errors in long double precision that this mode switch fixes for
double precision.

The rounding mode switch accidentally "fixes" any algorithm instability
in i386 exp() in the same way as in glibc.  But I think the extra precision
gives enough stablity.

Bruce