libm functions sensitive to current rounding mode
Bruce Evans
brde at optusnet.com.au
Fri Mar 13 13:44:00 UTC 2015
On Fri, 13 Mar 2015, Bruce Evans wrote:
> On Fri, 13 Mar 2015, Bruce Evans wrote:
>
>> On Thu, 12 Mar 2015, enh via freebsd-numerics wrote:
>> ...
>> - ... but FreeBSD software sin has an enormous error for DOWNWARDS on
>> 1.570796 -- it rounds in the wrong direction by 0x11553 ulps :-(.
>> About 1 Mulp (1 Mulp = 1 mega-ulp).
>> ...
> A quick rerun of the test finds problems in much the same areas as the
> glibc thread. I must have put off fixing this since most functions work
> OK and there are larger bugs to fix (like the multi-Pulp errors given
> by using the i387).
> X exp: max_er = 0x1047a00a 0.5087, avg_er = 0.033, #>=1:0.5 = 0:2032
> X exp: max_er = 0x3800000020000000 7516192769.0000, avg_er = 1343.574,
> #>=1:0.5 = 7259405:8747670
> X exp: max_er = 0x39a138016 28.8149, avg_er = 1.162, #>=1:0.5 =
> 7256189:8776222
> X exp: max_er = 0x3800000020000000 7516192769.0000, avg_er = 1343.574,
> #>=1:0.5 = 7259405:8747670
> X exp2: max_er = 0x1006a936 0.5008, avg_er = 0.033, #>=1:0.5 = 0:581
> X exp2: max_er = 0x380000001fffffff 7516192768.9999, avg_er = 1056.147,
> #>=1:0.5 = 6412147:8351665
> X exp2: max_er = 0x20669100 1.0126, avg_er = 0.500, #>=1:0.5 =
> 6429089:8371868
> X exp2: max_er = 0x380000001fffffff 7516192768.9999, avg_er = 1056.147,
> #>=1:0.5 = 6412147:8351665
> X expm1: max_er = 0x105e998d 0.5115, avg_er = 0.031, #>=1:0.5 = 0:508
> X expm1: max_er = 0x380000001fffffff 7516192768.9999, avg_er = 1342.318,
> #>=1:0.5 = 3985247:4976979
> X expm1: max_er = 0x231f0351 1.0976, avg_er = 0.061, #>=1:0.5 = 1888:1014283
> X expm1: max_er = 0x380000001fffffff 7516192768.9999, avg_er = 1342.317,
> #>=1:0.5 = 3986102:4957995
>
> All the functions that are exp*f or use exp*f have problems, except exp*f
> in RP mode. coshf has lesser problems in RP mode. Its errors are then
> large but not enormous. Probably a small error in expf turns into a larger
> one after addition. (My coshf uses an expf kernel like coshl uses an
> expl kernel in -current.)
Bah, I just remembered that exp and trig functions intentionally depend
on round-to-nearest mode, since other modes aren't really supported
and efficiency is gained by assuming this mode.  The optimizations
started in the range reduction for exp2*.  I didn't like exp2* being
so much faster than more important functions, and copied some of its
methods to trig and plain exp.  Poorly documented copies of the
optimization were replicated.  My version now uses inline functions in
math_private.h for this.  There is still a problem choosing the
algorithm.  To work, calculations like 'double x; ... x + 0x1p52 - 0x1p52'
need not only round-to-nearest, but also no extra precision.  With
extra precision, a different adjustment must be used.  But if the caller
can change the precision, this adjustment varies at runtime.  On i386,
it is assumed that the precision is normally 53 bits, and ENTERI() does
the necessary switch to 64-bit precision, which goes with changing the
adjustment to 0x1p63.
Here is my current centralized version:
X /*
X * The rnint() family rounds to the nearest integer for a restricted
X * range of args (up to about 2**MANT_DIG).  We assume that the current
X * rounding mode is FE_TONEAREST so that this can be done efficiently.
X * Extra precision causes more problems in practice, and we only centralize
X * this here to reduce those problems, and have not solved the efficiency
X * problems. The exp2() family uses a more delicate version of this that
X * requires extracting bits from the intermediate value, so it is not
X * centralized here and should copy any solution of the efficiency problems.
X */
X
X static inline double
X rnint(__double_t x)
X {
X /*
X * This casts to double to kill any extra precision. This depends
X * on the cast being applied to a double_t to avoid compiler bugs
X * (this is a cleaner version of STRICT_ASSIGN()). This is
X * inefficient if there actually is extra precision, but is hard
X * to improve on. We use double_t in the API to minimise conversions
X * for just calling here. Note that we cannot easily change the
X * magic number to the one that works directly with double_t, since
X * the rounding precision is variable at runtime on x86 so the
X * magic number would need to be variable. Assuming that the
X * rounding precision is always the default is too fragile. This
X * and many other complications will move when the default is
X * changed to FP_PE.
X */
X return ((double)(x + 0x1.8p52) - 0x1.8p52);
X }
X
X static inline float
X rnintf(__float_t x)
X {
X /*
X * As for rnint(), except we could just call that to handle the
X * extra precision case, usually without losing efficiency.
X */
X return ((float)(x + 0x1.8p23F) - 0x1.8p23F);
X }
Switching the rounding mode to FP_RN here fixed the large error in expf().
I don't actually use this, since the cast to float is slow.  I actually
use
X
X #ifdef LDBL_MANT_DIG
X /*
X * The complications for extra precision are smaller for rnintl() since it
X * can safely assume that the rounding precision has been increased from
X * its default to FP_PE on x86. We don't exploit that here to get small
X * optimizations from limiting the range to double.  We just need it for
X * the magic number to work with long doubles. ld128 callers should use
X * rnint() instead of this if possible. ld80 callers should prefer
X * rnintl() since for amd64 this avoids swapping the register set, while
X * for i386 it makes no difference (assuming FP_PE), and for other arches
X * it makes little difference.
X */
X static inline long double
X rnintl(long double x)
X {
X return (x + __CONCAT(0x1.8p, LDBL_MANT_DIG) / 2 -
X __CONCAT(0x1.8p, LDBL_MANT_DIG) / 2);
X }
X #endif /* LDBL_MANT_DIG */
X
X /*
X * irint() and irint64() give the same result as casting to their integer
X * return type provided their arg is a floating point integer. They can
X * sometimes be more efficient because no rounding is required.
X */
X #if (defined(__amd64__) || defined(__i386__)) && defined(__GNUCLIKE_ASM)
X #define irint(x) \
X (sizeof(x) == sizeof(float) && \
X sizeof(__float_t) == sizeof(long double) ? irintf(x) : \
X sizeof(x) == sizeof(double) && \
X sizeof(__double_t) == sizeof(long double) ? irintd(x) : \
X sizeof(x) == sizeof(long double) ? irintl(x) : (int)(x))
X #else
X #define irint(x) ((int)(x))
X #endif
X
X #define i64rint(x) ((int64_t)(x)) /* only needed for ld128 so not opt. */
X
X #if defined(__i386__) && defined(__GNUCLIKE_ASM)
X static __inline int
X irintf(float x)
X {
X int n;
X
X asm("fistl %0" : "=m" (n) : "t" (x));
X return (n);
X }
X
X static __inline int
X irintd(double x)
X {
X int n;
X
X asm("fistl %0" : "=m" (n) : "t" (x));
X return (n);
X }
X #endif
X
X #if (defined(__amd64__) || defined(__i386__)) && defined(__GNUCLIKE_ASM)
X static __inline int
X irintl(long double x)
X {
X int n;
X
X asm("fistl %0" : "=m" (n) : "t" (x));
X return (n);
X }
X #endif
amd64 doesn't need irintf() or irintd() since it has SSE instructions that
work better than fistl since they don't depend on the rounding mode and
don't require using a different register set, and compilers use these
automatically so the above optimizations are not useful. fistl uses the
current rounding mode, but the rounding mode doesn't matter so much for
irint*() because its arg is already an integer for its intended uses.
The above is used for range reduction in trig functions too. See
e_rem_pio2*.[ch]. Old versions of these used the following very slow
code so they worked in all rounding modes: first, a diff to use the
above:
X Index: e_rem_pio2.c
X ===================================================================
X RCS file: /home/ncvs/src/lib/msun/src/e_rem_pio2.c,v
X retrieving revision 1.25
X diff -u -2 -r1.25 e_rem_pio2.c
X --- e_rem_pio2.c 17 Nov 2012 01:50:07 -0000 1.25
X +++ e_rem_pio2.c 11 Nov 2013 09:17:08 -0000
X @@ -128,12 +128,6 @@
X if(ix<0x413921fb) { /* |x| ~< 2^20*(pi/2), medium size */
X medium:
X - /* Use a specialized rint() to get fn. Assume round-to-nearest. */
X - STRICT_ASSIGN(double,fn,x*invpio2+0x1.8p52);
X - fn = fn-0x1.8p52;
X -#ifdef HAVE_EFFICIENT_IRINT
X + fn = rnint((double_t)x*invpio2);
X n = irint(fn);
X -#else
X - n = (int32_t)fn;
X -#endif
X r = x-fn*pio2_1;
X w = fn*pio2_1t; /* 1st round good to 85 bit */
(The STRICT_ASSIGN() here is very slow on i386.  Using double_t does it
in essentially the same way.  Note that rnint() returns double_t, as
is best for better-written functions.  The slow conversion occurs on
assignment to 'double fn' here.  fn and most other variables here should
have type double_t to avoid conversions.)
X Index: e_rem_pio2.c
X ===================================================================
X RCS file: /home/ncvs/src/lib/msun/src/e_rem_pio2.c,v
X retrieving revision 1.1
X diff -u -2 -r1.1 e_rem_pio2.c
X --- e_rem_pio2.c 19 Aug 1994 09:39:43 -0000 1.1
X +++ e_rem_pio2.c 11 Nov 2013 09:17:08 -0000
X @@ -98,15 +62,75 @@
X GET_HIGH_WORD(hx,x); /* high word of x */
X ix = hx&0x7fffffff;
X +#if 0 /* Must be handled in caller. */
X if(ix<=0x3fe921fb) /* |x| ~<= pi/4 , no need for reduction */
X {y[0] = x; y[1] = 0; return 0;}
X - if(ix<=0x413921fb) { /* |x| ~<= 2^19*(pi/2), medium size */
X - t = fabs(x);
X - n = (int32_t) (t*invpio2+half);
X - fn = (double)n;
X - r = t-fn*pio2_1;
X - w = fn*pio2_1t; /* 1st round good to 85 bit */
X ...
X + fn = rnint((double_t)x*invpio2);
X + n = irint(fn);
X + r = x-fn*pio2_1;
X + w = fn*pio2_1t; /* 1st round good to 85 bit */
X ...
The old method is:
- work with t = fabs(x) so that we know which way the rounding goes.
  Elsewhere, working with |x| often requires another slow step to
  adjust things based on the sign of x.
- add half.  This is part of manual rounding to nearest.
- assign to 'int n'.  The rounding of this is specified to be towards 0.
  This tends to be the slowest step.  It may involve changing the register
  set.  On i387 and early SSE, there is only an instruction that uses the
  current rounding mode, which is usually FP_RN and not known to the
  compiler, so the compiler must switch the rounding mode.
Inefficiencies here cost at least 20 cycles (throughput) on modern x86.
This is a lot when you are trying to make the whole function run in 50
cycles.
The algorithm depends critically on round-to-nearest. Another algorithm
is to extract the bits of x*invpio2 and adjust them in some way. The
equivalent of rounding to nearest is to add a bias before right-shifting
the bits to get n. Then assign n to fn. This tends to be slower because
the FPU has to wait for the integer ALU to calculate n. The above produces
fn as soon as possible. The integer ALU has to wait. This wait can be
a bottleneck, but it is usually not as slow as making the FPU wait.
exp has a chance of rounding results in the correct direction if it just
does arg reduction in that direction, since it is monotonic.  At least for
FP_RP and FP_RM.  Not for FP_RZ -- rounding of the result towards zero
requires rounding of the arg towards minus infinity.
Trig functions have no similar chances, since they are not monotonic.
Arg reduction almost always gives an arg that is too large or too small.
Then the result may be too large or too small, respectively, depending
on the slope of the function and the rounding mode. I don't plan to
fix this. Just keep the error down to 1.5 ulps.
Bruce