kern/133583: [libm] fma(3) does not respect rounding mode using extended precision

Alexander Best arundel at freebsd.org
Tue Nov 23 23:50:10 UTC 2010


The following reply was made to PR kern/133583; it has been noted by GNATS.

From: Alexander Best <arundel at freebsd.org>
To: bug-followup at freebsd.org
Cc:  
Subject: Re: kern/133583: [libm] fma(3) does not respect rounding mode using extended precision
Date: Tue, 23 Nov 2010 23:40:18 +0000

 this is what bruce evans wrote concerning this issue.
 
 cheers.
 alex
 
 ----- Forwarded message from Bruce Evans <brde at optusnet.com.au> -----
 
 Date: Tue, 16 Nov 2010 05:44:03 +1100 (EST)
 From: Bruce Evans <brde at optusnet.com.au>
 To: Alexander Best <arundel at FreeBSD.org>
 cc: Bruce Evans <brde at optusnet.com.au>, Ulrich Spoerlein <uqs at FreeBSD.org>,
         das at freebsfd.org
 Subject: Re: svn commit: r215237 - head/lib/msun/src
 
 [Cc trimmed]
 
 On Mon, 15 Nov 2010, Alexander Best wrote:
 
 >if you are interested in solving two more msun mysteries, you might want to
 >have a look at #PR kern/133583 and standards/143358.
 
 Here is a quick fix for #133583.
 
 % Index: s_fma.c
 % ===================================================================
 % RCS file: /home/ncvs/src/lib/msun/src/s_fma.c,v
 % retrieving revision 1.5
 % diff -u -2 -r1.5 s_fma.c
 % --- s_fma.c	3 Apr 2008 06:14:51 -0000	1.5
 % +++ s_fma.c	15 Nov 2010 17:44:48 -0000
 % @@ -170,5 +170,5 @@
 %  	zs = ldexp(zs, -spread);
 %  	r = c + zs;
 % -	s = r - c;
 % +	*(volatile *)&s = r - c;
 %  	rr = (c - (r - s)) + (zs - s) + cc;
 %
 
 The basic problem is that s_fma.c uses Dekker's algorithm, which assumes
 working floating point, i.e., not the floating point given by gcc on
 i387.  FreeBSD defaults to rounding to double precision so as to mostly
 avoid this problem for doubles, and msun tries to and mostly succeeds
 in avoiding it using direct methods for floats (using STRICT_ASSIGN()
 and sometimes a more direct volatile cast or variable), but setting
 extended precision as the test program does exposes all double precision
 functions to it.  This is 1 reason why the regression tests shouldn't
 configure extended precision for double precision tests (unless they
 want to test this, and we're not really ready for this since many other
 double precision functions are known to be broken when extended precision
 is configured).
 
 The problem was obvious since it went away with -O0 and with -ffloat-store.
 The above patch is the result of sprinkling volatiles around until I
 found one that worked and then removing the ones that made no difference.
 Perhaps many more are needed.  In particular, the splitting into high
 and low parts won't work as intended.  The test in the PR somehow
 worked without fixing this, probably just because it uses special args that
 are already have enough low bits.
 
 Note that STRICT_ASSIGN() doesn't work for doubles (is just null),
 since it is intentionally optimized for the default FreeBSD rounding
 precision, and using it (when it is not null) is a large pessimization.
 The following functions in msun/src (and few or none elsewhere) use it:
 
 % e_rem_pio2.c:	    STRICT_ASSIGN(double,fn,x*invpio2+0x1.8p52);
 % e_rem_pio2f.c:	    STRICT_ASSIGN(double,fn,x*invpio2+0x1.8p52);
 % k_rem_pio2.c:		STRICT_ASSIGN(double,fw,fw);
 % s_exp2.c:	STRICT_ASSIGN(double, t, x + redux);
 % s_exp2f.c:	STRICT_ASSIGN(float, t, x + redux);
 % s_log1p.c:		STRICT_ASSIGN(double,u,1.0+x);
 % s_log1pf.c:		STRICT_ASSIGN(float,u,(float)1.0+x);
 % s_rint.c:	        STRICT_ASSIGN(double,w,TWO52[sx]+x);
 % s_rint.c:	STRICT_ASSIGN(double,w,TWO52[sx]+x);
 % s_rintf.c:		STRICT_ASSIGN(float,w,TWO23[sx]+x);
 % s_rintf.c:	    STRICT_ASSIGN(float,w,TWO23[sx]+x);
 
 All the double-precision uses here are null and are just clones of the
 float precision code where the use avoids a problem that occurs in the
 usual case.  Since the double precision uses are null, all the functions
 with such uses are broken if someone configures extended precision.
 This includes all trig functions (via *_rem_pio2.c).  I have fixed
 this locally only for the trig functions.  In k_rem_pio2_.c, I essentially
 just use a variant of STRICT_ASSIGN(), but a better way is to add and
 subtract a bias (something like 0x1.8p53 (52 DBL_MANT_DIG - 1) to push
 the unwanted bits to oblivion).  I use such a bias to avoid the
 STRICT_ASSIGN() in in s_exp2f.c:
 
 % --- s_exp2f.c	Fri Feb 22 14:46:43 2008
 % +++ z22/s_exp2f.c	Fri Feb 22 14:48:37 2008
 % @@ -33,14 +33,50 @@
 %  #include "math_private.h"
 % 
 % +/* To be moved nearer to <float.h>. */
 % +#if FLT_EVAL_METHOD == 0
 % +#define	FLT_T_MANT_DIG	FLT_MANT_DIG
 % +#else
 % +/*
 % + * XXX this hack works for all supported arches.  There is some doubt that
 % + * float_t actually is double on arm and powerpc as claimed in <math.h>,
 % + * but it is certainly not long double, and the reduction only needs
 % + * float_t to be at least as large as the evaluation precision.
 % + */
 % +#define	FLT_T_MANT_DIG	DBL_MANT_DIG
 % +#endif
 % +
 %  #define	TBLBITS	4
 %  #define	TBLSIZE	(1 << TBLBITS)
 % 
 % +/*
 % + * Ensure that redux fits in a float.  float_t would always work, but
 % + * float works on all supported arches and is more efficient on some.
 % + */
 % +#if FLT_T_MANT_DIG >= FLT_MAX_EXP + TBLBITS
 % +#error "Unsupported type for float_t"
 % +#endif
 % +
 %  static const float
 %      huge    = 0x1p100f,
 % -    redux   = 0x1.8p23f / TBLSIZE,
 % +    redux   = __CONCAT(0x1.8p, FLT_T_MANT_DIG) / 2 / TBLSIZE,
 % +/*
 % + * Domain [-0.03125, 0.03125], range ~[-6.015e-11, 5.960e-11]
 % + * |exp2(x) - p(x)| < 2**-33.95
 % + */
 %      P1	    = 0x1.62e430p-1f,
 %      P2	    = 0x1.ebfbe0p-3f,
 %      P3	    = 0x1.c6b348p-5f,
 %      P4	    = 0x1.3b2c9cp-7f;
 % +#if 0
 % +/*
 % + * Domain [-0.03125, 0.03125], range ~[-2.4881e-12, 2.4881e-12]:
 % + * |exp2(x) - p(x)| < 2**-38.54
 % + */
 % +static const double
 % +    P1 =  6.9314718016256749e-1,	/*  0x162e42fec39c72.0p-53 */
 % +    P2 =  2.4022650689211292e-1,	/*  0x1ebfbdff5df1dd.0p-55 */
 % +    P3 =  5.5505736316289703e-2,	/*  0x1c6b3f74700eea.0p-57 */
 % +    P4 =  9.6183434022792860e-3;	/*  0x13b2c832d5fd3b.0p-59 */
 % +#endif
 % 
 %  static volatile float twom100 = 0x1p-100f;
 % @@ -69,4 +105,8 @@
 %   *
 %   * Accuracy: Peak error < 0.501 ulp; location of peak: -0.030110927.
 % +#if 0
 % + * Accuracy: Peak error < 0.50004 ulp; location of peak: -0.00918653142
 % + * 5168 cases out of 2**32 are incorrectly rounded in round-to-nearest 
 mode.
 % +#endif
 %   *
 %   * Method: (equally-spaced tables)
 % @@ -95,5 +135,5 @@
 %  {
 %  	double tv, twopk, u, z;
 % -	float t;
 % +	float_t t;
 %  	uint32_t hx, ix, i0;
 %  	int32_t k;
 % @@ -118,12 +158,18 @@
 % 
 %  	/* Reduce x, computing z, i0, and k. */
 % -	STRICT_ASSIGN(float, t, x + redux);
 % +	t = x + redux;
 % +#if FLT_T_MANT_DIG == 24
 %  	GET_FLOAT_WORD(i0, t);
 % +#elif FLT_T_MANT_DIG == 53
 % +	GET_LOW_WORD(i0, t);
 % +#else
 % +#error "Unsupported type for float_t"
 % +#endif
 % ...
 
 This also has complications to use float_t.  float_t will normally be
 double or long double, and the bias (redux here) depends on the number
 of bits in the type used in the redux.  The basic idea is to use the
 natural type for operations (float_t, not float here), so that adding
 and subtracting biases just works and the correct bias can be known
 at compile time.  This is painful to configure (see above), is broken
 by anywone configuring extra precision (e.g., the 53 in the above
 depends on the precision remaining at double).
 
 My uncommitted logl() uses a modified subset of Dekker's algorithm without
 hitting the problems with extra precision or needing to use STRICT_ASSIGN()
 to avoid them (any extra precision just gives extra precision in the result).
 I think the additive part doesn't really care about extra precision.  It
 The multicative part cares since it needs the hi*hi multiplications to be
 exact, so the hi parts must not have extra bits.
 
 Bruce
 
 ----- End forwarded message -----
 
 -- 
 a13x


More information about the freebsd-bugs mailing list