i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs

Sun Feb 6 04:30:23 PST 2005

The following reply was made to PR i386/67469; it has been noted by GNATS.

From: Bruce Evans <bde at zeta.org.au>
To: David Schultz <das at freebsd.org>
Cc: FreeBSD-gnats-submit at freebsd.org, freebsd-i386 at freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
Date: Sun, 6 Feb 2005 23:27:56 +1100 (EST)

 On Sun, 6 Feb 2005, David Schultz wrote:

 > On Sun, Feb 06, 2005, Bruce Evans wrote:
 > ...
 > > BTW, I don't trust the range reduction for floating point pi/2 but
 > > never finished debugging or fixing it.  According to my old comment,
 > > it is off by pi/2 and very slow for many values between 32768 and
 > > 65535.  However, I couldn't duplicate this misbehaviour last time I
 > > looked at it.  I used to fix it using the double version.
 >
 > Yeah, there seem to be many values for which __ieee754_rem_pio2f()
 > returns the wrong quotient:
 >
 > -0x1.8009c6p+8       (input)
 >         d = -244     (return value of __ieee754_rem_pio2())
 >         f = -4       (return value of __ieee754_rem_pio2f())
 > ...
 > It also gets the remainder completely wrong sometimes:
 >
 > 0x1.389ee4p+87
 >         rd0 = -0x1.6ad286p-4       (y[0] from rem_pio2(), rounded to float)
 >         rf0 = 0x1.7b728cp+0        (y[0] from rem_pio2f())
 > ...

 I didn't check these.  When I tried to debug this, I got confused by
 y[0] quite often legitimately differing, because the result is in y[1]
 and the integer result too and there are many equivalent combinations.

 I also started checking the float trig functions on all 2^32 possible
 args, but got discouraged by too many differences of more than 1 ulp.

 > Also, rem_pio2f() is probably not much more efficient than
 > rem_pio2().  The former would be better if it used a single double
 > instead of two floats for increased accuracy.  (The double version
 > has two use two doubles for accuracy because there's no ``double
 > double'' type.)

 As you know, this is related to the general uselessness of the float
 interfaces on machines with doubles.  But SSE1 makes this more interesting
 even on i386's.  Floats might be faster despite the existence of doubles
 because a different ALU can be used for them.

 > Results for MI tan() differ from MI tanf() by >2 ulp for many
 > inputs, including:
 >
 > 	0x1.00008ep+15
 > 	0x1.0000ap+15
 > 	0x1.0000a4p+15
 > 	0x1.0000aep+15
 > 	[...]
 >
 > Perhaps this is due to the problem with rem_pio2f().

 It does.  Reverting to my version of rem_pio2f() that just calls
 rem_pio2() fixes the MI tanf() on all these values.  Here is the patch.
 The changes in the "#if 0" part seemed to help for some cases but they
 don't help much for the above values.

 %%%
 Index: e_rem_pio2f.c
 ===================================================================
 RCS file: /home/ncvs/src/lib/msun/src/e_rem_pio2f.c,v
 retrieving revision 1.7
 diff -u -2 -r1.7 e_rem_pio2f.c
 --- e_rem_pio2f.c	28 May 2002 17:51:46 -0000	1.7
 +++ e_rem_pio2f.c	6 Feb 2005 11:53:53 -0000
 @@ -58,10 +58,10 @@
     single precision and the last 8 bits are forced to 0.  */
  static const int32_t npio2_hw[] = {
 -0x3fc90f00, 0x40490f00, 0x4096cb00, 0x40c90f00, 0x40fb5300, 0x4116cb00,
 -0x412fed00, 0x41490f00, 0x41623100, 0x417b5300, 0x418a3a00, 0x4196cb00,
 -0x41a35c00, 0x41afed00, 0x41bc7e00, 0x41c90f00, 0x41d5a000, 0x41e23100,
 -0x41eec200, 0x41fb5300, 0x4203f200, 0x420a3a00, 0x42108300, 0x4216cb00,
 -0x421d1400, 0x42235c00, 0x4229a500, 0x422fed00, 0x42363600, 0x423c7e00,
 -0x4242c700, 0x42490f00
 +0x3fc90f80, 0x40490f80, 0x4096cb80, 0x40c90f80, 0x40fb5380, 0x4116cb80,
 +0x412fed80, 0x41490f80, 0x41623180, 0x417b5380, 0x418a3a80, 0x4196cb80,
 +0x41a35c80, 0x41afed80, 0x41bc7e80, 0x41c90f80, 0x41d5a080, 0x41e23180,
 +0x41eec280, 0x41fb5380, 0x4203f200, 0x420a3a80, 0x42108300, 0x4216cb80,
 +0x421d1400, 0x42235c80, 0x4229a500, 0x422fed80, 0x42363600, 0x423c7e80,
 +0x4242c700, 0x42490f80,
  };

 @@ -88,6 +88,8 @@
  pio2_3t =  6.1232342629e-17; /* 0x248d3132 */

 -	int32_t __ieee754_rem_pio2f(float x, float *y)
 +int32_t
 +__ieee754_rem_pio2f(float x, float *y)
  {
 +#if 0
  	float z,w,t,r,fn;
  	float tx[3];
 @@ -129,5 +131,5 @@
  	    r  = t-fn*pio2_1;
  	    w  = fn*pio2_1t;	/* 1st round good to 40 bit */
 -	    if(n<32&&(ix&0xffffff00)!=npio2_hw[n-1]) {
 +	    if(n<32&&(ix&0xffffff80)!=npio2_hw[n-1]) {
  		y[0] = r-w;	/* quick check no cancellation */
  	    } else {
 @@ -177,3 +179,19 @@
  	if(hx<0) {y[0] = -y[0]; y[1] = -y[1]; return -n;}
  	return n;
 +#else
 +	/*
 +	 * The above is broken for many values of x between 32768
 +	 * and 65536-epsilon.  It is wrong by +pi/2 at best.  It
 +	 * is also very slow for these values.  Just use the double
 +	 * precision version.  This only works on machines with
 +	 * double precision of course.
 +	 */
 +	double new_y[2];
 +	int n;
 +
 +	n = __ieee754_rem_pio2((double)x, new_y);
 +	y[0] = new_y[0];
 +	y[1] = (new_y[0] - y[0]) + new_y[1];
 +	return n;
 +#endif
  }
 %%%

 The comment about off by pi/2 errors is probably wrong.  I now think the
 problem is that small errors errors lead to n being off by 1.  Then y[0]
 is off by pi/2 to sort of compensate.

 Bruce