Implementation of half-cycle trigonometric functions

Bruce Evans brde at optusnet.com.au
Thu May 18 22:56:41 UTC 2017


On Thu, 18 May 2017, Steve Kargl wrote:

> On Thu, May 18, 2017 at 04:34:57PM +1000, Bruce Evans wrote:
>> On Wed, 17 May 2017, Steve Kargl wrote:
>>
>>> On Thu, May 18, 2017 at 09:25:58AM +1000, Bruce Evans wrote:
>>>> On Wed, 17 May 2017, Steve Kargl wrote:
>>> ...
>>>>> As such, I've added a comment
>>>>>
>>>>> /*
>>>>> * pi_hi contains the leading 56 bits of a 169 bit approximation for pi.
>>>>> */
>>>>
>>>> Why 56?  53 + 56 = 109.
>>>
>>> This is ld128.
>>
>> ld128 has precision 113.
>
> Yes. I know.
>
>>> static const long double
>>> pi_hi = 3.14159265358979322702026593105983920e+00L,
>>> pi_lo = 1.14423774522196636802434264184180742e-17L;
>>>
>>> pi_hi has 113/2 = 56 bits.
>>> pi_lo has 113 bits.
>>>
>>> 56 + 113 = 169
>>
>> But hi is set intentionally sloppily by converting to double, so it has
>> 53 bits instead of the 57 that 56 would go with (57 = 113/2 rounded up).
>
> A 53-bit entity times a 56-bit entity needs 109-bits for the result.
> A 113-bit entity can hold a 109-bit entity.  Or am I missing something?

It wastes 4 bits.  This reader has to guess if this is intentional for
magic elsewhere, or just sloppy.

Also, the rounded value might naturally have some extra low zero bits.
So looking at the bits, it is not easy to see if there are the right
number.  This is not worth a comment.  The reader should trust that
the right (maximal) number of bits are used, and only check it
approximately.

The double precision pi_hi and pi_lo also seemed to be sloppily
rounded.  My version has less precision (28 low zero bits in pi_hi
instead of the minimal 24), while yours has 27 low zero bits in
pi_hi.  The explanation for this is something like:
- I copied pi_hi from another file with slightly different requirements
- the choice makes my pi_lo positive but yours negative.  The other file
   might have needed pi_lo positive, or just extra zero bits in pi_hi.
   Most likely just the latter.
- your 27 is apparently from the calculation 53 - 53 / 2 for almost
   even splitting of x, but the splitting of x is uneven (24 + 29).

>>> I'll need to wait until the weekend to study your improved sin_pi.
>> ..
>> However, both versions fail tests.  Mine has a max error of 2048 ulps
>> and an average error of 928 ulps and yours has a maximum error of many
>> giga-ulps for the sign bit wrong and an average error of 561 ulps.  The
>> enormous errors are probably for denormals or NaNs.

These were mostly from incomplete setup of the tests.  After fixing this, many
real bugs were found:
- 2 or 3 wrong magic numbers in your version (different ones for sinpi()
   and sinpil())
- wrong quadrant in my version
- incompatible signs in my version (sign errors are reported as value
   errors in the exa-ulp range, so my tests can't pass with even 1)
- the non-use of double_t breaks sinpi() completely, and gives errors
   of up to about 1.4 ulps (slightly worse than a simple multiplication
   by M_PI).  Using double_t is no better, since the kernels don't support
   it.  hi+lo ends up with extra precision in hi either way, and this
   gets destroyed on calling the kernel.  The tests find this by switching
   the mode to extra precision to actually test it.  The bug would be
   more obvious using the same algorithm in float precision, since then
   the default precision would be extra.

   This can be fixed using STRICT_ASSIGN() or volatile hacks in _2sumF(),
   but that would turn _2sumF() back into one of its slow development
   versions.  Options for controlling this were intentionally left out
   to get an easy to use API for _2sumF().

   The quick fix of declaring hi and lo volatile seems to work, but is
   probably slower than doing it in _2sumF() and is certainly worse
   without extra precision.  Only 1 more strict assignment is needed.

   The quick fix of compiling with -ffloat-store works.

> You're correct that I did not give much consideration to subnormal
> numbers.  For float, this can be fixed by
>
> 	 		if (ix < 0x38800000) {	/* |x| < 0x1p-14 */
> 				if (x == 0)
> 					return (x);
> 				SET_FLOAT_WORD(hi, hx & 0xffff0000);
> 				if (ix <= 0x0b7f0001) { /* Avoid spurious underflow. */
> 					hi *= 0x1p23F;
> 					lo = x * 0x1p23F - hi;
> 					s = (pi_lo + pi_hi) * lo + pi_lo * hi +
> 					    pi_hi * hi;
> 					return (s * 0x1p-23F);
> 				}
> 				lo = x - hi;
> 				s = (pi_lo + pi_hi) * lo + pi_lo * hi +
> 				    pi_hi * hi;
> 				return (s);
> 			}

Far too verbose.  The unfixed version was bad enough.

> or if you want to avoid the extra if-statement.  Simply scale
>
> 	 		if (ix < 0x38800000) {	/* |x| < 0x1p-14 */
> 				if (x == 0)
> 					return (x);
> 				SET_FLOAT_WORD(hi, hx & 0xffff0000);
> 				hi *= 0x1p23F;
> 				lo = x * 0x1p23F - hi;
> 				s = (pi_lo + pi_hi) * lo + pi_lo * hi +
> 				    pi_hi * hi;
> 				return (s * 0x1p-23F);
> 			}

Better, but this is bogus for float.  Just multiply by M_PI in double
precision, as is done everywhere else in the function.

My version for double precision moves this multiplication to a kernel
to defend against n-tuplication, but botches the scaling by splitting
x before scaling it:

XX static inline double
XX __kernel_mpi(double x)
XX {
XX 	double hi, lo;
XX 
XX 	hi = (float)x;
XX 	lo = x - hi;
XX 	return (0x1p1000 * pi_lo * x + 0x1p1000 * pi_hi * lo +
XX 	    0x1p1000 * pi_hi * hi) * 0x1p-1000;
XX }
XX ...
XX 	if (ix < 0x3fd00000) {		/* |x| < 0.25 */
XX 	    if (ix < 0x3e200000) {	/* |x| < 2**-29 */
XX 		if (x == 0)
XX 		    return x;
XX 		if ((int)x == 0)	/* generate inexact if x != 0 */
XX 		    return __kernel_mpi(x);
XX 	    }
XX 	    return __kernel_sinpi(x);
XX 	}

The hi+lo decomposition can also be done in a kernel.  This requires
returning hi and lo indirectly, which is fast enough with inlining to
make the indirections not actually indirect.  My exp kernels return
hi, lo, and also (approx log2(x)) indirectly, where hi and lo are
closer to the final result.  Other exp kernels re-assemble with several
variations of (hi + lo) * 2**k and return that directly.

My sinpif() now passes all tests and gives the same results as yours,
except possibly near 0.

The maximum error in float precision is about 0.54 ulps.  This is much
larger than 0.5009 ulps for sin() and cos().  I have no explanation.
The approx. to pi should give only 12 bits of inaccuracy, so the error
shouldn't be much more than 1/1024 more.

XX float
XX sinpif(float x)
XX {
XX 	float s,y,z;
XX 	int32_t hx,ix;
XX 	int n;
XX 
XX 	GET_FLOAT_WORD(hx,x);
XX 	ix = hx & 0x7fffffff;
XX 
XX 	if (ix < 0x3e800000) {		/* |x| < 0.25 */
XX 	    if (ix < 0x38800000)	/* |x| < 2**-14 */
XX 		if(((int)x)==0) return M_PI*x; /* pi*x with inexact if x != 0 */
XX 	    return __kernel_sindf(M_PI*x);
XX 	}

We multiply by M_PI for the usual case, so it is silly not to multiply by
it when x is small.

XX 
XX 	if (ix>=0x7f800000) return x-x;	/* Inf or NaN -> NaN */
XX 
XX 	if (ix >= 0x4b800000) return (x < 0 ? -0.0F : 0); /* even int -> +-0 */

You said that n1950.pdf documents returning +0 at even integers and -0 at
odd integers.  I couldn't find n1950.pdf, and your version doesn't do
anything like that.  It just returns +0 for all nonnegative integers and
-0 for all negative integers, as actually makes sense.

The threshold in the above is to classify even integers so as to possibly
return +0 for them and -0 as part of the general classification below.  It
can be reduced to possibly give special cases for odd and even large
integers above, as in your versions.

XX 
XX 	y = fabs(x);
XX 
XX 	STRICT_ASSIGN(float,z,y+0x1p21F);
XX 	GET_FLOAT_WORD(n,z);		/* bits for rounded y (units 0.25) */
XX 	z -= 0x1p21F;			/* y rounded to a multiple of 0.25 */
XX 	if (z > y) {
XX 	    z -= 0.25F;			/* adjust to round down */
XX 	    n--;
XX 	}
XX 	n &= 7;				/* octant of y mod 2 */
XX 	y -= z;				/* y mod 0.25 */
XX 
XX 	if (n >= 3 && n < 7)
XX 	    s = -1;
XX 	else
XX 	    s = 1;
XX 	if (x < 0)
XX 	    s = -s;
XX 	n &= 3;
XX 	if (y == 0 && n == 0)
XX 	    return (x < 0 ? -0.0F : 0);

I don't like the complications for signs, but they seem to execute fast
enough using direct logic.  On modern OOE x86, 's' can be calculated in
parallel, so it only costs about 4 cycles of latency to assemble with it
at the end, and my tests de-emphasize latency a bit too much.

XX 	if (n == 0)
XX 	    return s*__kernel_sindf(M_PI*y);
XX 	else if (n == 1)
XX 	    return s*__kernel_cosdf(M_PI*(y-0.25F));
XX 	else if (n == 2)
XX 	    return s*__kernel_cosdf(M_PI*y);
XX 	else
XX 	    return s*__kernel_sindf(M_PI*(y-0.25F));
XX }

Bruce

