Implementation of half-cycle trignometric functions

Sat Apr 29 00:59:26 UTC 2017

On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
> 
> I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.
> 
> static const float
> s0hi =  3.14062500e+00f,	/* 0x40490000 */
> s0lo =  9.67653585e-04f,	/* 0x3a7daa22 */
> s1   = -5.16771269e+00f,	/* 0xc0a55de7 */
> s2   =  2.55016255e+00f,	/* 0x402335dd */
> s3   = -5.99202096e-01f,	/* 0xbf19654f */
> s4   =  8.10018554e-02f;	/* 0x3da5e44d */
> 
> static inline float
> __kernel_sinpif(float x)
> {
> 	double s;
> 	float p, x2;
> 	x2 = x * x;
> 	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
> 	s = x * (x2 * (double)p + s0hi + s0lo);
> 	return ((float)s);
> }
> 

Well, starting with above and playing the splitting game
with x, x2, and p.  I've manage to reduce the max ULP in
the region testd.  Testing against MPFR with sin(pi*x)
computed in 5*24-bit precision gives

         MAX ULP: 0.73345101
    Total tested: 33554427
0.7 < ULP <= 0.8: 90
0.6 < ULP <= 0.7: 23948

Exhaustive testing with my older sinpi(x) as the reference
gives

./testf -S -m 0x1p-14 -M 0.25 -d -e
         MAX ULP: 0.73345101
    Total tested: 100663296
0.7 < ULP <= 0.8: 45
0.6 < ULP <= 0.7: 11977

The code is slightly slower than my current best kernel.
sinpif time is 0.0147 usec per call (60 cycles).

static inline float
__kernel_sinpif(float x)
{
	float p, phi, x2, x2hi, x2lo, xhi, xlo;
	uint32_t ix;

	x2 = x * x;
	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;

	GET_FLOAT_WORD(ix, p);
	SET_FLOAT_WORD(phi, (ix >> 14) << 14);

	GET_FLOAT_WORD(ix, x2);
	SET_FLOAT_WORD(x2hi, (ix >> 14) << 14);

	x2lo = s0lo + x2 * (p - phi) + (x2 - x2hi) * phi;
	x2hi *= phi;

	GET_FLOAT_WORD(ix, x);
	SET_FLOAT_WORD(xhi, (ix >> 14) << 14);
	xlo = x - xhi;
	xlo = xlo * (x2lo + x2hi) + (xlo * s0hi + xhi * x2lo);

	return (xlo + xhi * x2hi + xhi * s0hi);
}

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow