Implementation of half-cycle trigonometric functions
sgk at troutmask.apl.washington.edu
Fri Apr 28 23:36:00 UTC 2017
On Sat, Apr 29, 2017 at 08:09:39AM +1000, Bruce Evans wrote:
> On Fri, 28 Apr 2017, Steve Kargl wrote:
> > On Sat, Apr 29, 2017 at 04:40:14AM +1000, Bruce Evans wrote:
> >> On Fri, 28 Apr 2017, Steve Kargl wrote:
> >>> ...
> >>> The slowness comes from handling extra precision arithmetic. For
> >>> sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial
> >>> approximation and x2 = x * x. One needs to take care in the ordering
> >>> on the evaluation where x and x2 are split into high and low parts.
> >>> See the source code posted in response to your other email.
> >> ...
> >> Calculating the polynomial as x * p(x) seems wrong.
> > It starts life as the Remes approximation for p(x) = sin(pi*x) / x.
> > Yes, I also looked into using p(x) = sin(pi*x) / (pi*x). At some
> > point, one must do sinpi(x) = sin(pi*x) = x * p(x).
> I forgot that the linear term needs to be multiplied by Pi.
> Actually, the problem is only in the notation and the above description.
> s0 looks like a constant, but it is actually Pi*x in extra precision,
> and no extra precision is needed or done for x2.
I was just backtracking with __kernel_sinpi. This gets a max ULP < 0.61.
static const float
s0hi =  3.14062500e+00f,	/* 0x40490000 */
s0lo =  9.67653585e-04f,	/* 0x3a7daa22 */
s1   = -5.16771269e+00f,	/* 0xc0a55de7 */
s2   =  2.55016255e+00f,	/* 0x402335dd */
s3   = -5.99202096e-01f,	/* 0xbf19654f */
s4   =  8.10018554e-02f;	/* 0x3da5e44d */

static inline float
__kernel_sinpif(float x)
{
	double s;
	float p, x2;

	x2 = x * x;
	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
	s = x * (x2 * (double)p + s0hi + s0lo);
	return ((float)s);
}
x2 and p are computed in float, but the final result is accumulated
in double. Using higher precision works for float, but for double
and long double it becomes problematic.
> __kernel_cospi() seems to be buggy. It splits x2, but doesn't use
> extra precision for calculating x2. Is extra precision really needed
> for the x2 term?
Yes. The last part of __kernel_cospif() is a fused multiply-add.
For small |x2|, we can do a normal summation. That's this
block of code:
if (ix < 0x3c800000) /* |x2| < 0x1p-6 */
return (c0 + c * x2);
If |x2| > 0x1p-6, then c0 + c * x2 is computed with an FMA, but
it's a little trickier. This splits c into chi and clo, and adds
in a correction for c1:
SET_FLOAT_WORD(chi, (ix >> 14) << 14);
clo = c - chi + c1lo;
In the final expression,
c = clo * x2lo + chi * x2lo + clo * x2hi + (chi * x2hi + c0);
chi * x2hi is exact, and the parentheses are required to achieve
max ULP < ~0.88. The grouping of the other terms does not seem
to matter.
> It is like the x3 term for sin*(), with the critical
> difference that the coefficient is much larger (-1/2 instead of -1/6
> before multiplication by pi). This requires extra accuracy for cos()
> too, but not for calculating the term. I optimized this by changing
> a slow fdlibm method to essentially 2sum. The x2/2 term is inaccurate
> but no significant further inaccuracies occur for adding terms. You
> have Pi*Pi*x2/2 which is more inaccurate. I guess the extra half an
> ulp accuracy here is too much. You avoid this half an ulp using extra
> precision, and seem to use lots of extra precision instead of the final
> 2sum step. Probably much slower.
> Here is my modified method:
> x2 = x*x; /* hopefully really don't need extra prec */
> (hi, lo) = split(x2);
> resid = S(x2);
> hi = C2_hi * hi;
> lo = C2_lo * hi + C1_hi * lo + C1_lo * lo;
> lo = lo + resid;
> #ifdef hopefully_not_needed
> Re-split hi+lo if lo is too large.
> #endif
> return 3sum(1, hi, lo)
> 3sum is in math_private.h and is supposed to be used instead of magic-
> looking open code.
I completely missed 3sum in math_private.h. I did try using your
2sum and 2sumF macros, but did not recall seeing any difference
from my hand-rolled code. I probably didn't spend enough time
looking into if I was using 2sum[F] properly. I'll see if I can
adapt my code to use these macros.
> Here it would reduce to the same expressions as in
> 1+hi = hi_1 + lo_1; /* by hi-lo decomp. */
> return (lo + lo_1 + hi_1); /* safely add the lo's and "carry" */
> > One could do, as is done with sin(x),
> > f(x) = (sin(pi*x) - pi*x)/(pi*x)**3
> > then sinpi(x) = pi*x + (pi*x)**3 * f(x), but this then adds
> > the complication that pi*x must be computed in extra precision.
> Isn't that what is done now? Of course pi*x must be evaluated in
> extra precision, but hopefully no more is needed. The x3 term
> would be much harder to do all in extra precision than the x2
> term for cospi().
Well, yes, I suppose that's one way to look at it. In the approximation
of sinpif(x) = x * (s0 + x2 * S(x2)), the s0 term is nearly pi. The
s1 coefficient, as determined by Remes, is s1 = -5.16771269, which
happens to be close to -pi**3/6 = -5.16771278 (i.e., the next term in
the Maclaurin series of sin(pi*x) = pi*x - (pi**3/6)*x**3 + ...). So, the
Remes coefficients have absorbed the factors of pi.
> >> That is too inaccurate even for sin(x), and probably slower too.
> > I'm not sure why you're claiming that it is too inaccurate.
> Writing sin(x) as x * (1 + C2*x*x + ...) without extra precision.
> This gives at best an error of half an ulp for the part in parentheses
> and another half an ulp for the multiplication. The final step
> must be an addition of a large term with a term at least 2 times
> smaller so that the error can be controlled (this depends on 1
> of the terms being exact).
Ah, okay, I understand what you're getting at. I played a
number of games with grouping intermediate results and trying
to accumulate high and low parts. You're seeing the best
that I can do.