Implementation of half-cycle trignometric functions

Bruce Evans brde at optusnet.com.au
Fri Apr 28 22:09:46 UTC 2017

On Fri, 28 Apr 2017, Steve Kargl wrote:

> On Sat, Apr 29, 2017 at 04:40:14AM +1000, Bruce Evans wrote:
>> On Fri, 28 Apr 2017, Steve Kargl wrote:
>>> ...
>>> The slowness comes from handling extra precision arithmetic.  For
>>> sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial
>>> approximation and x2 = x * x.  One needs to take care in the ordering
>>> on the evaluation where x and x2 are split into high and low parts.
>>> See the source code posted in response to your other email.
>> ...
>> Calculating the polynomial as x * p(x) seems wrong.
>
> It starts life as the Remes approximation for p(x) = sin(pi*x) / x.
> Yes, I also looked into using p(x) = sin(pi*x) / (pi*x).  At some
> point, one must do sinpi(x) = sin(pi*x) = x * p(x).

I forgot that the linear term needs to be multiplied by Pi.

Actually, the problem is only in the notation and the above description.
s0 looks like a constant, but it is actually Pi*x in extra precision,
and no extra precision is needed or done for x2.

__kernel_cospi() seems to be buggy.  It splits x2, but doesn't use
extra precision for calculating x2.  Is extra precision really needed
for the x2 term?  It is like the x3 term for sin*(), with the critical
difference that the coeffient is much larger (-1/2 instead of -1/6
before multiplication by pi).  This requires extra accuracy for cos()
too, but not for the calculating the term.  I optimized this by changing
a slow fdlibm method to essentially 2sum.  The x2/2 term is inaccurate
but no significant further inaccuracies occur for adding terms.  You
have Pi*Pi*x2/2 which is more inaccurate.  I guess the extra half an
ulp accuracy here is too much.  You avoid this half an ulp using extra
precision, and seem to use lots of extra precision instead of the final
2sum step.  Probably much slower.

Here is my modified method:

x2 = x*x;	/* hopefully really don't need extra prec */
(hi, lo) = split(x2);
resid = S(x2);
hi = C2_hi * hi;
lo = C2_lo * hi + C1_hi * lo + C1_lo * lo;
lo = lo + resid;
#ifdef hopefully_not_needed
Re-split hi+lo if lo is too large.
#endif
return 3sum(1, hi, lo)

3sum is in math_private.h and is supposed to be used instead of magic-
looking open code.  Here it would reduce to the same expressions as in
__kernel_sin().

1+hi = hi_1 + lo_1;	/* by hi-lo decomp. */
return (lo + lo_1 + hi_1);	/* safely add the lo's and "carry" */

> One could do, as is done with sin(x),
>
> f(x) = (sin(pi*x) - pi*x)/(pi*x)**3
>
> then sinpi(x) = pi*x + (pi*x)**3 * f(x), but this then adds
> the complication that pi*x must be computed in extra precision.

Isn't that what is done now?  Of course pi*x must be evaluated in
extra precision, but hopefully no more is needed.  The x3 term
would be much harder to do all in extra precision than the x2
term for cospi().

>> That is too inaccurate even for sin(x), and probably slower too.
>
> I'm not sure why you're claiming that it is too inaccurate.

Writing sin(x) as x * (1 + C2*x*x + ...) without extra precision.
This gives at best an error of half an ulp for the part in parentheses
and another half an ulp for the multiplication.  The final step
must be an addition of a large term with a term at least 2 times
smaller so that the error can be controlled (this depends on 1
of the terms being exact).

> ...
>>> float
>>> bde1(float x)
>>> {
>>>   return (sinf((float)M_PI * x));
>>> }
>>
>> sinpif() with range reduction in float precision and multiplication
>> by M_PI and __kernel_sin() to finish off would be much faster and
>> more accurate than doing everything in float precision.  On i386/i387,
>
> Multiplication by M_PI and __kernel_sin() are double precision
> intermediates.  That works for float.

That's what I mean by naive code having a chance of working.  It
takes foot-shooting to reduce M_PI to float.  Unfortunately, the
API + ABI does reduce (M_PI * x) to float in sinf(M_PI * x) if
sinf() is extern.  But if sinf() is in hardware, the compiler
might not destroy the extra precision before or after using the
hardware.

> What are you going to
> do for double and long double?  One of the benefits of writing
> sinpif(x) in float precision is that the exacts same algorithm is
> used in sinpi(x) and sinpil(x).  One of these can be exhaustively
> tested.

That is about the only benefit of the float functions, so I resist
optimizing them using doubles.  It would be better to use extra
precision automatically, but not force it.

Bruce