Implementation of half-cycle trignometric functions

Fri Apr 28 20:15:24 UTC 2017

On Sat, Apr 29, 2017 at 04:40:14AM +1000, Bruce Evans wrote:
> On Fri, 28 Apr 2017, Steve Kargl wrote:
> 
> > On Fri, Apr 28, 2017 at 07:13:16PM +1000, Bruce Evans wrote:
> >> On Thu, 27 Apr 2017, Steve Kargl wrote:
> >>> For those curious about testing, here are some numbers
> >>> for the Estrin's method.  These were obtained on an AMD
> >>> FX(tm)-8350 @ 4018.34-MHz.  The times are in microseconds
> >>> and the number in parentheses are estimated cycles.
> >>>
> >>>            |    cospi     |    sinpi     |    tanpi
> >>> ------------+--------------+--------------+--------------
> >>> float       | 0.0089 (37)  | 0.0130 (54)  | 0.0194 (80)
> >>> double      | 0.0134 (55)  | 0.0160 (66)  | 0.0249 (102)
> >>> long double | 0.0557 (228) | 0.0698 (287) | 0.0956 (393)
> >>> ------------+--------------+--------------+--------------
> >>>
> >>> The timing tests are done on the interval [0,0.25] as
> >>> this is the interval where the polynomial approximations
> >>> apply.  Limited accuracy testing gives
> >>
> >> These still seem slow.  I did a quick test of naive evaluations of
> >> these functions as standard_function(Pi * x) and get times a bit faster
> >> on Haswell, except 2-4 times faster for the long double case, with the
> >> handicaps of using gcc-3.3.3 and i386.  Of course, the naive evaluation
> >> is inaccurate, especially near multiples of Pi/2.
> >
> > The slowness comes from handling extra precision arithmetic.  For
> > sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial
> > approximation and x2 = x * x.  One needs to take care in the ordering
> > on the evaluation where x and x2 are split into high and low parts.
> > See the source code posted in response to your other email.
> 
> See logl() for how to do this almost as well as possible using the 2sum
> primitives.  Don't forget to test on i386 with i387, where you will need
> lots of slow STRICT_ASSIGN()s or careful use of float_t and double_t
> as in my uncomitted logf() and logl()i or more relevantly clog() (cplex.c)
> to avoid problems with extra precision (2sum requires this).
> 
> Calculating the polynomial as x * p(x) seems wrong.

It starts life as the Remes approximation for p(x) = sin(pi*x) / x.
Yes, I also looked into using p(x) = sin(pi*x) / (pi*x).  At some
point, one must do sinpi(x) = sin(pi*x) = x * p(x).

One could do, as is done with sin(x), 

f(x) = (sin(pi*x) - pi*x)/(pi*x)**3

then sinpi(x) = pi*x + (pi*x)**3 * f(x), but this then adds
the complication that pi*x must be computed in extra precision.
For float, with the algorithm I have developed, one can simply
do everything in double and then cast the result to float.
In fact, my __kernel_sinpif was initially written in double
with a max ULP less than 0.55.  I then looked at what portions
of the polynomial could be computed in float without reducing
the 0.55.  This then reveals what needs to be handle specially.

But, what is one to do with double and long double functions?

> That is too inaccurate even for sin(x), and probably slower too.

I'm not sure why you're claiming that it is too inaccurate.
Let's look at sinpi(x) and sin(x) on (roughly) the same
interval to remove the different precisions involved in 
sinpif(x) and sinf(x).

This is my sinpi(x)
troutmask:kargl[221] ./testd -S -m 0.00 -M 0.25 -n 24
         MAX ULP: 0.79851085
    Total tested: 16777214
0.7 < ULP <= 0.8: 1075
0.6 < ULP <= 0.7: 22722

This is libm sin(x)
troutmask:kargl[224] ./testd -S -m 0.00 -M 0.7853981634 -n 24
         MAX ULP: 0.72922199
    Total tested: 16777214
0.7 < ULP <= 0.8: 36
0.6 < ULP <= 0.7: 12193

The numbers above really aren't too different, and someone
that actually understands numerical analysis might be able
to take my code and (marginally) improve the numbers.

> > float
> > bde1(float x)
> > {
> >   return (sinf((float)M_PI * x));
> > }
> 
> sinpif() with range reduction in float precision and multiplication
> by M_PI and __kernel_sin() to finish off would be much faster and
> more accurate than doing everything in float precision.  On i386/i387,

Multiplication by M_PI and __kernel_sin() are double precision
intermediates.  That works for float.  What are you going to
do for double and long double?  One of the benefits of writing
sinpif(x) in float precision is that the exacts same algorithm is
used in sinpi(x) and sinpil(x).  One of these can be exhaustively
tested.

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow