Implementation of half-cycle trignometric functions

Steve Kargl sgk at troutmask.apl.washington.edu
Fri Apr 28 16:56:59 UTC 2017


On Fri, Apr 28, 2017 at 07:13:16PM +1000, Bruce Evans wrote:
> On Thu, 27 Apr 2017, Steve Kargl wrote:
> 
> > On Thu, Apr 27, 2017 at 04:14:11PM -0700, Steve Kargl wrote:
> >>
> >> I have attached a new diff to the bugzilla report.  The
> >> diff is 3090 lines and won't be broadcast the mailing list.
> >>
> >> This diff includes fixes for a few inconsequential bugs
> >> and implements modified Estrin's method for sum a few
> >> ploynomials.  If you want the previous Horner's method
> >> then add -DHORNER to your CFLAGS.
> >
> > For those curious about testing, here are some numbers
> > for the Estrin's method.  These were obtained on an AMD
> > FX(tm)-8350 @ 4018.34-MHz.  The times are in microseconds
> > and the number in parentheses are estimated cycles.
> >
> >            |    cospi     |    sinpi     |    tanpi
> > ------------+--------------+--------------+--------------
> > float       | 0.0089 (37)  | 0.0130 (54)  | 0.0194 (80)
> > double      | 0.0134 (55)  | 0.0160 (66)  | 0.0249 (102)
> > long double | 0.0557 (228) | 0.0698 (287) | 0.0956 (393)
> > ------------+--------------+--------------+--------------
> >
> > The timing tests are done on the interval [0,0.25] as
> > this is the interval where the polynomial approximations
> > apply.  Limited accuracy testing gives
> 
> These still seem slow.  I did a quick test of naive evaluations of
> these functions as standard_function(Pi * x) and get times a bit faster
> on Haswell, except 2-4 times faster for the long double case, with the
> handicaps of using gcc-3.3.3 and i386.  Of course, the naive evaluation
> is inaccurate, especially near multiples of Pi/2.

The slowness comes from handling extra precision arithmetic.  For
sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial
approximation and x2 = x * x.  One needs to take care in the ordering
on the evaluation where x and x2 are split into high and low parts.
See the source code posted in response to your other email.  

> > x in [0,0.25]   |   tanpif   |   tanpi    |   tanpil
> > -----------------+------------+------------+-------------
> >         MAX ULP | 1.37954760 | 1.37300168 | 1.38800823
> 
> Just use the naive evaluation to get similar errors in this
> range.  It is probably faster too.

I haven't checked tanpif, but naive evaluation is faster
for sinpif(x).  But getting the worry answer fast seems
to be counter to what one may expect from a math library.
Consider the two naive implementations:

float
bde1(float x)
{
   return (sinf((float)M_PI * x));
}

float
bde2(float x)
{
   return (sinf((float)(M_PI * x)));
}

float
bde3(float x)
{
   return ((float)sin(M_PI * x));
}

x in [0,0.25]    |   sinpif    |    bde1(x)  |    bde2(x)   |   bde3(x)
-----------------+-------------+-------------+--------------+------------
Time usec (cyc)  | 0.0130 (54) | 0.0084 (35) | 0.0093 (38)  | 0.0109 (45)
         MAX ULP | 0.80103672  | 1.94017851  | 1.46830785   | 0.5
1.0 < ULP        | 0           | 736496      | 47607        | 0
0.9 < ULP <= 1.0 | 0           | 563122      | 101162       | 0
0.8 < ULP <= 0.9 | 1           | 751605      | 386128       | 0
0.7 < ULP <= 0.8 | 727         | 942349      | 753647       | 0
0.6 < ULP <= 0.7 | 19268       | 1169719     | 1123518      | 0

x in [3990,3992] |   sinpif    |   bde1(x)   |   bde2(x)   |   bde3(x)
-----------------+-------------+-------------+-------------+------------
Time usec (cyc)  | 0.0161 (66) | 0.0137 (56) | 0.0144 (59) | 0.0211 (87)
         MAX ULP | 0.65193975  | 10458803.   | 6471461.    | 0.5
1.0 < ULP        | 0           | 16773117    | 16762878    | 0
0.9 < ULP <= 1.0 | 0           | 0           | 0           | 0
0.8 < ULP <= 0.9 | 0           | 0           | 0           | 0
0.7 < ULP <= 0.8 | 0           | 0           | 0           | 0
0.6 < ULP <= 0.7 | 19268       | 2047        | 2047        | 0

So bde3(x) is best, but if you're going to use sin(M_PI*x) for
float sinpif, then use sinpi((double)x) which is faster than bde3
and just as accurate.

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow


More information about the freebsd-numerics mailing list