Implementation of half-cycle trigonometric functions

Bruce Evans brde at optusnet.com.au
Fri Apr 28 18:40:26 UTC 2017


On Fri, 28 Apr 2017, Steve Kargl wrote:

> On Fri, Apr 28, 2017 at 07:13:16PM +1000, Bruce Evans wrote:
>> On Thu, 27 Apr 2017, Steve Kargl wrote:

>>> For those curious about testing, here are some numbers
>>> for the Estrin's method.  These were obtained on an AMD
>>> FX(tm)-8350 @ 4018.34-MHz.  The times are in microseconds
>>> and the numbers in parentheses are estimated cycles.
>>>
>>>            |    cospi     |    sinpi     |    tanpi
>>> ------------+--------------+--------------+--------------
>>> float       | 0.0089 (37)  | 0.0130 (54)  | 0.0194 (80)
>>> double      | 0.0134 (55)  | 0.0160 (66)  | 0.0249 (102)
>>> long double | 0.0557 (228) | 0.0698 (287) | 0.0956 (393)
>>> ------------+--------------+--------------+--------------
>>>
>>> The timing tests are done on the interval [0,0.25] as
>>> this is the interval where the polynomial approximations
>>> apply.  Limited accuracy testing gives
>>
>> These still seem slow.  I did a quick test of naive evaluations of
>> these functions as standard_function(Pi * x) and get times a bit faster
>> on Haswell, except 2-4 times faster for the long double case, with the
>> handicaps of using gcc-3.3.3 and i386.  Of course, the naive evaluation
>> is inaccurate, especially near multiples of Pi/2.
>
> The slowness comes from handling the extra precision arithmetic.  Here
> sinpi(x) = x * p(x), where p(x) = s0 + x2 * S(x2) is the polynomial
> approximation and x2 = x * x.  One needs to take care in the ordering
> of the evaluation, where x and x2 are split into high and low parts.
> See the source code posted in response to your other email.
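
A minimal sketch of the hi/lo splitting being referred to, using the
standard Dekker/Veltkamp technique (this is not the posted kernel; the
names are illustrative, and the error terms are only exact when every
operation really rounds to double, which is the extra-precision problem
discussed below):

static inline void
split(double a, double *hi, double *lo)
{
	/* Veltkamp split with the 2^27+1 multiplier; assumes |a| is not
	   close to overflow. */
	double t = 0x1p27 * a + a;

	*hi = t - (t - a);	/* upper ~26 bits of a */
	*lo = a - *hi;		/* exact low part */
}

/* Dekker's product: returns fl(a*b) and stores the exact rounding error. */
static inline double
two_prod(double a, double b, double *err)
{
	double ahi, alo, bhi, blo, p;

	p = a * b;
	split(a, &ahi, &alo);
	split(b, &bhi, &blo);
	*err = ((ahi * bhi - p) + ahi * blo + alo * bhi) + alo * blo;
	return (p);
}

With exact products like this, x * p(x) can be accumulated as a hi+lo
pair instead of a single rounded double, which is where the ordering of
the evaluation starts to matter.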

See logl() for how to do this almost as well as possible using the 2sum
primitives.  Don't forget to test on i386 with i387, where you will need
lots of slow STRICT_ASSIGN()s or careful use of float_t and double_t,
as in my uncommitted logf() and logl(), or more relevantly clog() (cplex.c),
to avoid problems with extra precision (2sum requires this).
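
For reference, a minimal sketch of the 2sum primitive (Knuth's TwoSum),
with volatile stores standing in for STRICT_ASSIGN(); even so, i386/i387
can still double-round the stored sums, which is why the real code
prefers float_t/double_t and stays in the wider type throughout:

static inline double
two_sum(double a, double b, double *err)
{
	volatile double s, t, u;

	s = a + b;		/* rounded sum; must round to double */
	t = s - a;		/* recovers b's contribution to s */
	u = s - t;		/* recovers a's contribution to s */
	*err = (a - u) + (b - t);	/* exact error of the rounded sum */
	return (s);
}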

Calculating the polynomial as x * p(x) seems wrong.  That is too inaccurate
even for sin(x), and probably slower too.  __kernel_sin(x) uses
x + x2*S(x2), with complications except in the first quadrant needed to
improve accuracy.  The arrangement of the second term is dubious too,
even for sin(x).  It hurts accuracy but is apparently used because
it is faster.  I would try S(x2) * x4 + x2 * (x * S3) + x, where
x2 * (x * S3) + x must all be evaluated in extra precision.
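
A hedged reading of that grouping, written for plain sin(x) on a small
interval with the Taylor coefficients as placeholders for the real
minimax values (the evaluation order, not the constants, is the point;
the leading x^3 coefficient, called S3 above, is S1 here):

static const double
    S1 = -1.0 / 6.0,		/* Taylor coefficients standing in */
    S2 =  1.0 / 120.0,		/* for the minimax ones */
    S3 = -1.0 / 5040.0,
    S4 =  1.0 / 362880.0;

/* Illustrative ordering only; valid for |x| up to about Pi/4. */
static double
sin_order_sketch(double x)
{
	double x2 = x * x, x4 = x2 * x2;
	double tail = S2 + x2 * (S3 + x2 * S4);	/* terms beyond x^3 */

	/* The small tail is added last; x2 * (x * S1) + x is the part
	   that wants extra precision. */
	return (tail * (x4 * x) + (x2 * (x * S1) + x));
}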

>>> x in [0,0.25]   |   tanpif   |   tanpi    |   tanpil
>>> -----------------+------------+------------+-------------
>>>         MAX ULP | 1.37954760 | 1.37300168 | 1.38800823
>>
>> Just use the naive evaluation to get similar errors in this
>> range.  It is probably faster too.
>
> I haven't checked tanpif, but naive evaluation is faster
> for sinpif(x).  But getting the wrong answer fast seems
> to be counter to what one may expect from a math library.
> Consider the two naive implementations:

But the above is inaccurate too.

> float
> bde1(float x)
> {
>   return (sinf((float)M_PI * x));
> }

sinpif() with range reduction in float precision and multiplication
by M_PI and __kernel_sin() to finish off would be much faster and
more accurate than doing everything in float precision.  On i386/i387,
it could be arranged that naive users get M_PI and __kernel_sin()
almost automatically.  This is essentially what is already done for
sinf().  Similarly for sinpi() on arches with extra precision for double.
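
A self-contained sketch of that structure, with plain sin() standing in
for the __kernel_sin() call so it compiles on its own (which costs speed
but not the idea); special cases (NaN, Inf, signed zero, huge arguments)
are ignored:

#include <math.h>

float
sinpif_sketch(float x)
{
	float r = x - 2 * floorf(x / 2);	/* period is exactly 2; reduction is exact in float */

	if (r > 1) {				/* second half of the period: odd symmetry */
		r -= 1;
		return (-(float)sin(M_PI * (r > 0.5F ? 1 - r : r)));
	}
	return ((float)sin(M_PI * (r > 0.5F ? 1 - r : r)));
}

The range reduction and the reflection about 1/2 are exact float
arithmetic; only the multiplication by M_PI and the kernel are done in
double.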

> float
> bde2(float x)
> {
>   return (sinf((float)(M_PI * x)));
> }
>
> float
> bde3(float x)
> {
>   return ((float)sin(M_PI * x));
> }
>
> x in [0,0.25]    |   sinpif    |    bde1(x)  |    bde2(x)   |   bde3(x)
> -----------------+-------------+-------------+--------------+------------
> Time usec (cyc)  | 0.0130 (54) | 0.0084 (35) | 0.0093 (38)  | 0.0109 (45)
>         MAX ULP | 0.80103672  | 1.94017851  | 1.46830785   | 0.5
> 1.0 < ULP        | 0           | 736496      | 47607        | 0
> 0.9 < ULP <= 1.0 | 0           | 563122      | 101162       | 0
> 0.8 < ULP <= 0.9 | 1           | 751605      | 386128       | 0
> 0.7 < ULP <= 0.8 | 727         | 942349      | 753647       | 0
> 0.6 < ULP <= 0.7 | 19268       | 1169719     | 1123518      | 0
>
> x in [3990,3992] |   sinpif    |   bde1(x)   |   bde2(x)   |   bde3(x)
> -----------------+-------------+-------------+-------------+------------
> Time usec (cyc)  | 0.0161 (66) | 0.0137 (56) | 0.0144 (59) | 0.0211 (87)
>         MAX ULP | 0.65193975  | 10458803.   | 6471461.    | 0.5
> 1.0 < ULP        | 0           | 16773117    | 16762878    | 0
> 0.9 < ULP <= 1.0 | 0           | 0           | 0           | 0
> 0.8 < ULP <= 0.9 | 0           | 0           | 0           | 0
> 0.7 < ULP <= 0.8 | 0           | 0           | 0           | 0
> 0.6 < ULP <= 0.7 | 19268       | 2047        | 2047        | 0
>
> So bde3(x) is best, but if you're going to use sin(M_PI*x) for
> float sinpif, then use sinpi((double)x) which is faster than bde3
> and just as accurate.

That is a naive version, with explicit pessimization and destruction
of extra precision by casting to float.

A fast version would do return (__kernel_sindf(M_PI * x)) after
reducing the range of x (switch to +- and cosf for other quadrants).
Perhaps optimize the range reduction for small quadrant numbers, as for
sinf().  It is easier to subtract 1 from x than to subtract Pi/2
to adjust the quadrant, so this optimization might not be so useful.
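
A hedged sketch of that structure (the __kernel_sindf()/__kernel_cosdf()
prototypes are assumed here, normally they come from math_private.h, and
special cases are again ignored):

#include <math.h>

float __kernel_sindf(double x);
float __kernel_cosdf(double x);

float
sinpif_fast_sketch(float x)
{
	double r = x - 2 * floor(x / 2);	/* reduce by the period 2 */
	int n = (int)floor(2 * r + 0.5);	/* nearest quarter period, 0..4 */
	double t = M_PI * (r - 0.5 * n);	/* |t| <= Pi/4, the kernels' range */

	switch (n & 3) {
	case 0:
		return (__kernel_sindf(t));
	case 1:
		return (__kernel_cosdf(t));
	case 2:
		return (-__kernel_sindf(t));
	default:
		return (-__kernel_cosdf(t));
	}
}

The quadrant adjustment really is just arithmetic on x (subtracting
multiples of 1/2), with no Pi/2 approximation involved, which is the
point above.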

Bruce

