Implementation of half-cycle trignometric functions

Sat May 13 22:30:39 UTC 2017

On Sat, 13 May 2017, Steve Kargl wrote:

> On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote:
>> ...
>> Using the standard kernels is easy and works well:
>
> Maybe works well.  See below.
>> ...
>> Anywyay, it looks like the cost of using the kernel is at most 8-9
>> in the parallel case and at most 30 in the serial case.  The extra-
>> precision code has about 10 dependent instructions, so it s is
>> doing OK to take 30.

Probably a few more than 8.

I got nowhere using inline versions for double precision.  Apparently
the only large win for inlining is when it avoids repeating the
classification, as it does for e_rem_pio2*.  The kernels don't repeat
anything, so the only cost a function call, plus a few cycles for
testing iy for __kernel_sin() only.

> Based on other replies in this email exchange, I have gone back
> and looked at improvements to my __kernel_{cos|sin|tan}pi[fl]
> routines.  The improvements where for both accuracy and speed.

I really don't want another set of kernels (or more sets for degrees
instead of radians, and sincos).  Improvements to the existing kernels
are welcome, but difficult except for long double precision.  I got
nowhere tweaking the polynominal in __kernel_sin().  Every change that
that I tried just moved the tradeoff between accuracy and efficiency.
The one best for efficiency is only about 4 cycles faster, and increases
the error by 0.1 to 0.2 ulps.  This change involves adding up the terms
in a different order.

> I have tested on i686 and x86_64 systems with libm built with
> -O2 -march=native -mtune=native.  My timing loop is of the
> form
>
>        float dx, f, x;
>        long i, k;
>
>        f = 0;
>        k = 1 << 23;
>        dx = (xmax - xmin) / (k - 1);
>        time_start();
>        for (i = 0; i < k; i++) {
>                x = xmin + i * dx;

This asks for a conversions from long to double which tends to be slow, and
a multiplication in the inner loop.  The compiler shouldn't optimize it to
x += dx since this has different inaccuracy.

My test loop does x += dx with FP an test that x < limit.  This sometimes
has problems when dx is so small that x + dx == x.  Also, x, dx and limit
are double precision for testing all precision, so that the loop overhead
is the same for all precisions.  This works best on i386/i387.  Otherwise,
there are larger conversion overheads.  This usually prevents x + dx == x
in float precision, but in long double precison it results in x + dx == x
more often.  Double precision just can't handle a large limit like
LDBL_MAX or even small steps up to DBL_MAX.

>                f += cospif(x);
>        };
>        time_end();
>
>        x = (time_cpu() / k) * 1.e6;
>        printf("cospif time: %.4f usec per call\n", x);
>
>        if (f == 0)
>                printf("Can't happen!\n");

Otherwise, this is a reasonable throughput test.  But please count times
in cycles if possible.  rdtsc() is very easy to use on x86.

>
> The assumption here is that loop overhead is the same for
> all tested kernels.

It is probably much larger for long double precision.  I get minimal times
like 9 cycles for float and double precision, but more like 30 for long
double on x86.

> Test intervals for kernels.
>
> float: [0x1p-14, 0.25]
> double: [0x1p-29, 0.25]
>  ld80: [0x1p-34, 0.25]
>
>   Core2 Duo T7250 @ 2.00GHz      || AMD FX8350 Eight-Core CPU
>    (1995.05-MHz 686-class)       ||  (4018.34-MHz K8-class)
> ----------------------------------++--------------------------
>       | Horner | Estrin | Fdlibm || Horner | Estrin | Fdlibm
> -------+--------+--------+--------++--------+--------+--------
> cospif | 0.0223 |        | 0.0325 || 0.0112 |        | 0.0085
> sinpif | 0.0233 | Note 1 | 0.0309 || 0.0125 |        | 0.0085
> tanpif | 0.0340 |        | Note 2 || 0.0222 |        |

The fdlibm kernels are almost impossible to beat in float precision,
since they use double precision so the correct way to use them is
for example 'cospif: return __kernel_cosdf(M_PI * x);' after reduction
to |x| ~< 0.25  Any pure float precision method is going to take 10-20
cycles longer.

It is interesting that you measured fdlibm to be faster on the newer
system but much slower on the older system.  The latter must be a bug
somewhere.

> -------+--------+--------+--------++--------+--------+--------
> cospi  | 0.0641 | 0.0571 | 0.0604 || 0.0157 | 0.0142 | 0.0149
> sinpi  | 0.0722 | 0.0626 | 0.0712 || 0.0178 | 0.0161 | 0.0166
> tanpi  | 0.1049 | 0.0801 |        || 0.0323 | 0.0238 |
> -------+--------+--------+--------++--------+--------+--------

Now the differences are almost small enough to be noise.

> cospil | 0.0817 | 0.0716 | 0.0921 || 0.0558 | 0.0560 | 0.0755
> sinpil | 0.0951 | 0.0847 | 0.0994 || 0.0627 | 0.0568 | 0.0768
> tanpil | 0.1310 | 0.1004 |        || 0.1005 | 0.0827 |
> -------+--------+--------+--------++--------+--------+--------

Now the differences are that the kernels for long double precision
are unoptimized.  They use Horner.  Actually, they do use the optimization
of using double precision constants if possible (but not the larger
optimization for sparc64 of calculating higher terms in double precision).

> Time in usec/call.
>
> Note 1.  In re-arranging the polynomials for Estrin's method and
> float, I found appreciable benefit.

Do you mean "no appreciable benefit"?  No times are shown.  Short polynomials
benefit less.  There is also the problem that measuring throughput vs
latency is hard.  If the CPU can execute several functions in parallel, it
is best (iff the load has candidates for such functions, as simple tests
do) to use something like Horner's method to minimise the number of
operations.  Horner's method is only very bad for latency, and on in-order
CPUs.  Some of the timing anomalys are probably explained by this -- newer
CPUs have fewer bottlenecks so do better at executing functions in parallel;
this is also easier in float precision.

> Note 2.  I have been unable to use the tan[fl] kernels to implement
> satisfactory kernels for tanpi[fl].  In particular, for x in [0.25,0.5]
> and using tanf kernel leads to 6 digit ULPs in 0.5 whereas my kernel
> near 2 ULP.

The tanf kernel should be very accurate since it is in double precision.
But its polynomial is chosen to only give an accuracy of 0.7999 ulps,
while the polys for cosf and sing are chosen to give an accuracy of
0.5009 ulps, since the high accuracy is only almost free for the latter.
Any extra error on 0.7999 might be too much.  But multiplication by M_PI
in double precision shouldn't change the error by more than 0.0001 ulps.

The tanl kernel has to struggle to get even sub-ulp precision.  Its
degree is too high for efficiency, and I don't trust it to give even
sub-ulp precision, especially for ld128.

I didn't manage to get cospi(x) and sinpi(x) using the kernels as fast
as cos(x) and sin(x), even with |x| restricted to < 0.25 so that the
range reduction step is null.  The extra precision operations just take
longer than the range reduction even when the latter is not simplifed
for the reduced range.

Conversion of degrees to multiples of Pi is interesting.  E.g.,
cosd(x) = cos(x * Pi / 180) = cospi(x / 180) in infinite precision.
The natural way to implement it is to convert to cospi() first.
This is only easy using a remainder operation.  Remainder operations
work for this, unlike for converting radians to a quadrand plus a
remainder, because 180 is exactly representable but Pi isn't.  But
exact remainder operations are slow too.  They are just not as slow
or inexact as ones for 18000+ digit approximations to Pi.  So cosd(x)
can only be implemented much more efficiently than cos(x) for the
unimportant case of large |x|.

Bruce