Implementation of half-cycle trigonometric functions

Bruce Evans brde at optusnet.com.au
Tue Apr 25 05:30:44 UTC 2017

On Mon, 24 Apr 2017, Steve Kargl wrote:

> I have what appears to be working versions of asinpi[fl].  It
> was suggested elsewhere that using an Estrin's method to
> sum the polynomial approximations instead of Horner's method
> may allow modern CPUs to better schedule generated code.
> I have implemented an Estrin's-like method for sinpi[l]
> and cospi[l], and indeed the generated code is faster on
> my Intel core2 duo with only a slight degradation in the
> observed max ULP.  I'll post new versions to bugzilla in
> the near future.

I didn't remember Estrin, but checked Knuth and see that his
method is a subset of what I use routinely.  I might have learned
it from Knuth.  I tried all methods in Knuth, but they didn't seem
to be any better at first.  Later I learned more about scheduling.
Estrin's method now seems routine as a subset of scheduling.

You already used it in expl, for ld80:

X 		x2 = x * x;
X 		x4 = x2 * x2;

Unlike in Knuth's description of Estrin, we don't try to order the
operations in the code to suit the CPU(s), but just try to keep them
independent, as required for them to execute independently.  Compilers
will rearrange them anyway, sometimes even correctly, but it makes
little difference (on OOE CPUs) to throw all of the operations into
the pipeline and see how fast something comes out.

X 		q = x4 * (x2 * (x4 *
X 		    /*
X 		     * XXX the number of terms is no longer good for
X 		     * pairwise grouping of all except B3, and the
X 		     * grouping is no longer from highest down.
X 		     */
X 		    (x2 *            B12  + (x * B11 + B10)) +
X 		    (x2 * (x * B9 +  B8) +  (x * B7 +  B6))) +
X 			  (x * B5 +  B4.e)) + x2 * x * B3.e;

This is arranged to look a bit like Horner's method at first (the
operations are not written in separate statements), but it is basically
Estrin's method, with no extensions to more complicated sub-expressions
than Estrin's, though with complications for accuracy:
- I think accuracy for the B3 term is unimportant, but for some reason
(perhaps just the one in the comment) it is grouped separately.  Estrin
might use x3 * (B3.e + x * B4.e).
- accuracy is important for the B0 through B2 terms, and we don't use
Estrin for them.  Efficiency is a matter of running the above in parallel
with the evaluation of the lower terms.  The above shouldn't steal any
resources from the evaluation of the lower terms unless it is the slow
part.  Normally you want the 2 independent streams to complete at about
the same time.

The above also does some power-of-2 grouping which has been messed up
by changing the number of terms in the polynomial (see the comment).
Generally each missed optimization costs the time of 1 operation.

The above hasn't been rearranged since it is a lot of work to keep
rearranging it during development, and development stalled after committing
it.  The ld128 case has larger problems from this.  It has 2 sets of
polynomials, both of higher order, with sub-optimal numbers of terms.

Here is my program for testing the results of quadratic grouping on x86.
It shows little qualitative difference between Athlon-XP and Haswell.
Haswell is just faster:

XX #include <stdlib.h>
XX
XX static __inline void
XX cpuid(int ax)
XX {
XX 	__asm __volatile("cpuid" : "=a" (ax) : "0" (ax) : "cx", "dx", "bx");
XX }
XX #define	puid(ax)
XX
XX double P0, P1, P2, P3;
XX double P4, P5, P6, P7;
XX double P8, P9, Pa, Pb;
XX double Pc, Pd, Pe, Pf;
XX double Q0, Q1, Q2, Q3;
XX double Q4, Q5, Q6, Q7;
XX double Q8, Q9, Qa, Qb;
XX double Qc, Qd, Qe, Qf;
XX volatile double a, r;
XX
XX #define	FREQ	2010168298	/* sysctl -n machdep.tsc_freq */
XX
XX #define	NITER	(FREQ / 10)
XX
XX int
XX main(int argc, char **argv)
XX {
XX 	double x;
XX 	int i, n;
XX
XX 	n = (argc == 1 ? 0 : atoi(argv[1]));
XX 	switch (n) {
XX 	default:
XX 		for (i = 0; i < NITER; i++) {
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 2:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = x*P1+P0;
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case -4:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = P0+x*(P1+x*(P2+x*P3));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 4:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = (x*P3+P2)*(x*x)+(x*P1+P0);
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 5:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double t32 = x*P3+P2;
XX 			double t10 = x*P1+P0;
XX 			double x2 = x*x;
XX 			r = t32*x2+t10;
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case -8:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = P0+x*(P1+x*(P2+x*(P3+x*(P4+x*(P5+x*(P6+x*P7))))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 6:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double lo = 0.0;
XX 			r = ((x*P7+P6)*(x*x)+(x*P5+P4))*((x*x)*(x*x)) +
XX 			    ((x*P3+P2)*(x*x)+lo);
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 7:
XX 		for (i = 0; i < NITER; i++) {
XX 			double d = a;
XX 			double lo = 0.0;
XX 			double t = d*P7+P6;
XX 			double z = d*d;
XX 			double s = d*P5+P4;
XX 			double rl = d*P3+P2;
XX #if 0
XX 			r = lo + z * rl + z * z * s + z * z * (z * t);
XX #elif 0
XX 			r = lo + z * z * (z * t) + z * z * s + z * rl;
XX #elif 0
XX 			r = z * z * (z * t) + (z * z * s + lo) + z * rl;
XX #else
XX 			r = z * z * (z * t) + z * z * s + (z * rl + lo);
XX #endif
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 8:
XX 		for (i = 0; i < NITER; i++) {
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 9:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double t76 = x*P7+P6;
XX 			double t54 = x*P5+P4;
XX 			double t32 = x*P3+P2;
XX 			double x2 = x*x;
XX 			r = ((t76    )*x2+(t54    ))*(x2*x2) +
XX 			    ((t32    )*x2+(x*P1+P0));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 10:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double t76 = x*P7+P6;
XX 			double t32 = x*P3+P2;
XX 			double t54 = x*P5+P4;
XX 			double x2 = x*x;
XX 			r = ((t76    )*x2+(t54    ))*(x2*x2) +
XX 			    ((t32    )*x2+(x*P1+P0));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 11:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double t76 = x*P7+P6;
XX 			double t54 = x*P5+P4;
XX 			double t32 = x*P3+P2;
XX 			double x2 = x*x;
XX 			double t7654 = t76*x2+t54;
XX 			double t3210 = t32*x2+(x*P1+P0);
XX 			r = t7654*(x2*x2) + t3210;
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case -16:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = P0+x*(P1+x*(P2+x*(P3+x*(P4+x*(P5+x*(P6+x*(P7+x*(P8+x*(P9+x*(Pa+x*(Pb+x*(Pc+x*(Pd+x*(Pe+x*(Pf)))))))))))))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 16:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = (((x*Pf+Pe)*(x*x)+(x*Pd+Pc))*((x*x)*(x*x)) +
XX 			     ((x*Pb+Pa)*(x*x)+(x*P9+P8)))*
XX 			     (((x*x)*(x*x))*((x*x)*(x*x))) +
XX 			    (((x*P7+P6)*(x*x)+(x*P5+P4))*((x*x)*(x*x)) +
XX 			     ((x*P3+P2)*(x*x)+(x*P1+P0)));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 17:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double tfe = x*Pf+Pe;
XX 			double tba = x*Pb+Pa;
XX 			double t76 = x*P7+P6;
XX 			double t32 = x*P3+P2;
XX 			double x2 = x*x;
XX 			double tdc = x*Pd+Pc;
XX 			double t98 = x*P9+P8;
XX 			double t54 = x*P5+P4;
XX 			r = (((tfe    )*x2+(tdc    ))*(x2*x2) +
XX 			     ((tba    )*x2+(t98    )))*((x2*x2)*(x2*x2)) +
XX 			    (((t76    )*x2+(t54    ))*(x2*x2) +
XX 			     ((t32    )*x2+(x*P1+P0)));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 18:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double tfe = x*Pf+Pe;
XX 			double tba = x*Pb+Pa;
XX 			double t76 = x*P7+P6;
XX 			double tdc = x*Pd+Pc;
XX 			double x2 = x*x;
XX 			double t32 = x*P3+P2;
XX 			double t98 = x*P9+P8;
XX 			double t54 = x*P5+P4;
XX 			double t10 = x*P1+P0;
XX 			double x4 = x2*x2;
XX 			double tfedc = x2*tfe+tdc;
XX 			double tba98 = x2*tba+t98;
XX 			double t7654 = x2*t76+t54;
XX 			double t3210 = x2*t32+t10;
XX 			double tfedcba98 = x4*tfedc+tba98;
XX 			double t76543210 = x4*t7654+t3210;
XX 			r = (x4*x4)*tfedcba98+t76543210;
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case -32:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = P0+x*(P1+x*(P2+x*(P3+x*(P4+x*(P5+x*(P6+x*(P7+x*(P8+x*(P9+x*(Pa+x*(Pb+x*(Pc+x*(Pd+x*(Pe+x*(Pf+x*(Q0+x*(Q1+x*(Q2+x*(Q3+x*(Q4+x*(Q5+x*(Q6+x*(Q7+x*(Q8+x*(Q9+x*(Qa+x*(Qb+x*(Qc+x*(Qd+x*(Qe+x*(Qf)))))))))))))))))))))))))))))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 32:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = ((((x*Qf+Qe)*(x*x)+(x*Qd+Qc))*((x*x)*(x*x)) +
XX 			      ((x*Qb+Qa)*(x*x)+(x*Q9+Q8)))*
XX 			      (((x*x)*(x*x))*((x*x)*(x*x))) +
XX 			     (((x*Q7+Q6)*(x*x)+(x*Q5+Q4))*((x*x)*(x*x)) +
XX 			      ((x*Q3+Q2)*(x*x)+(x*Q1+Q0))))*
XX 			      ((((x*x)*(x*x))*((x*x)*(x*x)))*
XX 			       (((x*x)*(x*x))*((x*x)*(x*x)))) +
XX 			    ((((x*Pf+Pe)*(x*x)+(x*Pd+Pc))*((x*x)*(x*x)) +
XX 			      ((x*Pb+Pa)*(x*x)+(x*P9+P8)))*
XX 			      (((x*x)*(x*x))*((x*x)*(x*x))) +
XX 			     (((x*P7+P6)*(x*x)+(x*P5+P4))*((x*x)*(x*x)) +
XX 			      ((x*P3+P2)*(x*x)+(x*P1+P0))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 33:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double Tfe = x*Qf+Qe;
XX 			double tfe = x*Pf+Pe;
XX 			double T76 = x*Q7+Q6;
XX 			double t76 = x*P7+P6;
XX 			double x2 = x*x;
XX 			r = ((((Tfe    )*x2+(x*Qd+Qc))*(x2*x2) +
XX 			      ((x*Qb+Qa)*x2+(x*Q9+Q8)))*((x2*x2)*(x2*x2)) +
XX 			     (((T76    )*x2+(x*Q5+Q4))*(x2*x2) +
XX 			      ((x*Q3+Q2)*x2+(x*Q1+Q0))))*(((x2*x2)*(x2*x2))*((x2*x2)*(x2*x2))) +
XX 			    ((((tfe    )*x2+(x*Pd+Pc))*(x2*x2) +
XX 			      ((x*Pb+Pa)*x2+(x*P9+P8)))*((x2*x2)*(x2*x2)) +
XX 			     (((t76    )*x2+(x*P5+P4))*(x2*x2) +
XX 			      ((x*P3+P2)*x2+(x*P1+P0))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	}
XX 	return (0);
XX }

It only tests power-of-2 numbers of terms.  With say 6 or 7 terms, it is
harder to run much faster than with 8 terms, but easy to try all other
reasonable groupings.  With 8-15 terms, it is harder to try enough other
groupings.

Rearranging the source order of the sub-expressions makes a little
difference with gcc-3.3, less than that with gcc-4.2, and no difference
with clang.  Looking at the generated code can be instructive.
The compiler sometimes finds a good order that you might not think of.
Also, you can see where it is constrained by dependencies.

Modern x86 CPUs (SSE-mumble or AVX-mumble) have the basic Estrin operation
of B0 + x * B1 as a single instruction (v*fma* in AVX).  Basic Estrin would
reduce to Horner if this were used.  But the C standard doesn't really
allow using it, and compilers don't use it on x86.  gcc did use it on
ia64.  It also takes unportable CFLAGS to get high SSE/AVX support, and
gives unportable binaries when used.  Don't use the libm fma since that
tends to be 10-100 times slower than not using it (perhaps only 10
times slower if it claims to be fast).  This libm fma is only useful
for algorithm development and cases where efficiency doesn't matter.
Fma is about half an ulp more accurate than separate operations.

Bruce