Implementation of half-cycle trigonometric functions

Steve Kargl sgk at troutmask.apl.washington.edu
Wed Apr 26 03:23:58 UTC 2017


On Tue, Apr 25, 2017 at 06:27:52PM +1000, Bruce Evans wrote:
> On Mon, 24 Apr 2017, Steve Kargl wrote:
> 
> > On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote:
> 
> >> [for expl]
> >> This is arranged to look a bit like Horner's method at first (operations
> >> not written in separate statements), but it is basically Estrin's method
> >> with no extensions to more complicated sub-expressions than Estrin's,
> >> though with complications for accuracy:
> >> - I think accuracy for the B3 term is unimportant, but for some reason
> >>    (perhaps just the one in the comment) it is grouped separately.  Estrin
> >>    might use x3 * (B3.e + x * B4.e).
> >
> > Yes, it is most likely an accuracy issue.  I found something
> > similar with sinpi/cospi.  My k_sinpi.c has
> >
> > #if 0
> >        double x2;
> >        /* Horner's method. */
> >        x2 = x * x;
> >        sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3)
> >            + s2) + s1;
> 
> Horner's of course has about as many dependencies as operations (13), so
> on modern x86 this is limited by the latency of 3-5 cycles/instruction,
> or 39-65 cycles.  Multiplications take an extra cycle in more cases, so
> it is better to have more additions, but unfortunately Horner gives 1
> more multiplication.
> 
> With no dependencies, the limit would be throughput -- 13 operations at
> 2 instructions/cycle is 6.5 cycles, rounded up to 7; then add the latency
> of the last operation to get about 11.
> 
> With practical dependencies, it is hard to get more than a 2-fold speedup.
> 
> > #else
> >        double x2, x4;
> >        /* Sort of Estrin's method. */
> >        x2 = x * x;
> >        x4 = x2 * x2;
> >        sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4
> >            + (s2 + s3 * x2 + s4 * x4) * x2 + s1;
> > #endif
> 
> This looks OK.  The only obvious improvement is to try grouping x4*x4
> so that it can be evaluated separately.  This reduces the length of
> dependency chains by 1, but increases early pressure on multiplication.
> Don't write this as x8 = x4*x4.  Compilers will do the equivalent.
> You can also replace x4 by (x2*x2) but there are many x4's so this
> might be less readable.

The grouping of (x4*x4) and (x2 * x4) in other polynomials indeed
gives small improvements.  I've reworked the approximations in
sinpi[l], cospi[l], and tanpi[l].  I did not see any improvement with
the float versions, as the orders of the polynomials are fairly small
(order 4).
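
To make the difference concrete, here is a standalone comparison of the
two evaluation schemes.  The coefficients below are placeholders, not
the real minimax coefficients from k_sinpi.c:

#include <stdio.h>

/* Placeholder coefficients; the real s1-s7 live in k_sinpi.c. */
static const double s1 = -1.0 / 6, s2 = 1.0 / 120, s3 = -1.0 / 5040,
    s4 = 1.0 / 362880, s5 = -1.0 / 39916800.0, s6 = 1.0 / 6227020800.0,
    s7 = -1.0 / 1307674368000.0;

/* Horner: one long dependency chain, one operation at a time. */
static double
poly_horner(double x2)
{
	return (x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4)
	    + s3) + s2) + s1);
}

/* Estrin-style: three short chains that can issue in parallel. */
static double
poly_estrin(double x2)
{
	double x4 = x2 * x2;

	return ((s5 + s6 * x2 + s7 * x4) * (x4 * x4)
	    + (s2 + s3 * x2 + s4 * x4) * x2 + s1);
}

int
main(void)
{
	double x2 = 0.04;

	printf("%.17g\n%.17g\n", poly_horner(x2), poly_estrin(x2));
	return (0);
}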


> > The addition of s1 must occur last, or I get errors greater than
> > 1 ULP for a number of values.
> 
> Indeed.  This prevents using the best order, which is the opposite.
> You used the best order for the other inner terms, but not so obviously
> for the first outer addition.  However, the compiler is free to reorder
> that.  Consider:
> 
>  	T2 + T1 + T0
> 
> T2 is more complicated and everything depends on it, so should be started
> first, but possibly T1 can be started first because it is simpler.  Here
> it is not significantly simpler.  However, for:
> 
>  	T3 + T2 + T1 + T0
> 
> the natural left-to-right grouping is usually pessimal.  This should be
> written as:
> 
>  	T3 + (T2 + T1) + T0
> 
> Unfortunately, we can't write this naturally and usually optimally as:
> 
>  	T0 + T1 + T2 + T3
> 
> due to the accuracy problem with T0.  Maybe write this as:
> 
>  	T1 + T2 + T3 + T0	// only T0 in unnatural order
> 
> For expl, I tried all orders.

I tried a few different orders, particularly with the long double
functions.  I also managed to break the code a few times.  Thankfully,
I have everything in a subversion repository.
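
To make the reordering concrete, a small illustration (T0-T3 here are
stand-ins for the partial sums of a polynomial, with T0 the dominant
term that must be added last for accuracy; they are not names from the
actual code):

/* Natural order: left-to-right, usually pessimal for scheduling. */
static double
sum_natural(double T0, double T1, double T2, double T3)
{
	return (T3 + T2 + T1 + T0);
}

/*
 * When T3 is itself a long dependency chain in real code, this
 * grouping lets (T2 + T1) be evaluated while T3 is still in flight.
 */
static double
sum_grouped(double T0, double T1, double T2, double T3)
{
	return (T3 + (T2 + T1) + T0);
}

/* Only T0 out of natural order, kept last for accuracy. */
static double
sum_reordered(double T0, double T1, double T2, double T3)
{
	return (T1 + T2 + T3 + T0);
}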

> 
> > My timings on an Intel Core2 Duo show
> >
> > /* Horner's methods. */
> > laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22
> > sinpi time: 0.066 usec per call
> > cospi time: 0.063 usec per call
> >
> > /* Estrin's methods. */
> > laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22
> > sinpi time: 0.059 usec per call
> > cospi time: 0.055 usec per call
> >
> > Here, -n 22 means to test 2**22 - 1 values in the interval
> > [0,0.25].
> 
> The improvement is marginal.  Only worth doing if it doesn't complicate
> the code much.  Older CPUs can't exploit the parallelism so well, so I
> would expect a relatively larger improvement.  Don't measure in usec,
> since they are hard to scale.  0.66 usec at 1.86 GHz is more than 100
> cycles.  This is quite slow.  I get similar times for cos and sin on an
> older Athlon-XP (~40 cycles in float prec, ~160 in double and ~270 in
> long double).  Newer CPUs are about 4 times faster in double prec (only
> 2 times faster in float prec) in cycles despite having similar nominal
> timing, by unclogging their pipes.

I'm assuming your 0.66 in the above is a typo, as I wrote 0.066.  I also
got some improvement by changing how a few quantities are split into
high and low parts, using a simple cast.  Instead of doing

	EXTRACT_WORDS(hx, lx, x);
	INSERT_WORDS(xhi, hx, 0);
	xlo = x - xhi;

I do

	xhi = (float)x;
	xlo = x - xhi;

but this seems to help only in some situations.  My new timings for
x in [0,0.25] on my 1995.05 MHz Core2 Duo give

Horner
sinpi time: 0.073 usec per call  (~146 cycles)
cospi time: 0.070 usec per call  (~140)
tanpi time: 0.112 usec per call  (~224)

Estrin
sinpi time: 0.064 usec per call  (~128)
cospi time: 0.060 usec per call  (~120)
tanpi time: 0.088 usec per call  (~176)

tanpi benefits the most, as it has an order-16 polynomial, which
allows some freedom in grouping terms.
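
For reference, a standalone sketch of the two splitting methods above;
the memcpy-based version stands in for the EXTRACT_WORDS/INSERT_WORDS
macros so the example is self-contained, and |x| is assumed well within
float range so the cast cannot overflow:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Mask split: zero the low 32 bits, like INSERT_WORDS(xhi, hx, 0). */
static double
split_mask(double x)
{
	uint64_t u;

	memcpy(&u, &x, sizeof(u));
	u &= 0xffffffff00000000ULL;
	memcpy(&x, &u, sizeof(x));
	return (x);
}

/* Cast split: the 24-bit float significand becomes the high part. */
static double
split_cast(double x)
{
	return ((double)(float)x);
}

int
main(void)
{
	double x = 2.4220192246482908e-01;
	double h1 = split_mask(x), l1 = x - h1;	/* x - xhi is exact */
	double h2 = split_cast(x), l2 = x - h2;	/* also exact */

	printf("mask: %a + %a = %a\n", h1, l1, h1 + l1);
	printf("cast: %a + %a = %a\n", h2, l2, h2 + l2);
	return (0);
}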

I don't think I can do much better.  For example, the approximation
for sinpi(x) is

sinpi(x) = x * (s0 + x2 * S(x2))

where S(x2) is the polynomial; to keep the accuracy, x (and hence x2)
must be split into high and low parts, and s0 is likewise split into
s0hi and s0lo.  The kernel, using the Estrin-style evaluation with the
polynomial coefficients defined elsewhere in the file, is

static inline double
__kernel_sinpi(double x)
{
	double hi, lo, sm, x2, x4, xhi, xlo;
	uint32_t hx, lx;

	/* Sort of Estrin's method. */
	x2 = x * x;
	x4 = x2 * x2;
	sm = (s5 + s6 * x2 + s7 * x4) * (x4 * x4)
	    + (s2 + s3 * x2 + s4 * x4) * x2 + s1;

	/* Split x into xhi with the low 32 bits zeroed and an exact tail. */
	EXTRACT_WORDS(hx, lx, x);
	INSERT_WORDS(xhi, hx, 0);
	xlo = x - xhi;

	/* x2 * sm = hi + lo, using x * x == xhi * xhi + xlo * (x + xhi). */
	lo = xlo * (x + xhi) * sm;
	hi = xhi * xhi * sm;

	/* Accumulate x * (s0 + x2 * sm) from smallest terms to largest. */
	lo = x * lo + xlo * (s0hi + s0lo) + x * hi;

	return (lo + xhi * s0lo + xhi * s0hi);
}

The order of terms in the final few statements matters.  Testing yields

        a =  2.4220192246482908e-01, /* 0x3fcf0078, 0xfc01e3f0 */
 sinpi(a) =  6.8957335930938313e-01, /* 0x3fe610fc, 0x264da72e */
sin(pi*a) =  6.8957335930938324e-01, /* 0x3fe610fc, 0x264da72f */
ULP: 0.79950446
         MAX ULP: 0.79950446
    Total tested: 8388606
1.0 < ULP       : 0
0.9 < ULP <= 1.0: 0
0.8 < ULP <= 0.9: 0
0.7 < ULP <= 0.8: 554
0.6 < ULP <= 0.7: 11275
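
A minimal sketch of the sort of comparison involved (not the actual
testd program): sin(M_PI * x) stands in for the sinpi() under test,
and sinl() with a long double pi serves as the reference.

#include <math.h>
#include <stdio.h>

static double
ulp_err(double got, long double want)
{
	int e;

	frexpl(want, &e);
	/* One ulp of a double result with the exponent of want. */
	return ((double)fabsl((got - want) / ldexpl(1.0L, e - 53)));
}

int
main(void)
{
	const long double pi = 3.14159265358979323846264338327950288L;
	long i, n = 1L << 22;
	double maxulp = 0;

	for (i = 1; i < n; i++) {
		double x = 0.25 * i / n;
		double u = ulp_err(sin(M_PI * x), sinl(pi * x));

		if (u > maxulp)
			maxulp = u;
	}
	printf("MAX ULP: %.8f\n", maxulp);
	return (0);
}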

> We should also try to use the expression A*x + B more often so that
> things will be faster when this is done by fma, but this happens
> naturally anyway.

Using A*x+B or B+A*x did not seem to matter.
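
Spelled out, the two shapes and the explicit C99 fma() alternative look
like this (a sketch, not code from libm); whether fma() wins depends on
the target having a hardware fma, since the libm software fallback is
much slower than a plain multiply-add:

#include <math.h>

/* May contract to a single fma under -ffp-contract=fast. */
static double
step_expr(double a, double x, double b)
{
	return (a * x + b);
}

/* Guaranteed single rounding, but slow without a hardware fma. */
static double
step_fma(double a, double x, double b)
{
	return (fma(a, x, b));
}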

-- 
Steve
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

