Implementation of half-cycle trignometric functions

Bruce Evans brde at optusnet.com.au
Sat Apr 29 07:54:36 UTC 2017

On Fri, 28 Apr 2017, Steve Kargl wrote:

> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
>>
>> I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.

This is all rather over-engineered.  Optimizing these functions is
unimportant comparing with finishing cosl() and sinl() and optimizing
all of the standard trig functions better, but we need correctness.
But I now see many simplifications and improvements:

(1) There is no need for new kernels.  The standard kernels already handle
extra precision using approximations like:

sin(x+y) ~= sin(x) + (1-x*x/2)*y.

Simply reduce x and write Pi*x = hi+lo.  Then

sin(Pi*x) = __kernel_sin(hi, lo, 1).

I now see how to do the extra-precision calculations without any
multiplications.  For Pi*x, suppose that x is near 1/4.  Choose a
nice constant value x0 as far below 1/4 as works and reduce x
further using this:

x = x0 + (x - x0)		// plain '=' means exact
y0 ~= Pi*x0 = hi0 + lo0	// in extra prec at compile time
y ~= Pi*(x-x0)		// small, so doesn't need extra prec
Pi*x ~= hi0 + lo0 + y = hi1 + lo1	// use 2sum
sin(Pi*x) ~= __kernel_sin(hi1, lo1, 1).

but it is better to use a special polynomial approximation about y0 on y.
We have reduced x to the significantly smaller y, so the polynominal can
have fewer terms and not need any extra precision except for the additive
term.  In infinite precision, sin(y+y0) = S0 + S1*y + ...  Write
S0 = hi2+lo2 and choose x0 so that lo2 can be replaced by 0 without
significant loss of accuracy (see some exp2's for this method).  At
worst, lo2 is nonzero and we have to do just 1 extra-precision operation.
Otherwise, we have a normal polyomial evaluation.  Unlike for sin() near
0, there are even terms and and not nice fractional coefficients, but
since y is small there aren't so many terms.  It is important to avoid
needing extra precision for many of the coefficients.  Higher ones
don't need it since y is small, and S0 might not need it.

The Intel ia64 math library uses the reduction x = y+y0 at several
interval endpoints below Pi/4 for ordinary sin().  Previously, I
couldn't get this method to work as well as using the single large
interval [-Pi/4, Pi/4], since the even terms take a while to calculate
and I didn't understand the hi+lo decomposition methods so well.

Reminder on why cos(x) needs extra precison but sin(x) doesn't when
we use the brute force expansion up to Pi/4: we use cos(x) = 1 - x*x/2
+ ... and the only problem is to subtract these terms accurately enough.
There is a problem since Pi/4 is slightly larger than sqrt(1/2), so
x*x/2 can be slightly larger than 1/4.  If it were larger than 1/2,
then the subtraction would lose a full bit to cancelation, so extra
inaccuracy could be more than 1 ulp, but we want it to be less 0.5
ulps so that the combined inaccuracy is less than 1 ulp.  Similarly,
when x*x/2 is larger than 1/4 but below 1/2, the extra inaccuracy
can be larger than 0.5 ulps though smaller than 1 ulp.  Above 0.5
is still to high, so we need some extra precision.  This problem
doesn't affect sin(x) since the ratio of the terms is -x*x/6 so the
extra inaccuracy is 3 times smaller so below 1/3 ulps.

For cos(y+y0) ~= C0 + C1*y + C2*y*y, y must be small enough for the
extra inaccuracy from the first addition to be a little smaller tha
0.5 ulps.  This is not as easy as I was thinking, since the lineat
term is relatively large.  But there is no problem for cos(y+1/2):
cos(y+1/2) ~= 0.88 - 0.48*y, and the ratio of terms with y = Pi/4-1/2
is ~0.16.  However, sin(y+1/2) ~= 0.48 + 0.88*y, so the maximum
ratio of terms is ~0.53 -- too much.

>> static const float
>> s0hi =  3.14062500e+00f,	/* 0x40490000 */
>> s0lo =  9.67653585e-04f,	/* 0x3a7daa22 */
>> s1   = -5.16771269e+00f,	/* 0xc0a55de7 */
>> s2   =  2.55016255e+00f,	/* 0x402335dd */
>> s3   = -5.99202096e-01f,	/* 0xbf19654f */
>> s4   =  8.10018554e-02f;	/* 0x3da5e44d */
>>
>> static inline float
>> __kernel_sinpif(float x)
>> {
>> 	double s;
>> 	float p, x2;
>> 	x2 = x * x;
>> 	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
>> 	s = x * (x2 * (double)p + s0hi + s0lo);
>> 	return ((float)s);
>> }

I don't like the style being completely different.  Why lower-case constants?
...

> Well, starting with above and playing the splitting game
> with x, x2, and p.  I've manage to reduce the max ULP in
> the region testd.  Testing against MPFR with sin(pi*x)
> computed in 5*24-bit precision gives

It already cheats by using (double)p, but is limited by using float to
caclulate p.  p should have an inaccuracy of not much more than 1 ulp,
and we don't care if this is a little over 1.  Then when we do the final
caclulations in either extra precision or using the double hack, we
don't get any more inaccuracy.  Extra precision is not very useful for
x2 * p -- it only increases the error by another half an ulp.  So errors
for that part are 0.5 ulps for x2, ~1 ulp for p and 0.5 for a float
multiplication -- 2.0 althogther -- and the the double multiplication
reduces this to 1.5.  This gets scaled down as usual by the term ratio
x*x/6 ~= 0.2.  0.2 of 1.5 = 0.3 is below the magic 0.5, but so is 0.2 of
2.0 = 0.4.  But I would prefer below the magic 0.25.

>         MAX ULP: 0.73345101
>    Total tested: 33554427
> 0.7 < ULP <= 0.8: 90
> 0.6 < ULP <= 0.7: 23948

You can even reverse this to calculate the maximum of the term ratio
fairly accuractely.  The rounding error is 0.233.  We had an error of
about 1.5 before the final addition.  The ratio 0.233/1.5 = 0.16 gives
the reduction ratio for the final addition.  It approximates the
term ratio.  (I'm surprised that it is lower than the term ratio
instead of higher.  In fact, I don't believe the 0.733.  According
to the term ratio, the worst case must be at least 0.800.)

> Exhaustive testing with my older sinpi(x) as the reference
> gives
>
> ./testf -S -m 0x1p-14 -M 0.25 -d -e
>         MAX ULP: 0.73345101
>    Total tested: 100663296
> 0.7 < ULP <= 0.8: 45
> 0.6 < ULP <= 0.7: 11977

Perhaps it really is 0.733.  I estimated an error of 1 ulp for p, but
its term ratio is much lower than for the first 2 terms -- 20 times
lower from the factorial in the denominator -- so 0.55 ulps would
be a better estimate.  This gives the estimate term_ratio * 1.05
= 0.21 extra ulps which matches 0.233 almost perfectly.  Now it seems
too perfect to be true -- I usually forget to add up all the half-
ulps in calculations like this.

> The code is slightly slower than my current best kernel.
> sinpif time is 0.0147 usec per call (60 cycles).

It still seems slow :-).  Full sin(x) takes 43 cycles on amd64
(freefall xeon) for x where its extra precision is used (iy != 0),
and calculating only Pi*x in extra precision shouldn't add much to
that, and the simpler range reduction should subtract a lot (sin(x)
takes 23 cycles below 2Pi where its range reduction is simple).

> static inline float
> __kernel_sinpif(float x)
> {
> 	float p, phi, x2, x2hi, x2lo, xhi, xlo;
> 	uint32_t ix;
>
> 	x2 = x * x;
> 	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
>
> 	GET_FLOAT_WORD(ix, p);
> 	SET_FLOAT_WORD(phi, (ix >> 14) << 14);
>
> 	GET_FLOAT_WORD(ix, x2);
> 	SET_FLOAT_WORD(x2hi, (ix >> 14) << 14);

I expect that these GET/SET's are the slowest part.  They are quite fast
in float prec, but in double prec on old i386 CPUs compilers generate bad
code which can have penalties of 20 cycles per GET/SET.

Why the strange reduction?  The double shift is just a manual optimization
or pssimization (usually the latter) for clearing low bits.  Here it is
used to clear 14 low bits instead of the usual 12.  This is normally
written using just a mask of 0xffff0000, unless you want a different
number of bits in the hi terms for technical reasons.  Double precision
can benefit more from asymmetric splitting of terms since 53 is not
divisible by 2; 1 hi term must have less than 26.5 bits and the other term
can hold an extra bit.

Here is an example of fairly refined splitting and use of 2sum withit:

X float
X log10f(float x)
X {
X 	struct Float r;
X 	float_t hi, lo;
X 	float fhi;

hi+lo decompositions need float_t in general, for both correctness and
efficiency.

X 	int32_t hx;
X
X 	DOPRINT_START(&x);
X 	k_logf(x, &r);
X 	if (!r.lo_set)
X 		RETURNP(r.hi);
X 	if (sizeof(float_t) > sizeof(float))
X 		RETURNP((invln10_lo + (float_t)invln10_hi) * (r.lo + r.hi));
X 	_2sumF(r.hi, r.lo);
X 	fhi = r.hi;
X 	GET_FLOAT_WORD(hx, fhi);
X 	SET_FLOAT_WORD(fhi, hx & 0xfffff000);

It is hard to do the splitting better than this.  This method subtly
masks bugs if float_t is not used.  On i386/i387 it forces much the
same slow loads and stores that are needed to discard any extra
precision when float_t is not used.

X 	hi = fhi;
X 	lo = r.lo + (r.hi - hi);

The code tries to optimize things with careful assignments between float_t
and float.  This has further subtleties from float_t being used.  When there
is extra precision, float_t should be different from float, and compilers
should not be broken enough to elide conversions between float and float_t.
The above 2 lines are basic splitting, and the subtleties are what is needed
for this to work.  struct Float is just hi and lo in a struct, each with
type float, but a further subtlety is that you want the compiler to keep
all variables in registers, with floats in extended precision in the
registers; then it must be arranged that hi+lo decompositions in registers
have no extra precision...

X 	RETURN2P(invln10_hi * hi,
X 	    (invln10_lo + invln10_hi) * lo + invln10_lo * hi);
X }

... but when the hi+lo values are used in expressions like this, we
want the extra precision back.  Here the hi*hi term is exact (that is
the point of the decomposition), and extra precision gives some free
extra accuracy for the rest of the calculation.

>
> 	x2lo = s0lo + x2 * (p - phi) + (x2 - x2hi) * phi;
> 	x2hi *= phi;
>
> 	GET_FLOAT_WORD(ix, x);
> 	SET_FLOAT_WORD(xhi, (ix >> 14) << 14);
> 	xlo = x - xhi;
> 	xlo = xlo * (x2lo + x2hi) + (xlo * s0hi + xhi * x2lo);
>
> 	return (xlo + xhi * x2hi + xhi * s0hi);
> }

Yet another splitting is going to be slow.  This one is necessary.  Are
the first 2 really necessary?  The above analysis might show that they
really aren't -- didn't we get an error of 0.733 ulps (which seems too
low) using both methods?

Otherwise, this code is very similar to my log10f().  k_logf() does
most of the work.  It evaluates logf(x) in extra precision.  log10f(x)
is just C*logf(x) in extra precision.  k_logf() arranges to return
the extra precision.  For sinpi*(), you need to scale the arg instead
of the result.  This is actually easier, since the arg is not in extra
precision and the extra precision is very easy to handle in the kernel
(and even already handled in standard kernels).

The above logf() runs acceptably fast using a kernel not especially
optimized for it.  Signficantly slower than my logf().  Only slightly
faster than fdlibm+cygnus log10f(), but it is much more accurate.

Bruce