Implementation of half-cycle trignometric functions

Wed May 17 10:48:33 UTC 2017

On Tue, 16 May 2017, Steve Kargl wrote:

> On Wed, May 17, 2017 at 12:49:45PM +1000, Bruce Evans wrote:
>> On Tue, 16 May 2017, Steve Kargl wrote:
>>> Index: lib/msun/ld128/s_cospil.c
>>> +long double
>>> +cospil(long double x)
>>> +{
>>> +	volatile static const double vzero = 0;
>>> +	long double ax, c, xf;
>>> +	uint32_t ix;
>>> +
>>> +	ax = fabsl(x);
>>> +
>>
>> Style (lots of extra blank lines).
>
> If there are only style issues, then we are fortunate as
> I have not even compiled the code.
>
>>> +		ax -= xf;
>>> +
>>
>> These extra blank lines are especially bad, since they separate related
>> code.
>
> Extra blank lines make things easier to read.

Harder.  Related lines should be together.

>>> +	volatile static const double vzero = 0;
>>> +	long double ax, xf;
>>> +	uint32_t ix;
>>> +
>>> +	ax = fabsl(ax);
>>> +
>>> +	if (ax < 1) {
>>> +		if (ax < 0.5) {
>>> +			if (ax < 0x1p-60) {
>>> +				if (x == 0)
>>> +					return (x);
>>> +				ax = (double)x;
>>
>> Style (unnecessary cast).
>
> Huh?  ax is long double.  x is long double.  The cast should
> remove 11 bits.

Oops.  This special case for small x is annoying.

>>> +				x -= ax;
>>> +				t = pilo * x + pihi * x + pilo * ax
>>> +				    + pihi * ax;
>>
>> Style:
>> - unnecessary line splitting
>
> The is line is 80 characters without splitting (and even longer if
> one mandates pi_lo and pi_hi).  I use an 80 character wide and wrap
> lines that are 79+ characters.

It seemed a bit longer when I tested it before.

I missed the confusing names pessimal arrangement of terms before.  ax
is clobbered to hold the hi term, and x is clobbered to hold the lo
term.  Only the pihi * hi term (spelled pihi * ax) needs to be evaluated
exactly or even accurately.  pilo * x can possibly be ignored, but it
is better to group terms so that evaluating it is free.  After grouping
terms, the expression fits in 79 columns:

 				t = (pihi + pilo) * x + pilo * ax + pihi * ax;

Burt with better variable names, it no longer fits, unless (pihi + pilo) is
spelled pi.

>>> Index: lib/msun/ld80/s_cospil.c
>>> ...
>>> +		if (ix < 0x3ffe)			/* |x| < 0.5 */
>>> +			c = __kernel_sinpil(0.5 - ax);
>>> +		else if (lx < 0xc000000000000000ull) {	/* |x| < 0.75 */
>>
>> Style bug (spelling of the long long abomination using "ull").  The
>> explicit abomination is unfortunately needed by old versions of gcc
>> with CFLAGS asking for nearly C90.
>
> Whoops that one may have slipped through.

There are also some f suffixes which should be F.

>>> +			if (ax == 0.5)
>>> +				RETURNI(0);
>>
>> Perhaps pessimal to not test for 0.5 in bits.
>>
>> It is pessimal to avoid generating inexact for 0.5 at all.  I don't see
>> a much better way.
>
> I tried using bits and floating point.  Saw no difference.
> cospi(n.5) is exact (see n1950.pdf).  See comment in s_cospi.c
> that documents special cases.

I havven't looked at n1950.pdf.  Exact shouldn't mean "no spurious inexact".

>>> +/*
>>> + * The basic kernel for x in [0,0.25].  To exploit the kernel for cos(x),
>>> + * the argument to __kernel_cospi() must be multiply by pi.  As pi is an
>>> + * irrational number, this allows the splitting of pi*x into high and low
>>> + * parts.
>>> + */
>>
>> Allows?
>
> x = n + r.  For double, r has 53 bits.  There are no trailing low bits
> from the argument reduction.  pi is irrational, and therefore not exactly
> representable by 53 bits (or any number of bits).  __kernel_cospi() uses
> __kernel_cos(), and the low bits come about because of pi.  I split
> x into high in low.  So, we have

It is a misdescription or grammar error.  Errors start with "exploit"
instead of "use" (this is only a style bug).  Then "multiply" instead of
"multiplied".  Irrationality has little to do with things.  Most numbers,
rational, irrational and algebraic and transcendental, are not exactly
representable.  This doesn't allow splitting, but usually requires
splitting.  Transcendental numbers are less likely to need splitting than
albegraic numbers since they are more likely to be nearly representable,
but since no numeric accident for pi occurs or is arranged or easily
arrangeable near here, it does need splitting.

The source code is not the place to describe basic splitting.

> lo = pilo * xlo + pilo * xhi + pihi * xlo;  /* all the low bits */
> hi = pihi * xhi;  /* Exact */
> _2sumF(hi, lo);   /* Maybe superfluous? */

In my experiments, _2sumF was needed to prevent too much being in the lo
term.

The lo term has pessimal grouping here.  It should be written as
(pihi + pilo) * xlo + pilo * xhi or pilo * x + pihi * xlo.  The second one
is probably better since it has a shorter dependency chain for pilo * x.

>>> +static inline double
>>> +__kernel_cospi(double x)
>>> +{
>>> +	double hi, lo;
>>> +	hi = (float)x;
>>
>> _2sumF is extensively documented to not work with double.  It requires
>> double_t here.
>
> It seems to work base on testing hundreds of millions of values. :-)

It never works with extra precision.  Except when double is accidentally
the same as double_t.  That is, when the variables declared as double
remain in registers.  While they are there, they act like a hi-lo
decomposition in double_t.  This breaks as soon as the variables are
spilled or passed to a function.  You inlined the kernels and this masks
the problem.  The problem still occurs if the compiler decides not to
inline.

Example: let x be just 1.25.  The initial (hi, lo) is (1.25, 0).  After
multiplying by pi, it is (1.25 * pihi, 1.25 * pilo).  The lo term here is
small enough to work directly.  But we have to merger the terms for the
general case.  We use _2sumF().  With extra precision, this puts most
of the lo term in the hi term.  (hi, lo) is now approx. (1.25 * pi, 0)
in extra precision.  Passing this to a function or assigning it using a
C99 compiler loses all of the extra precision and gives approx.
(1.25 * pi, 0) in double precision.  If there is extra precision and the
variable type is plain double, then _2sumF() also depends on not being
compiled by a C99 compiler for efficiency.  C99 compilers produce a
(hi, lo) decomposition in plain doubles, but this is too slow to use.

My logl() depended on the double variables staying in registers not
so accidentally for many years until I found a portable way to write
an efficient _2sumF().

>>> +#define	INLINE_KERNEL_SINDF
>>> +#define	INLINE_KERNEL_COSDF
>>
>> Prossibly not worth inlining.

Inlining is a small pessimization in most of my tests in lgamma.

>> Bogus casts.  The kernels already return float, although that is pessimal
>> and destroys their the xtra precision that they naturally have.
>
> Exhaustive testing shows max ULP < 0.500xyyy where might be 0, but
> can't recall now.  How exact to you want these?

The same number of bits as there are internally.  This is about 10 extra
bits to get the accuracy of 0.5009 ulps (the error is about 1 in 1024 ot
0.0009 internally and 0.5000 more for rounding to float).

>>> Index: lib/msun/src/s_sinpi.c
>>> ...
>>> +double
>>> +sinpi(double x)
>>> +{
>>> +	volatile static const double vzero = 0;
>>> +	double ax, s;
>>> +	uint32_t hx, ix, j0, lx;
>>> +
>>> +	EXTRACT_WORDS(hx, lx, x);
>>> +	ix = hx & 0x7fffffff;
>>> +	INSERT_WORDS(ax, ix, lx);
>>
>> A slow way to set ax to fabs(x).
>
> Make it work.  Make it correct.  Make it fast.  I've undertaken
> the first two.

Simple fabs() already does the first 2 much more easily.  The fdlibm
implemenation of it is slow essentially because it does the above
EXTRACT/INSERT (a subset, but almost as slow).  fabs() is normally
a builtin and doesn't use these.

ax = (x < 0 ? -x : x) is also faster than the above, but doesn't work
right for NaNs.  It can be done after testing for NaNs.

>> Note that lgamma(0.5) is naturally inexact.  lgamma(integer) should be exact,
>> but I doubt that it is.  Maybe we broke it.

We didn't.

>>> +	if (ix < 0x43300000) {			/* 1 <= |x| < 0x1p52 */
>>> +		/* Determine integer part of ax. */
>>> +		j0 = ((ix >> 20) & 0x7ff) - 0x3ff;
>>> +		if (j0 < 20) {
>>> +			ix &= ~(0x000fffff >> j0);
>>> +			lx = 0;
>>> +		} else {
>>> +			lx &= ~((uint32_t)0xffffffff >> (j0 - 20));
>>> +		}
>>> +		INSERT_WORDS(x, ix, lx);
>>> +
>>> +		ax -= x;
>>> +		EXTRACT_WORDS(ix, lx, ax);
>>
>> Much slower than what lgamma()s sin_pi() does.  After our optimizations,
>> that does basically rint() inline using ax + 0x1p52 - 0x1p52,
>
> I found a situation where this trick seems to round in the wrong direction.
> You participated in the discussion.
>
> https://lists.freebsd.org/pipermail/freebsd-numerics/2017-March/000658.html
> https://lists.freebsd.org/pipermail/freebsd-numerics/2017-March/000660.html

We always had a fix for that in lgamma.  Now I'm not sure why floor() and
rint() don't use our method.

>> I think we can avoid the spurious inexacts with a little care.  When
>> ax is an integer (and < 0x1p52), we can add at least 0x1p51 to it exactly.
>> We also want to avoid inexact for half-integers for sinpi and cospi, and
>> quarter-integers for tanpi, and it is good to reduce to quadrants anyway.
>> So we might intitially multiply by 4 and do the += 0x1p52 trick only if
>> 4*ax < 0x1p52 (or whatever the safe threshold is).  Efficiency doesn't
>> matter for larger ax, so slow methods using things like floor(16*ax)
>> are adequate, or we can reduce by a few powers of 2 using special cases:
>>
>>  	v = 4 * ax;
>>  	if (v >= 0x1p54)
>>  		;		/* ax is an even integer (check factor) /
>>  	else {
>>  		if (v >= 0x1p53)
>>  			v -= 0x1p53;
>>  		if (v >= 0x1p52)
>>  			v -= 0x1p52;
>>  	}

I ended up going the mostly other way and removing all of the bit fiddling.
The preliminary +-0x1p52 was mostly a waste of time, so I removed it, but
;ater I brought it back in a different form and removed the secondatry
+-0x1p50.  Then I compressed the case statement.  The result is much simpler
but only slightly faster:

XX /*
XX  * Compute sin(pi*x), with the multiplication done after reducing x.
XX  * sin_pi(x) is only called for x < 0 and |x| < 2**(p-1) where p is
XX  * the precision of x.
XX  */
XX static double
XX sin_pi(double x)
XX {
XX 	volatile double vz;
XX 	double s,y,z;
XX 
XX 	y = -x;
XX 
XX 	/*
XX 	 * Redundant step to demonstrate a general version.  The usual magic
XX 	 * for rounding needs only no overflow, which this step ensures.
XX 	 */
XX 	if (y >= 0x1p53)
XX 	    return zero;		/* even integer */
XX 
XX 	if (y < 2)
XX 	    goto y_reduced;
XX 	vz = y+0x1p53;
XX 	z = vz-0x1p53;			/* y rounded to a multiple of 2 */
XX 	if (z > y)
XX 	    z -= 2;			/* adjust to round down */
XX 	y = y - z;			/* y mod 2 */
XX y_reduced:
XX 	if (y >= one) {
XX 	    s = -one;
XX 	    y -= one;
XX 	} else
XX 	    s = one;
XX 	if (y == zero)
XX 	    return zero;		/* avoid inexact at poles */
XX 	if (y >= 0.75)
XX 	    y -= one;			/* join with y < 0.25 case */
XX 	if (y < 0.25) {
XX 	    return __kernel_sin(pi*y,zero,0);
XX 	} else
XX 	    return __kernel_cos(pi*(y-half),zero);
XX }

YY static float
YY sin_pif(float x)
YY {
YY 	volatile float vz;
YY 	float s,y,z;
YY 
YY 	y = -x;
YY 
YY 	if (y >= 0x1p24)
YY 	    return zero;		/* y is even integer */
YY 
YY 	if (y < 2)
YY 	    goto y_reduced;
YY 	vz = y+0x1p24F;
YY 	z = vz-0x1p24F;			/* y rounded to a multiple of 2 */
YY 	if (z > y)
YY 	    z -= 2;			/* adjust to round down */
YY 	y = y - z;			/* y mod 2 */
YY y_reduced:
YY 	if (y > one) {
YY 	    s = -one;
YY 	    y -= one;
YY 	} else
YY 	    s = one;
YY 	if (y == zero)
YY 	    return zero;		/* avoid inexact at poles */
YY 	if (y >= 0.75F)
YY 	    y -= one;			/* join with y < 0.25 case */
YY 	if (y < 0.25F)
YY 	    return s*__kernel_sindf(pi*y);
YY 	else
YY 	    return s*__kernel_cosdf(pi*(y-half));
YY }

This also does the sign stuff more like you.

It works to be very sloppy with signs here.  lgamma squares the result.
But leaving out the s's makes little difference to the speed.  After
compressing the case statement, the sign handling takes little space too.

This sin_pif() with its inaccurate multiplication by 'float pi' gives
more accuracy for lgammaf() than your more accurate version.  Replacing
pi in the multiplications by M_PI acts similarly strangely.  This seems
to be because lgamma() does the calculation pi/(sin_pif(x)*x) using its
pi.  When x is small, this pi/(pi*x*x) and it is apparently better to
have the same pi in the numerator and denominator.  My test gives too
much coverage to x's near 0, enough for the error count for these cases
to dominate improvements elsewhere.

>> The result is always +0, exact and valid (maybe -0 in unsupported
>> unusual rounding modes).  We just did a lot of work to avoid inexact.
>
> Not according to n1950.pdf.  It states
>
> sinpi(+-n) = +-0 for positive integers.

Bug in n1950.pdf.  The sign depends on which side of +-n you are on,
and there is no reason to prefer either.  Rather the opposite -- 
limits with x increasing gives the opposite.

>>> +	if (ax + 1 > 1)
>>> +		ax = copysign(0, x);
>>
>> An even slower way of setting the sign than INSERT, especially if
>> the compiler doesn't inline copysign().  Using the correct +0 fixes
>> this.
>
> |x| > 0x1p(P-1) is (should be) exceptional.   Who cares if it's fast.

It is also buggy.  ax here is large, so ax + 1 > 1 is always true.
If ax is DBL_MAX, then ax + 1 gives a spurious overflow exception only
in unsupported rounding mode, and then only if there is no extra precision.

The test is apparently intended to check for odd integers.  Something
like (ax + 1 > ax) is closer to working for that.  But it doesn't work
with extra precision, and risks overflow.

My classification handles this by reducing to 0 <= y < 2 by subtracting
an even integer from y = |x|.  y == 0 means an even integer and y == 1
means an odd integer.  Then it takes special care to return +0 for odd
integers :-).  It reduces 1 to 0; it sets the sign, but doesn't use it
when y was 1.  Falling through to s*__kernel_sin(0...)  would probably
work to give the n1950.pdf bug and not give spurious inexact.

Bruce