Implementation of half-cycle trignometric functions

Wed May 17 04:20:25 UTC 2017

On Wed, May 17, 2017 at 12:49:45PM +1000, Bruce Evans wrote:
> On Tue, 16 May 2017, Steve Kargl wrote:
> > Index: lib/msun/ld128/s_cospil.c
> > +long double
> > +cospil(long double x)
> > +{
> > +	volatile static const double vzero = 0;
> > +	long double ax, c, xf;
> > +	uint32_t ix;
> > +
> > +	ax = fabsl(x);
> > +
> 
> Style (lots of extra blank lines).

If there are only style issues, then we are fortunate as
I have not even compiled the code.

> > +		ax -= xf;
> > +
> 
> These extra blank lines are especially bad, since they separate related
> code.

Extra blank lines make things easier to read.

> 
> > + * FIXME: This should use a polynomial approximation in [0,0.25], but the
> > + * FIXME: increase in max ULP is probably small, so who cares.
> 
> Doesn't it already use the standard kernel, which uses a poly approx?

Whoops forgot to remove the unneeded comment.

> > +	volatile static const double vzero = 0;
> > +	long double ax, xf;
> > +	uint32_t ix;
> > +
> > +	ax = fabsl(ax);
> > +
> > +	if (ax < 1) {
> > +		if (ax < 0.5) {
> > +			if (ax < 0x1p-60) {
> > +				if (x == 0)
> > +					return (x);
> > +				ax = (double)x;
> 
> Style (unnecessary cast).

Huh?  ax is long double.  x is long double.  The cast should
remove 11 bits.

> > +				x -= ax;
> > +				t = pilo * x + pihi * x + pilo * ax
> > +				    + pihi * ax;
> 
> Style:
> - unnecessary line splitting

The is line is 80 characters without splitting (and even longer if 
one mandates pi_lo and pi_hi).  I use an 80 character wide and wrap
lines that are 79+ characters.

> > Index: lib/msun/ld80/s_cospil.c
> > ...
> > +		if (ix < 0x3ffe)			/* |x| < 0.5 */
> > +			c = __kernel_sinpil(0.5 - ax);
> > +		else if (lx < 0xc000000000000000ull) {	/* |x| < 0.75 */
> 
> Style bug (spelling of the long long abomination using "ull").  The
> explicit abomination is unfortunately needed by old versions of gcc
> with CFLAGS asking for nearly C90.

Whoops that one may have slipped through.

> > +			if (ax == 0.5)
> > +				RETURNI(0);
> 
> Perhaps pessimal to not test for 0.5 in bits.
> 
> It is pessimal to avoid generating inexact for 0.5 at all.  I don't see
> a much better way.

I tried using bits and floating point.  Saw no difference.
cospi(n.5) is exact (see n1950.pdf).  See comment in s_cospi.c
that documents special cases.

> > +		}
> > +		INSERT_LDBL80_WORDS(x, ix, lx);
> > +
> > +		ax -= x;
> > +		EXTRACT_LDBL80_WORDS(ix, lx, ax);
> 
> I don't know if this is a good method.  INSERT/EXTRACT are slow, so I
> doubt that it is.  In e_lgamma*.c:sin_pi*(), we use only a single GET
> of the high word.  This function already used EXTRACT/INSERT to classify,
> and 2 more here.  (_e_lgamma*.c used GET to classify before calling
> sin_pi*(), then 1 more GET.)  INSERT is especially slow.
> 
> Care is needed near the 0x1p63 boundary, since the fast methods require
> adding approx 0x1p63.  At or above the boundary, these methods don't
> obviously work.  Perhaps they give the correct result starting at 2 times
> the boundary, except for setting inexact.  Ugh.  Do you avoid the fast
> methods since they set inexact for most half-integers?  That costs more
> than avoiding inexact for 0.5.

n1950.pdf says cospi(n.5) is exact.  Perhaps, the __kernel_cos()
and __kernel_sin() provide an exact result.  I did not investigate.

> tanpi(0.25 is exact too, so quarter-integers need special treatment too
> if you really want to avoid all spurious inexacts.

It does.

> I now see that some of the above complications are from inlining floor().

Yes, it is an inline of floor().

> > +/*
> > + * The basic kernel for x in [0,0.25].  To exploit the kernel for cos(x),
> > + * the argument to __kernel_cospi() must be multiply by pi.  As pi is an
> > + * irrational number, this allows the splitting of pi*x into high and low
> > + * parts.
> > + */
> 
> Allows?

x = n + r.  For double, r has 53 bits.  There are no trailing low bits
from the argument reduction.  pi is irrational, and therefore not exactly
representable by 53 bits (or any number of bits).  __kernel_cospi() uses
__kernel_cos(), and the low bits come about because of pi.  I split
x into high in low.  So, we have

lo = pilo * xlo + pilo * xhi + pihi * xlo;  /* all the low bits */
hi = pihi * xhi;  /* Exact */
_2sumF(hi, lo);   /* Maybe superfluous? */

> > +
> > +static inline double
> > +__kernel_cospi(double x)
> > +{
> > +	double hi, lo;
> > +	hi = (float)x;
> 
> _2sumF is extensively documented to not work with double.  It requires
> double_t here.

It seems to work base on testing hundreds of millions of values. :-)

> > +#define	INLINE_KERNEL_SINDF
> > +#define	INLINE_KERNEL_COSDF
> 
> Prossibly not worth inlining.
> 
> > +#define  __kernel_cospif(x)   ((float)__kernel_cosdf(M_PI * (x)))
> > +#define  __kernel_sinpif(x)   ((float)__kernel_sindf(M_PI * (x)))
> 
> Style bugs (not even corrupt tabs, but 2 spaces after #define and 3 spaces
> instead of a second tab).

Ugh.  My bad.  I have nedit set to do 3-space tabs.  I forgot to
reset to use hard tabs.

> Bogus casts.  The kernels already return float, although that is pessimal
> and destroys their the xtra precision that they naturally have.

Exhaustive testing shows max ULP < 0.500xyyy where might be 0, but
can't recall now.  How exact to you want these?

> > Index: lib/msun/src/s_sinpi.c
> > ...
> > +double
> > +sinpi(double x)
> > +{
> > +	volatile static const double vzero = 0;
> > +	double ax, s;
> > +	uint32_t hx, ix, j0, lx;
> > +
> > +	EXTRACT_WORDS(hx, lx, x);
> > +	ix = hx & 0x7fffffff;
> > +	INSERT_WORDS(ax, ix, lx);
> 
> A slow way to set ax to fabs(x).

Make it work.  Make it correct.  Make it fast.  I've undertaken
the first two.

> This is otherwise much slower than sin_pi() for lgamma().  That started
> with just a GET of hx, while this EXTRACTs 2 words, like floor(), apparently
> to avoid spurious inexact.
> 
> Note that lgamma(0.5) is naturally inexact.  lgamma(integer) should be exact,
> but I doubt that it is.  Maybe we broke it.

See below.

> > +	if (ix < 0x43300000) {			/* 1 <= |x| < 0x1p52 */
> > +		/* Determine integer part of ax. */
> > +		j0 = ((ix >> 20) & 0x7ff) - 0x3ff;
> > +		if (j0 < 20) {
> > +			ix &= ~(0x000fffff >> j0);
> > +			lx = 0;
> > +		} else {
> > +			lx &= ~((uint32_t)0xffffffff >> (j0 - 20));
> > +		}
> > +		INSERT_WORDS(x, ix, lx);
> > +
> > +		ax -= x;
> > +		EXTRACT_WORDS(ix, lx, ax);
> 
> Much slower than what lgamma()s sin_pi() does.  After our optimizations,
> that does basically rint() inline using ax + 0x1p52 - 0x1p52,

I found a situation where this trick seems to round in the wrong direction.
You participated in the discussion. 

https://lists.freebsd.org/pipermail/freebsd-numerics/2017-March/000658.html
https://lists.freebsd.org/pipermail/freebsd-numerics/2017-March/000660.html

> then a
> further += 0x1p50 to get octants instead of half-cycles.  This takes
> approx 16 cycles latency on modern x86, which is lot.  I think and64 can
> do the above in 10-20 cycles, while i386 might take 20-40 or much
> longer with store-to-load mismatch penalties.
> 
> I think we can avoid the spurious inexacts with a little care.  When
> ax is an integer (and < 0x1p52), we can add at least 0x1p51 to it exactly.
> We also want to avoid inexact for half-integers for sinpi and cospi, and
> quarter-integers for tanpi, and it is good to reduce to quadrants anyway.
> So we might intitially multiply by 4 and do the += 0x1p52 trick only if
> 4*ax < 0x1p52 (or whatever the safe threshold is).  Efficiency doesn't
> matter for larger ax, so slow methods using things like floor(16*ax)
> are adequate, or we can reduce by a few powers of 2 using special cases:
> 
>  	v = 4 * ax;
>  	if (v >= 0x1p54)
>  		;		/* ax is an even integer (check factor) /
>  	else {
>  		if (v >= 0x1p53)
>  			v -= 0x1p53;
>  		if (v >= 0x1p52)
>  			v -= 0x1p52;
>  	}

I'll see if I can understand this and modify waht I have. 
Unfortunately, I'm running out of time to work on this.

> The result is always +0, exact and valid (maybe -0 in unsupported
> unusual rounding modes).  We just did a lot of work to avoid inexact.

Not according to n1950.pdf.  It states

sinpi(+-n) = +-0 for positive integers.

> 
> > +	if (ax + 1 > 1)
> > +		ax = copysign(0, x);
> 
> An even slower way of setting the sign than INSERT, especially if
> the compiler doesn't inline copysign().  Using the correct +0 fixes
> this.

|x| > 0x1p(P-1) is (should be) exceptional.   Who cares if it's fast.

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow