Implementation of half-cycle trignometric functions
Steve Kargl
sgk at troutmask.apl.washington.edu
Wed May 17 04:20:25 UTC 2017
On Wed, May 17, 2017 at 12:49:45PM +1000, Bruce Evans wrote:
> On Tue, 16 May 2017, Steve Kargl wrote:
> > Index: lib/msun/ld128/s_cospil.c
> > +long double
> > +cospil(long double x)
> > +{
> > + volatile static const double vzero = 0;
> > + long double ax, c, xf;
> > + uint32_t ix;
> > +
> > + ax = fabsl(x);
> > +
>
> Style (lots of extra blank lines).
If there are only style issues, then we are fortunate as
I have not even compiled the code.
> > + ax -= xf;
> > +
>
> These extra blank lines are especially bad, since they separate related
> code.
Extra blank lines make things easier to read.
>
> > + * FIXME: This should use a polynomial approximation in [0,0.25], but the
> > + * FIXME: increase in max ULP is probably small, so who cares.
>
> Doesn't it already use the standard kernel, which uses a poly approx?
Whoops forgot to remove the unneeded comment.
> > + volatile static const double vzero = 0;
> > + long double ax, xf;
> > + uint32_t ix;
> > +
> > + ax = fabsl(ax);
> > +
> > + if (ax < 1) {
> > + if (ax < 0.5) {
> > + if (ax < 0x1p-60) {
> > + if (x == 0)
> > + return (x);
> > + ax = (double)x;
>
> Style (unnecessary cast).
Huh? ax is long double. x is long double. The cast should
remove 11 bits.
> > + x -= ax;
> > + t = pilo * x + pihi * x + pilo * ax
> > + + pihi * ax;
>
> Style:
> - unnecessary line splitting
The is line is 80 characters without splitting (and even longer if
one mandates pi_lo and pi_hi). I use an 80 character wide and wrap
lines that are 79+ characters.
> > Index: lib/msun/ld80/s_cospil.c
> > ...
> > + if (ix < 0x3ffe) /* |x| < 0.5 */
> > + c = __kernel_sinpil(0.5 - ax);
> > + else if (lx < 0xc000000000000000ull) { /* |x| < 0.75 */
>
> Style bug (spelling of the long long abomination using "ull"). The
> explicit abomination is unfortunately needed by old versions of gcc
> with CFLAGS asking for nearly C90.
Whoops that one may have slipped through.
> > + if (ax == 0.5)
> > + RETURNI(0);
>
> Perhaps pessimal to not test for 0.5 in bits.
>
> It is pessimal to avoid generating inexact for 0.5 at all. I don't see
> a much better way.
I tried using bits and floating point. Saw no difference.
cospi(n.5) is exact (see n1950.pdf). See comment in s_cospi.c
that documents special cases.
> > + }
> > + INSERT_LDBL80_WORDS(x, ix, lx);
> > +
> > + ax -= x;
> > + EXTRACT_LDBL80_WORDS(ix, lx, ax);
>
> I don't know if this is a good method. INSERT/EXTRACT are slow, so I
> doubt that it is. In e_lgamma*.c:sin_pi*(), we use only a single GET
> of the high word. This function already used EXTRACT/INSERT to classify,
> and 2 more here. (_e_lgamma*.c used GET to classify before calling
> sin_pi*(), then 1 more GET.) INSERT is especially slow.
>
> Care is needed near the 0x1p63 boundary, since the fast methods require
> adding approx 0x1p63. At or above the boundary, these methods don't
> obviously work. Perhaps they give the correct result starting at 2 times
> the boundary, except for setting inexact. Ugh. Do you avoid the fast
> methods since they set inexact for most half-integers? That costs more
> than avoiding inexact for 0.5.
n1950.pdf says cospi(n.5) is exact. Perhaps, the __kernel_cos()
and __kernel_sin() provide an exact result. I did not investigate.
> tanpi(0.25 is exact too, so quarter-integers need special treatment too
> if you really want to avoid all spurious inexacts.
It does.
> I now see that some of the above complications are from inlining floor().
Yes, it is an inline of floor().
> > +/*
> > + * The basic kernel for x in [0,0.25]. To exploit the kernel for cos(x),
> > + * the argument to __kernel_cospi() must be multiply by pi. As pi is an
> > + * irrational number, this allows the splitting of pi*x into high and low
> > + * parts.
> > + */
>
> Allows?
x = n + r. For double, r has 53 bits. There are no trailing low bits
from the argument reduction. pi is irrational, and therefore not exactly
representable by 53 bits (or any number of bits). __kernel_cospi() uses
__kernel_cos(), and the low bits come about because of pi. I split
x into high in low. So, we have
lo = pilo * xlo + pilo * xhi + pihi * xlo; /* all the low bits */
hi = pihi * xhi; /* Exact */
_2sumF(hi, lo); /* Maybe superfluous? */
> > +
> > +static inline double
> > +__kernel_cospi(double x)
> > +{
> > + double hi, lo;
> > + hi = (float)x;
>
> _2sumF is extensively documented to not work with double. It requires
> double_t here.
It seems to work base on testing hundreds of millions of values. :-)
> > +#define INLINE_KERNEL_SINDF
> > +#define INLINE_KERNEL_COSDF
>
> Prossibly not worth inlining.
>
> > +#define __kernel_cospif(x) ((float)__kernel_cosdf(M_PI * (x)))
> > +#define __kernel_sinpif(x) ((float)__kernel_sindf(M_PI * (x)))
>
> Style bugs (not even corrupt tabs, but 2 spaces after #define and 3 spaces
> instead of a second tab).
Ugh. My bad. I have nedit set to do 3-space tabs. I forgot to
reset to use hard tabs.
> Bogus casts. The kernels already return float, although that is pessimal
> and destroys their the xtra precision that they naturally have.
Exhaustive testing shows max ULP < 0.500xyyy where might be 0, but
can't recall now. How exact to you want these?
> > Index: lib/msun/src/s_sinpi.c
> > ...
> > +double
> > +sinpi(double x)
> > +{
> > + volatile static const double vzero = 0;
> > + double ax, s;
> > + uint32_t hx, ix, j0, lx;
> > +
> > + EXTRACT_WORDS(hx, lx, x);
> > + ix = hx & 0x7fffffff;
> > + INSERT_WORDS(ax, ix, lx);
>
> A slow way to set ax to fabs(x).
Make it work. Make it correct. Make it fast. I've undertaken
the first two.
> This is otherwise much slower than sin_pi() for lgamma(). That started
> with just a GET of hx, while this EXTRACTs 2 words, like floor(), apparently
> to avoid spurious inexact.
>
> Note that lgamma(0.5) is naturally inexact. lgamma(integer) should be exact,
> but I doubt that it is. Maybe we broke it.
See below.
> > + if (ix < 0x43300000) { /* 1 <= |x| < 0x1p52 */
> > + /* Determine integer part of ax. */
> > + j0 = ((ix >> 20) & 0x7ff) - 0x3ff;
> > + if (j0 < 20) {
> > + ix &= ~(0x000fffff >> j0);
> > + lx = 0;
> > + } else {
> > + lx &= ~((uint32_t)0xffffffff >> (j0 - 20));
> > + }
> > + INSERT_WORDS(x, ix, lx);
> > +
> > + ax -= x;
> > + EXTRACT_WORDS(ix, lx, ax);
>
> Much slower than what lgamma()s sin_pi() does. After our optimizations,
> that does basically rint() inline using ax + 0x1p52 - 0x1p52,
I found a situation where this trick seems to round in the wrong direction.
You participated in the discussion.
https://lists.freebsd.org/pipermail/freebsd-numerics/2017-March/000658.html
https://lists.freebsd.org/pipermail/freebsd-numerics/2017-March/000660.html
> then a
> further += 0x1p50 to get octants instead of half-cycles. This takes
> approx 16 cycles latency on modern x86, which is lot. I think and64 can
> do the above in 10-20 cycles, while i386 might take 20-40 or much
> longer with store-to-load mismatch penalties.
>
> I think we can avoid the spurious inexacts with a little care. When
> ax is an integer (and < 0x1p52), we can add at least 0x1p51 to it exactly.
> We also want to avoid inexact for half-integers for sinpi and cospi, and
> quarter-integers for tanpi, and it is good to reduce to quadrants anyway.
> So we might intitially multiply by 4 and do the += 0x1p52 trick only if
> 4*ax < 0x1p52 (or whatever the safe threshold is). Efficiency doesn't
> matter for larger ax, so slow methods using things like floor(16*ax)
> are adequate, or we can reduce by a few powers of 2 using special cases:
>
> v = 4 * ax;
> if (v >= 0x1p54)
> ; /* ax is an even integer (check factor) /
> else {
> if (v >= 0x1p53)
> v -= 0x1p53;
> if (v >= 0x1p52)
> v -= 0x1p52;
> }
I'll see if I can understand this and modify waht I have.
Unfortunately, I'm running out of time to work on this.
> The result is always +0, exact and valid (maybe -0 in unsupported
> unusual rounding modes). We just did a lot of work to avoid inexact.
Not according to n1950.pdf. It states
sinpi(+-n) = +-0 for positive integers.
>
> > + if (ax + 1 > 1)
> > + ax = copysign(0, x);
>
> An even slower way of setting the sign than INSERT, especially if
> the compiler doesn't inline copysign(). Using the correct +0 fixes
> this.
|x| > 0x1p(P-1) is (should be) exceptional. Who cares if it's fast.
--
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow
More information about the freebsd-numerics
mailing list