Implementation of half-cycle trignometric functions
Steve Kargl
sgk at troutmask.apl.washington.edu
Sat Apr 29 18:10:24 UTC 2017
On Sat, Apr 29, 2017 at 05:54:21PM +1000, Bruce Evans wrote:
> On Fri, 28 Apr 2017, Steve Kargl wrote:
>
> > On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
> >>
> >> I was just backtracking with __kernel_sinpi. This gets a max ULP < 0.61.
>
> Comments on this below.
Way too much information to digest in one reading. I'll
answer the things that are easy ...
> >> x2 = x * x;
> >> p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
> >> s = x * (x2 * (double)p + s0hi + s0lo);
> >> return ((float)s);
> >> }
>
> I don't like the style being completely different. Why lower-case constants?
> ...
Having teethed on musty FORTRAN code, I have an adversion to uppercase.
I also do not like camel case, which seems to plague modern languages.
> > Well, starting with above and playing the splitting game
> > with x, x2, and p. I've manage to reduce the max ULP in
> > the region testd. Testing against MPFR with sin(pi*x)
> > computed in 5*24-bit precision gives
>
> It already cheats by using (double)p, but is limited by using float to
> caclulate p.
Yes, the above is cheating to the extent that the above code
started out as completely written in double. This gives
max ULP < 0.51. I then started reducing the double to float
with the intent of finding when max ULP exceeds some chosen
threshold. The above gave something like max ULP < 0.55.
I then moved the (double) cast to different terms/factors,
observed how the max ULP changed. This then gives information
about what requires extra precision.
> > static inline float
> > __kernel_sinpif(float x)
> > {
> > float p, phi, x2, x2hi, x2lo, xhi, xlo;
> > uint32_t ix;
> >
> > x2 = x * x;
> > p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
> >
> > GET_FLOAT_WORD(ix, p);
> > SET_FLOAT_WORD(phi, (ix >> 14) << 14);
> >
> > GET_FLOAT_WORD(ix, x2);
> > SET_FLOAT_WORD(x2hi, (ix >> 14) << 14);
>
> I expect that these GET/SET's are the slowest part. They are quite fast
> in float prec, but in double prec on old i386 CPUs compilers generate bad
> code which can have penalties of 20 cycles per GET/SET.
>
> Why the strange reduction? The double shift is just a manual optimization
> or pssimization (usually the latter) for clearing low bits. Here it is
> used to clear 14 low bits instead of the usual 12. This is normally
> written using just a mask of 0xffff0000, unless you want a different
> number of bits in the hi terms for technical reasons. Double precision
> can benefit more from asymmetric splitting of terms since 53 is not
> divisible by 2; 1 hi term must have less than 26.5 bits and the other term
> can hold an extra bit.
Because I didn't think about using a mask. :-)
It's easy to change 14 to 13 or 11 or ..., while I would
need to write out zeros and one to come up with 0xffff8000,
etc.
--
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow
More information about the freebsd-hackers
mailing list