Implementation of half-cycle trigonometric functions
Steve Kargl
Sat Apr 29 19:38:30 UTC 2017
On Sun, Apr 30, 2017 at 05:09:26AM +1000, Bruce Evans wrote:
> On Sat, 29 Apr 2017, Steve Kargl wrote:
>
> > On Sat, Apr 29, 2017 at 05:54:21PM +1000, Bruce Evans wrote:
> >> On Fri, 28 Apr 2017, Steve Kargl wrote:
> > ...
> >>> GET_FLOAT_WORD(ix, p);
> >>> SET_FLOAT_WORD(phi, (ix >> 14) << 14);
> >>>
> >>> GET_FLOAT_WORD(ix, x2);
> >>> SET_FLOAT_WORD(x2hi, (ix >> 14) << 14);
> >>
> >> I expect that these GET/SET's are the slowest part. They are quite fast
> >> in float prec, but in double prec on old i386 CPUs compilers generate bad
> >> code which can have penalties of 20 cycles per GET/SET.
> >>
> >> Why the strange reduction? The double shift is just a manual optimization
> >> or pessimization (usually the latter) for clearing low bits. Here it is
> >> used to clear 14 low bits instead of the usual 12. This is normally
> >> written using just a mask of 0xffff0000, unless you want a different
> >> number of bits in the hi terms for technical reasons. Double precision
> >> can benefit more from asymmetric splitting of terms since 53 is not
> >> divisible by 2; 1 hi term must have less than 26.5 bits and the other term
> >> can hold an extra bit.
> >
> > Because I didn't think about using a mask. :-)
> >
> > It's easy to change 14 to 13 or 11 or ..., while I would
> > need to write out zeros and one to come up with 0xffff8000,
> > etc.
>
> Here are some examples of more delicate splittings from the uncommitted
> clog*(). They are usually faster than GET/SET, but slower than converting
> to lower precision as is often possible for double precision and ld128
> only. clog*() can't use the casting method since it needs to split in the
> middle, and doesn't use GET/SET since it is slow. It uses methods that
> only work on args that are not too large or too small, and uses a GET
> earlier to classify the arg size.
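
For readers unfamiliar with the "converting to lower precision" method
mentioned in the quoted text, here is a minimal sketch for double
precision. The helper name split_by_cast is hypothetical and is not taken
from clog*() or libm. It assumes that x is finite and within float's
normal range, and that the compiler honors the cast to float (compilers
on old i386 with extra precision may not).

```c
#include <assert.h>

/*
 * Hypothetical helper: split x into hi + lo by rounding to float.
 * hi holds at most 24 significant bits, so hi * hi is exact in double.
 * lo holds the remaining low bits, so the split is asymmetric and does
 * not fall in the middle of the 53-bit significand.
 */
static void
split_by_cast(double x, double *hi, double *lo)
{
    *hi = (float)x;     /* round to 24 significant bits */
    *lo = x - *hi;      /* rounding error, exactly representable */

    /* Sanity check; assumes finite x within float's normal range. */
    assert(*hi + *lo == x);
}
```

Calling split_by_cast(0.1, &hi, &lo) leaves the float rounding of 0.1 in
hi and the residual in lo. Because the cut is fixed at 24 bits, this
cannot be used where the split has to fall in the middle.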

I didn't know about these other splitting methods. Thanks for
pointing them out to me.

I updated my k_sinpif.c to use the standard masking with 0xffff0000.
It has no effect on the timing on a Core2 Duo. It did, however, affect
the max ULP. With exhaustive testing in [0x1p-14,0.25] I now have

  MAX ULP: 0.68287528
  Total tested: 100663296
  0.6 < ULP <= 0.7: 5607

The older version with the shifts by 14 bits gives

  MAX ULP: 0.73345101
  Total tested: 100663296
  0.7 < ULP <= 0.8: 45
  0.6 < ULP <= 0.7: 11977

The value of 14 is a holdover from an earlier version.
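
The two ways of clearing low bits discussed above can be compared with
a small sketch. This is not the code from k_sinpif.c. The helpers
float_bits, bits_float and split_by_mask are hypothetical stand-ins that
use memcpy in place of the GET_FLOAT_WORD/SET_FLOAT_WORD macros from
libm's private header, and finite x is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for GET_FLOAT_WORD: read the bits of a float. */
static uint32_t
float_bits(float x)
{
    uint32_t u;

    memcpy(&u, &x, sizeof(u));
    return (u);
}

/* Hypothetical stand-in for SET_FLOAT_WORD: build a float from bits. */
static float
bits_float(uint32_t u)
{
    float x;

    memcpy(&x, &u, sizeof(x));
    return (x);
}

/*
 * Split x into hi + lo by clearing low bits of the word.
 * (ix >> 14) << 14 is the same as ix & 0xffffc000 (14 bits cleared);
 * 0xffff0000 clears 16 low bits and 0xfffff000 clears 12.
 */
static void
split_by_mask(float x, uint32_t mask, float *hi, float *lo)
{
    *hi = bits_float(float_bits(x) & mask);
    *lo = x - *hi;      /* the cleared bits, exact for finite x */

    assert(*hi + *lo == x);
}
```

With this form, changing the number of cleared bits only requires
passing a different mask, e.g. split_by_mask(x, 0xffff0000u, &hi, &lo).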

Getting back to the use of float_t and double_t: if one is willing
to accept the performance penalty, these work well. Changing the
types to float_t in k_cospif.c, I find a slowdown for cospif,
but I also find

  MAX ULP: 0.64679509
  Total tested: 1048576000
  0.6 < ULP <= 0.7: 31598

with exhaustive testing in [0,0.25].
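
The "Total tested" figures above match the number of float bit patterns
in each interval, which suggests the test steps through every float.
The sketch below shows that count and one common way to express an error
in ULPs. It is not the actual test harness; the names word_of,
count_floats and ulp_error are hypothetical, and ulp_error assumes that
the reference value is nonzero and in float's normal range.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read the bits of a float. */
static uint32_t
word_of(float x)
{
    uint32_t u;

    memcpy(&u, &x, sizeof(u));
    return (u);
}

/*
 * Number of floats in [a, b) for 0 <= a <= b.  The bit patterns of
 * nonnegative floats are ordered like the values they represent.
 */
static uint32_t
count_floats(float a, float b)
{
    uint32_t ua = word_of(a), ub = word_of(b);

    assert((ua >> 31) == 0 && (ub >> 31) == 0 && ua <= ub);
    return (ub - ua);
}

/*
 * Error of a float result against a higher-precision reference, in
 * units of the float spacing at the reference value.
 */
static double
ulp_error(float got, double want)
{
    int e;
    double ulp;

    (void)frexp(want, &e);      /* |want| = m * 2**e, 0.5 <= m < 1 */
    ulp = ldexp(1.0, e - 24);   /* spacing of floats near want */
    return (fabs((double)got - want) / ulp);
}
```

A test loop would convert each bit pattern in the interval to a float,
evaluate the function under test, and track the largest ulp_error
against a reference computed in higher precision.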

Steve
