Implementation of half-cycle trignometric functions
Steve Kargl
sgk at troutmask.apl.washington.edu
Sat Apr 29 18:42:11 UTC 2017
On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote:
> On Sat, 29 Apr 2017, Bruce Evans wrote:
>
> > On Fri, 28 Apr 2017, Steve Kargl wrote:
> >
> >> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
> >>>
> >>> I was just backtracking with __kernel_sinpi. This gets a max ULP < 0.61.
> >
> > Comments on this below.
> >
> > This is all rather over-engineered. Optimizing these functions is
> > unimportant comparing with finishing cosl() and sinl() and optimizing
> > all of the standard trig functions better, but we need correctness.
> > But I now see many simplifications and improvements:
> >
> > (1) There is no need for new kernels. The standard kernels already handle
> > extra precision using approximations like:
> >
> > sin(x+y) ~= sin(x) + (1-x*x/2)*y.
> >
> > Simply reduce x and write Pi*x = hi+lo. Then
> >
> > sin(Pi*x) = __kernel_sin(hi, lo, 1).
> >
> > I now see how to do the extra-precision calculations without any
> > multiplications.
>
> But that is over-engineered too.
>
> Using the standard kernels is easy and works well:
As your code only works on the interval [0,0.25], I took
the liberty to use it as a __kernel_sinpi and __kernel_cospi.
> XX double
> XX cospi(double x)
> XX {
> XX double_t hi, lo;
If sizeof(double_t) indicates what I think it means,
This is slow on my Core2 duo (aka ia32 system).
> XX hi = (float)x;
> XX lo = x - hi;
This is the splitting I use in my double version versions
with hi and lo as simply double.
> XX lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
> XX hi = pi_hi * hi;
> XX _2sumF(hi, lo);
> XX return __kernel_cos(hi, lo);
> XX }
> XX
>
> I only did a sloppy accuracy test for sinpi(). It was 0.03 ulps less
> accurate than sin() on the range [0, 0.25] for it and [0, Pi/4] for
> sin().
>
> Efficiency is very good in some cases, but anomalous in others: all
> times in cycles, on i386, on the range [0, 0.25]
>
> athlon-xp, gcc-3.3 Haswell, gcc-3.3 Haswell, gcc-4.2.1
> cos: 61-62 44 43
> cospi: 69-71 (8-9 extra) 78 (anomalous...) 42 (faster to do more!)
> sin: 59-60 51 37
> sinpi: 67-68 (8 extra) 80 42
> tan: 136-172 93-195 67-94
> tanpi: 144-187 (8-15 extra) 145-176 61-189
>
> That was a throughput test. Latency is not so good. My latency test
> doesn't use serializing instructions, but uses random args and the
> partial serialization of making each result depend on the previous
> one.
>
> athlon-xp, gcc-3.3 Haswell, gcc-3.3 Haswell, gcc-4.2.1
> cos: 84-85 69 79
> cospi: 103-104 (19-21 extra) 117 94
> sin: 75-76 89 77
> sinpi: 105-106 (30 extra) 116 90
> tan: 168-170 167-168 147
> tanpi: 191-194 (23-24 extra) 191 154
I is unclear how you're making your measurements. My timings
with my kernels compared to kernels based on your code:
| Bruce | Steve
------+--------------+--------------
sinpi | 0.0742 (148) | 0.0633 (126)
cospi | 0.0720 (144) | 0.0513 (102)
First number is microseconds per call and the (xxx) is the time*cpu_freq.
As far as over-engineering, for sinpi I find
sinpi Bruce kernel Steve kernel
MAX ULP: 0.73021263 0.73955815
Total tested: 33554431 33554431
0.7 < ULP <= 0.8: 154 280
0.6 < ULP <= 0.7: 27650 29197
cospi is much more interesting and as you state above more
difficult to get right. I have reworked my kernel, yet,
but I find
cospi Bruce kernel Steve kernel
MAX ULP: 0.78223389 0.89921787
Total tested: 33554431 33554431
0.8 < ULP <= 0.9: 0 3262
0.7 < ULP <= 0.8: 9663 68305
0.6 < ULP <= 0.7: 132948 346214
Perhaps, using double_t would reduce my max ULP at the expense
of speed.
--
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow
More information about the freebsd-numerics
mailing list