Implementation of half-cycle trignometric functions

Sat Apr 29 20:13:01 UTC 2017

On Sat, 29 Apr 2017, Steve Kargl wrote:

> On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote:
>. ...
>> Using the standard kernels is easy and works well:
>
> As your code only works on the interval [0,0.25], I took
> the liberty to use it as a __kernel_sinpi and __kernel_cospi.

You already wrote the reduction part, and lgamma already has it to.

I just checked lgamma.  Its reduction part is short (especially
after your improvements to it).  Then it is sloppy and doesn't
caclulate Pi*x in extra precision.  I think this is justified since
lgamma generally has errors much larger than 1 ulp.  it does some
micro-optimizations involving minus signs.  There is some danger that
with the kernels inline, the compiler will generate large inline
code for multiple copies whose difference is just the placement of
a single instruction to do the negation.  If this is an optimization,
then it is small.

>
>> XX double
>> XX cospi(double x)
>> XX {
>> XX 	double_t hi, lo;
>
> If sizeof(double_t) indicates what I think it means,
> This is slow on my Core2 duo (aka ia32 system).

I doubt that it is what you think it is, since you neglect testing
on i386 :-).  It makes no difference on amd64, but is needed on i386
to optimized and give correct code.  However, double_t is broken with
clang on i386 with certain CFLAGS that give SSE.  Then clang produces
code that is incompatible with FLT_EVAL_METHOD and double_t.  This
makes clangs use of SSE a pessimization.  It dodn't seem to break
correctness (the result is the same as using long double instead of
double_t).

>> XX 	hi = (float)x;
>> XX 	lo = x - hi;
>
> This is the splitting I use in my double version versions
> with hi and lo as simply double.

It will work without double_t for the splitting.  double_t (or
lots of slow STRICT_ASSIGN()s is needed for 2sum.  2sum is like
the above except it casts to double instead of float, and compilers
intentionally break casts.

>> ...
>> Efficiency is very good in some cases, but anomalous in others: all
>> times in cycles, on i386, on the range [0, 0.25]
>>
>> athlon-xp, gcc-3.3           Haswell, gcc-3.3   Haswell, gcc-4.2.1
>> cos:   61-62                 44                 43
>> cospi: 69-71 (8-9 extra)     78 (anomalous...)  42 (faster to do more!)
>> ...
>
> I is unclear how you're making your measurements.   My timings
> with my kernels compared to kernels based on your code:

Basically by timing a for loop, but with many refinements to make sure
that I am timing the function and not the loop, or at least to make sure
that the loop timing doesn't change.  For example, on amd64 and i386,
the stack is aligned at 4K plus a smaller adjustable parameter since
otherwise the timing for long doubles is too variable.  Also, rebuilding
all the object files used with identical CFLAGS (usually -O
-march=better-than-native).  This probably makes my tests run 20-50
cycles per iteration faster (the simplest non-inline function takes 8
or 9 cycles and that takes running it in parallel with itself).

>      |   Bruce      |   Steve
> ------+--------------+--------------
> sinpi | 0.0742 (148) | 0.0633 (126)
> cospi | 0.0720 (144) | 0.0513 (102)
>
> First number is microseconds per call and the (xxx) is the time*cpu_freq.

The difference might be loop overhead or loss of parallelism.  IIRC, you
use something to serialize after every function.  I have 4 variations and
rarely use the serialization ones.

Or thie difference might be more that the kernels are not inlined.  I
forgot to inline them too.  I always rebuild the kernels for every test,
so they get compiled with non-default CFLAGS, but they don't get inlined
without #include statements.

> As far as over-engineering, for sinpi I find
>
> sinpi            Bruce kernel    Steve kernel
>         MAX ULP: 0.73021263     0.73955815
>    Total tested: 33554431       33554431
> 0.7 < ULP <= 0.8: 154            280
> 0.6 < ULP <= 0.7: 27650          29197

See, the kernels are already tuned almost as much as possible for
accuracy/efficiency tradeoffs.

> cospi is much more interesting and as you state above more
> difficult to get right.  I have reworked my kernel, yet,
> but I find
>
> cospi             Bruce kernel    Steve kernel
>         MAX ULP: 0.78223389      0.89921787
>    Total tested: 33554431        33554431
> 0.8 < ULP <= 0.9: 0               3262
> 0.7 < ULP <= 0.8: 9663            68305
> 0.6 < ULP <= 0.7: 132948          346214
>
> Perhaps, using double_t would reduce my max ULP at the expense
> of speed.

On i386, it should reduce the error at the expense of slowness.
On amd64, it makes no difference, but using long double should
reduct the error at the expense of speed.

Usually the reduction in the maximum error is only about 0.1 ulps
unless the algorithm uses extra precision intentionally.  This
is available on i386, but is not large enough to compensate for
the slowness.

My example of log10*() unintentionally gave an example of using
extra precision intentionally for efficiency.  Here it is for log10():

XX double
XX log10(double x)
XX {
XX 	struct Double r;
XX 	double_t hi, lo;
XX 
XX 	DOPRINT_START(&x);
XX 	k_log(x, &r);
XX 	if (!r.lo_set)
XX 		RETURNP(r.hi);
XX 	if (0 && sizeof(double_t) > sizeof(double))
XX 		RETURNP((invln10_lo + (double_t)invln10_hi) * (r.lo + r.hi));

This avoids the emulated extra-precision calculations of possible.  This
is under 'if (0)' since it is too hard to determine if extra precision is
available.  The sizeof() calculation doesn't tell you if it is.  On i386,
double_t is long double, but extra precision is not available under FreeBSD
unless the program used fpsetprec(FP_PE) to change the default of double
precision.

This is not under 'if (0)' for float precision.  Then it is assume that
the program doesn't use fpsetprec(FP_PS) to change the default to float
precision.

XX 	_2sumF(r.hi, r.lo);
XX 	hi = (float)r.hi;
XX 	lo = r.lo + (r.hi - hi);
XX 	RETURN2P(invln10_hi * hi,
XX 	    (invln10_lo + invln10_hi) * lo + invln10_lo * hi);
XX }

i386/i387, when used as designed, can execute this function much faster
than SSE by not executing the lines beginning with _2sumF().  These lines
take 20-40 cycles (latency) on modern SSE.  Throughput is better.

This would probably be a good optimization for amd64 too (using long
double instead of double_t for the extra precision).  The Intel ia64
math library uses this method of switching to extended precision often.
ia64 has a better FP design.  I don't know its details, but it seems
to combine the best features of i387 and SSE.  The FP registers should
have extra precision as on i387, but the instructions should make it
efficient to not use the extra precision for languages and programmers
that don't want it.

Bruce