Update ENTERI() macro

Wed Feb 27 10:16:08 UTC 2019

On Tue, 26 Feb 2019, Steve Kargl wrote:

> On Wed, Feb 27, 2019 at 05:05:15PM +1100, Bruce Evans wrote:
>> On Tue, 26 Feb 2019, Steve Kargl wrote:
>* ...
>>> Update the ENTERI() macro in math_private.h to take a parameter.
>> ...
>> I don't like this.  It churns and complicates all the simple cases
>> that only need ENTERI().  It bogotifies the existence of ENTERIT(),
> ...
> Okay.  The other option is an ENTERC() and RETURNC() as
> we need to toggle FP_PE for long double complex functions.
> I suppose I could follow the one example currently in the
> tree that uses
>
> 	ENTERIT(long double complex)
>
> I find it somewhat odd that we have
>
> 	ENTERI() /* Implicit declaration of __retval to long double. */
>
> but must use directly ENTERIT(long double complex).

ENTERI() hard-codes the long double for simplicity.  Remember, it is only
needed for long double precision on i386.  But I forgot about long double
complex types, and didn't dream about indirect long double types in sincosl().

> ...
>>> -#define	RETURNI(x)	RETURNF(x)
>>> +#define	ENTERI(a)
>>> +#define	RETURNI(a)	RETURNF(a)
>>> #define	ENTERV()
>>> #define	RETURNV()	return
>>> #endif
>>
>> This also changes RETURNI(), by unimproving its parameter name.  'x' for
>> ENTERI() wasn't a very good name for a type, but is good for a variable.
>> 'x' for RETURNI() is slightly worse than 'r', but better than 'a'
>
> The renaming is for consistency.  I can use 'r'.

'r' is not quite right either, since the arg can be and is often an
expression.  'a' is good for 'arg'.

>> ...
>> But I now see 3 more problems.  The return in RETURNI() is not direct,
>> but goes through the macro RETURNF(x).  In the committed version, this
>> is a default that just returns x, but in my version it returns
>> hackdouble_t(x) or hackfloat_t(x) in some cases (no cases are needed
>> for long doubles, so there is no interaction with ENTERI()/LEAVEI(),
>> and I only do this in a few simple cases not including any with
>> complex types).
>
> I'm fine with making ENTERI() only toggle precision, and adding
> a LEAVEI() to reset precision.  RETURNI(r) would then be
>
> #define RETURNI(r)	\
> do {			\
>   LEAVEI();		\
>   return (r);		\
> } while (0)

No, r may be an expression, so it must be evaluated before LEAVEI().  This
is the reason for the existence of the variable to hold the result.
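
A minimal sketch of the ordering problem, not the code in math_private.h.
The names XENTERIT(), XRETURNI(), BAD_RETURNI(), cur_prec and probe() are
all made up for illustration, and the precision switch is simulated with
a plain variable instead of fpsetprec():

```c
#include <complex.h>

/* Simulated precision mode; stands in for the real FP control word. */
enum prec { PREC_PD, PREC_PE };
static enum prec cur_prec = PREC_PD;
static enum prec prec_seen;	/* mode in effect when the arg was evaluated */

/* Records the mode at evaluation time, then passes the value through. */
static long double
probe(long double v)
{
	prec_seen = cur_prec;
	return (v);
}

/* Declare the result variable, save the old mode, switch to PREC_PE. */
#define	XENTERIT(type)			\
	type retval_;			\
	enum prec oprec_ = cur_prec;	\
	cur_prec = PREC_PE

/* Evaluate the arg first, then restore the mode, then return. */
#define	XRETURNI(a) do {		\
	retval_ = (a);			\
	cur_prec = oprec_;		\
	return (retval_);		\
} while (0)

/* Restores the mode first, so the arg is evaluated in the wrong mode. */
#define	BAD_RETURNI(a) do {		\
	cur_prec = oprec_;		\
	return (a);			\
} while (0)

long double
good(long double x)
{
	XENTERIT(long double);
	XRETURNI(probe(x) * 2);
}

long double
bad(long double x)
{
	XENTERIT(long double);
	(void)retval_;
	BAD_RETURNI(probe(x) * 2);
}

/* The same pair of macros works for a complex return type. */
long double complex
cgood(long double complex z)
{
	XENTERIT(long double complex);
	XRETURNI(z * 2);
}
```

After good(1.0L), prec_seen is PREC_PE; after bad(1.0L), it is PREC_PD,
since the expression was evaluated only after the mode was restored.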

>> [... about complications for the general case]

>> This reminds me of a reason why I don't like sincos*().  Its API
>> requires destruction of efficiency and accuracy by returning the values
>> indirectly.  On i386 with not very old CPUs, this costs about 8 cycles per
>> long double value.  Float and double values cost about half as much.  On
>> amd64, the long double case is the same and the float and double cases
>> are faster.
>
> Not sure your efficiency claim holds.  I've seen significant improvements
> in cexp and cexpf where sin[f]() and cos[f]() are replaced by
> sincos[f].  On my core2 running i386 freebsd, I see 0.1779 usecs/call
> for cexpf with sinf and cosf and 0.12522 usecs/call for sincosf.
> Yes, that's a 29.6% improvement.  For cexp the numbers are 0.2697
> usecs/call for sin and cos and 0.20586 for sincos (ie, 23.7% improvement).
> This is for z = x + I y with x and y in the non-exceptable case.

Combined sin and cos probably does work better outside of benchmarks for
sin and cos alone, since it does less work and so leaves more resources for
the more useful things.

>> sinf() and cosf() on small args take only 15-20 cycles (throughput) on
>> amd64 with not very old CPUs, so 2-8 extra cycles for the 2 indirect
>> return values is a lot.  sincosf() still ends up being slightly faster
>> than separate sinf()/cosf().
>
> Seems to be much faster when used in other functions.

It's hard to be much faster than 15-20 cycles.  The latency is more like
50 cycles, with 3 sinf()'s or cosf()'s running in parallel.

sincos() is far from the best possible optimization for repeated calls on
the same or nearby args.  If sin() and cos() cached the arg reduction, then
separate sin() and cos() on the same arg would run about as fast as sincos(),
and repeated sin()'s on the same arg would run much faster than now.
Caching the arg reduction may also be good when the arg changes slightly.
However, caching is slower if the args are not close.  Even a 1-entry cache
takes a long time to look up relative to the 15-20 cycles taken by sinf()
and cosf().  Caching is complicated by signal handlers and threads.  Perhaps
the right API is one where the caller has to ask for caching and provides the
cache storage.  Then sincos() could be:

 	...
 	_dh_init(x, &dh);		/* prefill 1-entry cache dh */
 	s = _sin_cache(x, &dh, 1);	/* cache hit unless x is NaN; */
 					/* cache misses update dh */
 	c = _cos_cache(x, &dh, 1);	/* cache hit unless x is NaN */
 	...

and with everything inlined this is little different from the current
sincos() except for NaNs.  NaNs can be cache hits too if you compare
them as bits, but the comparison should probably be x == dhp->dh_x
for a 1-entry cache, so as not to have to extract the bits of x.

Bruce