Update ENTERI() macro

Wed Feb 27 16:19:12 UTC 2019

On Wed, Feb 27, 2019 at 09:15:52PM +1100, Bruce Evans wrote:
> On Tue, 26 Feb 2019, Steve Kargl wrote:
> 
> > On Wed, Feb 27, 2019 at 05:05:15PM +1100, Bruce Evans wrote:
> >> On Tue, 26 Feb 2019, Steve Kargl wrote:
> >* ...
> >>> Update the ENTERI() macro in math_private.h to take a parameter.
> >> ...
> >> I don't like this.  It churns and complicates all the simple cases
> >> that only need ENTERI().  It bogotifies the existence of ENTERIT(),
> > ...
> > Okay.  The other option is an ENTERC() and RETURNC() as
> > we need to toggle FP_PE for long double complex functions.
> > I suppose I could follow the one example currently in the
> > tree that uses
> >
> > 	ENTERIT(long double complex)
> >
> > I find it somewhat odd that we have
> >
> > 	ENTERI() /* Implicit declaration of __retval to long double. */
> >
> > but must use ENTERIT(long double complex) directly.
> 
> ENTERI() hard-codes the long double for simplicity.  Remember, it is only
> needed for long double precision on i386.  But I forgot about long double
> complex types, and didn't dream about indirect long double types in sincosl().

That simplicity does not work for long double complex.  We will
need either ENTERIC as in

#define ENTERIC() ENTERIT(long double complex)

or a direct use of ENTERIT, as you have done in s_clogl.c.

> 
> > ...
> >>> -#define	RETURNI(x)	RETURNF(x)
> >>> +#define	ENTERI(a)
> >>> +#define	RETURNI(a)	RETURNF(a)
> >>> #define	ENTERV()
> >>> #define	RETURNV()	return
> >>> #endif
> >>
> >> This also changes RETURNI(), by unimproving its parameter name.  'x' for
> >> ENTERI() wasn't a very good name for a type, but is good for a variable.
> >> 'x' for RETURNI() is slightly worse than 'r', but better than 'a'
> >
> > The renaming is for consistency.  I can use 'r'.
> 
> 'r' is not quite right either, since the arg can be and is often an
> expression.  'a' is good for 'arg'.
> 
> >> ...
> >> But I now see 3 more problems.  The return in RETURNI() is not direct,
> >> but goes through the macro RETURNF(x).  In the committed version, this
> >> is a default that just returns x, but in my version it returns
> >> hackdouble_t(x) or hackfloat_t(x) in some cases (no cases are needed
> >> for long doubles, so there is no interaction with ENTERI()/LEAVEI(),
> >> and I only do this in a few simple cases not including any with
> >> complex types).
> >
> > I'm fine with making ENTERI() only toggle precision, and adding
> > a LEAVEI() to reset precision.  RETURNI(r) would then be
> >
> > #define RETURNI(r)	\
> > do {		\
> >   LEAVEI();		\
> >   return (r);	\
> > } while (0)
> 
> No, r may be an expression, so it must be evaluated before LEAVEI().  This
> is the reason for the existence of the variable to hold the result.

So, we'll need one RETURNI for long double and one for long double complex.
Or, we give RETURNI a second parameter, which is the input parameter of
the function, reused to hold the result:

#define RETURNI(x, r)	\
do {			\
	x = (r);	\
	LEAVEI();	\
	return (x);	\
} while (0)

This will cause a lot of churn.

So, it seems that ENTERIC is the way forward.
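
To make the evaluation-order point concrete, here is a minimal sketch of
the ENTERIT()/ENTERIC()/RETURNI() pattern.  It is not a copy of
math_private.h: the precision state is mocked with a plain variable
(mock_prec, MOCK_PD, MOCK_PE, probe() and evals_in_pe are hypothetical
names standing in for the real precision control), so it builds on any
platform.  The only point is that the argument of RETURNI() is evaluated
into __retval while the raised precision is still in effect.

```c
#include <complex.h>

/* Mock precision state: hypothetical stand-in for the real FP control. */
enum mock_prec { MOCK_PD, MOCK_PE };
static enum mock_prec mock_prec = MOCK_PD;
static int evals_in_pe;	/* evaluations seen while MOCK_PE is in effect */

/* Declare a result variable of the given type, then raise precision. */
#define	ENTERIT(type)				\
	type __retval;				\
	enum mock_prec __oprec = mock_prec;	\
	mock_prec = MOCK_PE

#define	ENTERI()	ENTERIT(long double)
#define	ENTERIC()	ENTERIT(long double complex)

/* Evaluate the argument first, then restore precision, then return. */
#define	RETURNI(a) do {				\
	__retval = (a);				\
	mock_prec = __oprec;			\
	return (__retval);			\
} while (0)

/* Records which precision is in effect when the expression is evaluated. */
static long double
probe(long double v)
{
	if (mock_prec == MOCK_PE)
		evals_in_pe++;
	return (v);
}

long double
real_example(long double x)
{
	ENTERI();
	RETURNI(probe(x) * 2);
}

long double complex
complex_example(long double complex z)
{
	ENTERIC();
	RETURNI(probe(creall(z)) + I * cimagl(z));
}
```

With this shape, the long double complex functions differ from the real
ones only in the entry macro; RETURNI() is shared because __retval
already has the right type.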

> >> [... about complications for the general case]
> 
> >> This reminds me of a reason why I don't like sincos*().  Its API
> >> requires destruction of efficiency and accuracy by returning the values
> >> indirectly.  On i386 with not very old CPUs, this costs about 8 cycles per
> >> long double value.  Float and double values cost about half as much.  On
> >> amd64, the long double case is the same and the float and double cases
> >> are faster.
> >
> > Not sure your efficiency claim holds.  I've seen significant improvements
> > in cexp and cexpf where sin[f]() and cos[f]() are replaced by
> > sincos[f].  On my core2 running i386 freebsd, I see 0.1779 usecs/call
> > for cexpf with sinf and cosf and 0.12522 usecs/call for sincosf.
> > Yes, that's a 29.6% improvement.  For cexp the numbers are 0.2697
> > usecs/call for sin and cos and 0.20586 for sincos (ie, 23.7% improvement).
> > This is for z = x + I y with x and y in the non-exceptional case.
> 
> Combined sin and cos probably does work better outside of benchmarks for
> sin and cos alone, since it does less work so leaves more resources for
> the more useful things.

Exactly!  I have a significant amount of Fortran code that does

   z = cmplx(cos(x), sin(x))

In modern C this is 'z = CMPLX(cos(x), sin(x))'.  GCC with optimization
enabled will convert this to z = cexp(cmplx(0,x)), where it expects cexp
to optimize this to sincos().  GCC on FreeBSD will not do this optimization
because FreeBSD's libm is not C99 compliant.

> >> sinf() and cosf() on small args take only 15-20 cycles (throughput) on
> >> amd64 with not very old CPUs, so 2-8 extra cycles for the 2 indirect
> >> return values is a lot.  sincosf() still ends up being slightly faster
> >> than separate sinf()/cosf().
> >
> > Seems to be much faster when used in other functions.
> 
> It's hard to be much faster than 15-20 cycles.  The latency is more like
> 50 cycles, with 3 sinf()'s or cosf()'s running in parallel.
> 
> sincos() is far from the best possible optimization for repeated calls on
> the same or nearby args.  If sin() and cos() cached the arg reduction, then
> separate sin() and cos() on the same arg would run about as fast as sincos(),
> and repeated sin()'s on the same arg would run much faster than now.
> Caching the arg reduction may also be good when the arg changes slightly.
> However, caching is slower if the args are not close.  Even a 1-entry cache
> takes a long time to look up relative to the 15-20 cycles taken by sinf()
> and cosf().  Caching is complicated by signal handlers and threads.  Perhaps
> the right API is one that has to ask for caching and provides the cache storage.
> Then sincos() could be:
> 
>  	...
>  	_dh_init(x, &dh);		/* prefill 1-entry cache dh */
>  	s = _sin_cache(x, &dh, 1);	/* cache hit unless x is NaN; */
>  					/* cache misses update dh */
>  	c = _cos_cache(x, &dh, 1);	/* cache hit unless x is NaN */
>  	...
> 
> and with everything inlined this is little different from the current
> sincos() except for NaNs.  NaNs can be cache hits too if you compare
> them as bits, but the comparison should probably be x == dhp->dh_x
> for a 1-entry cache, so as not to have to extract the bits of x.

When I worked on sincos() I tried a few variations.  This included
the simplest implementation:

void
sincos(double x, double *s, double *c)
{
  *c = cos(x);
  *s = sin(x);
}

I tried argument reduction with kernels.

void
sincos(double x, double *s, double *c)
{
  double a;

  /* Inline argument reduction of x sets a. */
  *c = k_cos(a);
  *s = k_sin(a);
}

And finally the version that was committed where k_cos and k_sin
were manually inlined and re-arranged to reduce redundant computations.

Never thought about some caching mechanism.  It seems to be more
complicated than it may be worth.

-- 
Steve