Update ENTERI() macro
Steve Kargl
sgk at troutmask.apl.washington.edu
Wed Feb 27 16:19:12 UTC 2019
On Wed, Feb 27, 2019 at 09:15:52PM +1100, Bruce Evans wrote:
> On Tue, 26 Feb 2019, Steve Kargl wrote:
>
> > On Wed, Feb 27, 2019 at 05:05:15PM +1100, Bruce Evans wrote:
> >> On Tue, 26 Feb 2019, Steve Kargl wrote:
> >> ...
> >>> Update the ENTERI() macro in math_private.h to take a parameter.
> >> ...
> >> I don't like this. It churns and complicates all the simple cases
> >> that only need ENTERI(). It bogotifies the existence of ENTERIT(),
> > ...
> > Okay. The other option is an ENTERC() and RETURNC() as
> > we need to toggle FP_PE for long double complex functions.
> > I suppose I could follow the one example currently in the
> > tree that uses
> >
> > ENTERIT(long double complex)
> >
> > I find it somewhat odd that we have
> >
> > ENTERI() /* Implicit declaration of __retval to long double. */
> >
> > but must use ENTERIT(long double complex) directly.
>
> ENTERI() hard-codes the long double for simplicity. Remember, it is only
> needed for long double precision on i386. But I forgot about long double
> complex types, and didn't dream about indirect long double types in sincosl().
That simplicity does not work for long double complex. We will
need either ENTERIC as in
#define ENTERIC() ENTERIT(long double complex)
or a direct use of ENTERIT as you have done in s_clogl.c
>
> > ...
> >>> -#define RETURNI(x) RETURNF(x)
> >>> +#define ENTERI(a)
> >>> +#define RETURNI(a) RETURNF(a)
> >>> #define ENTERV()
> >>> #define RETURNV() return
> >>> #endif
> >>
> >> This also changes RETURNI(), by unimproving its parameter name. 'x' for
> >> ENTERI() wasn't a very good name for a type, but is good for a variable.
> >> 'x' for RETURNI() is slightly worse than 'r', but better than 'a'.
> >
> > The renaming is for consistency. I can use 'r'.
>
> 'r' is not quite right either, since the arg can be and is often an
> expression. 'a' is good for 'arg'.
>
> >> ...
> >> But I now see 3 more problems. The return in RETURNI() is not direct,
> >> but goes through the macro RETURNF(x). In the committed version, this
> >> is a default that just returns x, but in my version it returns
> >> hackdouble_t(x) or hackfloat_t(x) in some cases (no cases are needed
> >> for long doubles, so there is no interaction with ENTERI()/LEAVEI(),
> >> and I only do this in a few simple cases not including any with
> >> complex types).
> >
> > I'm fine with making ENTERI() only toggle precision, and adding
> > a LEAVEI() to reset precision. RETURNI(r) would then be
> >
> > #define RETURNI(r) \
> > do { \
> >     LEAVEI(); \
> >     return (r); \
> > } while (0)
>
> No, r may be an expression, so it must be evaluated before LEAVEI(). This
> is the reason for existence of the variable to hold the result.
So, we'll need RETURNI for long double and one for long double complex.
Or, we give RETURNI a second parameter, which is the input parameter of
the function
#define RETURNI(x, r) \
do { \
    x = (r); \
    LEAVEI(); \
    return (x); \
} while (0)
This will cause a lot of churn.
So, it seems that ENTERIC is the way forward.
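
To make the trade-off concrete, here is a rough, portable sketch of that
direction. It is not the math_private.h code: mock_getprec(), mock_setprec(),
and the MOCK_FP_* values are hypothetical stand-ins for the i386 precision
controls, retval_ and oprec_ stand in for the hidden variables, and observe(),
twicel(), and cdoublel() exist only for illustration. The point it shows is
that the result is stored while extended precision is still in effect, and
only then is the old precision restored.

```c
#include <complex.h>

/* Hypothetical stand-ins for the i386 precision controls. */
typedef enum { MOCK_FP_PD, MOCK_FP_PE } mock_prec_t;
static mock_prec_t mock_prec = MOCK_FP_PD;
static mock_prec_t mock_getprec(void) { return (mock_prec); }
static void mock_setprec(mock_prec_t p) { mock_prec = p; }

/* Declare the result variable, then switch to extended precision. */
#define ENTERIT(returntype) \
	returntype retval_; \
	mock_prec_t oprec_; \
	if ((oprec_ = mock_getprec()) != MOCK_FP_PE) \
		mock_setprec(MOCK_FP_PE)
#define ENTERI()	ENTERIT(long double)
#define ENTERIC()	ENTERIT(long double complex)

/* Evaluate x first, restore the saved precision, then return. */
#define RETURNI(x) do { \
	retval_ = (x); \
	if (oprec_ != MOCK_FP_PE) \
		mock_setprec(oprec_); \
	return (retval_); \
} while (0)

/* Records which precision was active when the result was evaluated. */
static mock_prec_t prec_seen;

static long double
observe(long double v)
{
	prec_seen = mock_getprec();
	return (v);
}

/* A long double function: the simple ENTERI() case. */
static long double
twicel(long double x)
{
	ENTERI();
	RETURNI(observe(x + x));
}

/* A long double complex function: the same RETURNI(), but ENTERIC(). */
static long double complex
cdoublel(long double complex z)
{
	ENTERIC();
	RETURNI(z + z);
}
```

With this shape, RETURNI() keeps a single parameter and the existing
ENTERI() callers are untouched; only the complex functions pick a different
entry macro.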
> >> [... about complications for the general case]
>
> >> This reminds me of a reason why I don't like sincos*(). Its API
> >> requires destruction of efficiency and accuracy by returning the values
> >> indirectly. On i386 with not very old CPUs, this costs about 8 cycles per
> >> long double value. Float and double values cost about half as much. On
> >> amd64, the long double case is the same and the float and double cases
> >> are faster.
> >
> > Not sure your efficiency claim holds. I've seen significant improvements
> > in cexp and cexpf where sin[f]() and cos[f]() are replaced by
> > sincos[f]. On my core2 running i386 freebsd, I see 0.1779 usecs/call
> > for cexpf with sinf and cosf and 0.12522 usecs/call for sincosf.
> > Yes, that's a 29.6% improvement. For cexp the numbers are 0.2697
> > usecs/call for sin and cos and 0.20586 for sincos (i.e., a 23.7% improvement).
> > This is for z = x + I y with x and y in the non-exceptional case.
>
> Combined sin and cos probably does work better outside of benchmarks for
> sin and cos alone, since it does less work so leaves more resources for
> other, more useful things.
Exactly! I have a significant amount of Fortran code that does
z = cmplx(cos(x), sin(x))
In modern C, this is 'z = CMPLX(cos(x), sin(x))'. GCC with optimization
enabled will convert this to z = cexp(cmplx(0,x)) where it expects cexp
to optimize this to sincos(). GCC on FreeBSD will not do this optimization
because FreeBSD's libm is not C99 compliant.
> >> sinf() and cosf() on small args take only 15-20 cycles (throughput) on
> >> amd64 with not very old CPUs, so 2-8 extra cycles for the 2 indirect
> >> return values is a lot. sincosf() still ends up being slightly faster
> >> than separate sinf()/cosf().
> >
> > Seems to be much faster when used in other functions.
>
> It's hard to be much faster than 15-20 cycles. The latency is more like
> 50 cycles, with 3 sinf()'s or cosf()'s running in parallel.
>
> sincos() is far from the best possible optimization for repeated calls on
> the same or nearby args. If sin() and cos() cached the arg reduction, then
> separate sin() and cos() on the same arg would run about as fast as sincos(),
> and repeated sin()'s on the same arg would run much faster than now.
> Caching the arg reduction may also be good when the arg changes slightly.
> However, caching is slower if the args are not close. Even a 1-entry cache
> takes a long time to look up relative to the 15-20 cycles taken by sinf()
> and cosf(). Caching is complicated by signal handlers and threads. Perhaps
> the right API is one that has to ask for caching and provides the cache storage.
> Then sincos() could be:
>
> ...
> _dh_init(x, &dh); /* prefill 1-entry cache dh */
> s = _sin_cache(x, &dh, 1); /* cache hit unless x is NaN */
> /* cache misses update dh */
> c = _cos_cache(x, &dh, 1); /* cache hit unless x is NaN */
> ...
>
> and with everything inlined this is little different from the current
> sincos() except for NaNs. NaNs can be cache hits too if you compare
> them as bits, but the comparison should probably be x == dhp->dh_x
> for a 1-entry cache, so as not to have to extract the bits of x.
When I worked on sincos() I tried a few variations. This included
the simplest implementation:
void
sincos(double x, double *s, double *c)
{
    *c = cos(x);
    *s = sin(x);
}
I tried argument reduction with kernels.
void
sincos(double x, double *s, double *c)
{
    double a;

    /* Inline argument reduction done here to set a from x. */
    *c = k_cos(a);
    *s = k_sin(a);
}
And finally the version that was committed where k_cos and k_sin
were manually inlined and re-arranged to reduce redundant computations.
I never thought about a caching mechanism. It seems to be more
complicated than it may be worth.
--
Steve