Shorter releng/12.0/lib/msun/i387/e_exp.S and releng/12.0/lib/msun/i387/s_finite.S

Bruce Evans brde at optusnet.com.au
Tue Sep 10 15:19:37 UTC 2019


On Sun, 8 Sep 2019, Stefan Kanthak wrote:

I recently got diagnosed as having serious medical problems and am not sure
if I care about this...

> here's a patch to remove a conditional branch (and more) from
> http://sources.freebsd.org/releng/12.0/lib/msun/i387/e_exp.S
> plus a patch to shave some bytes (immediate operands) from
> http://sources.freebsd.org/releng/12.0/lib/msun/i387/s_finite.S

Anyway, don't bother with these functions.  They should never have
been written in asm and should go away.

Improving the mod and remainder functions is more useful and difficult
since they are in asm on amd64 too and there seems to be no better way
to implement them on all x86 than to use the i387, but they are still
slow.
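
For reference, a minimal sketch (my own illustration, not the committed msun
code; the name fmod_i387() is made up) of what those asm implementations boil
down to: loop the i387's fprem until the C2 status bit reports that the
reduction is complete.

#include <stdint.h>

static double
fmod_i387(double x, double y)
{
        long double r = x, d = y;       /* keep any spills at full 387 width */
        uint16_t sw;

        do {
                /* st(0) = partial remainder of st(0) by st(1); fetch the
                 * status word; C2 set means another iteration is needed. */
                __asm__("fprem; fnstsw %%ax"
                    : "+t" (r), "=a" (sw)
                    : "u" (d));
        } while (sw & 0x0400);          /* C2 is bit 10 */
        return ((double)r);
}

Even written this way it is slow: fprem itself is slow and the loop is
data-dependent, which is the point above.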

> --- -/releng/12.0/lib/msun/i387/e_exp.S
> +++ +/releng/12.0/lib/msun/i387/e_exp.S

This went away in my version in 2012 or 2013 together with implementing
the long double hyperbolic functions.  My version uses the same algorithm
in all precisions for the hyperbolic functions, but only the long double
version was committed (in 2013).  The uncommitted parts are faster and
more accurate.  The same methods work relatively trivially for exp() and
expf(), except they are insignificantly faster than a better C version
after improving the accuracy of that to be slightly worse than the asm
version.  I gave up on plans to use the same algorithm in all precisions
for exp*().  The long double version is too sophisticated to be fast,
after developments in x86 CPUs and compilers made the old Sun C versions
fast.

Summary of implementations of exp*() on x86:
- expf(): use the same C version on amd64 and i386 (Cygnus translation of
   Sun version with some FreeBSD optimizations).  This is fast and is
   currently a little less accurate than it should be.
- exp(): use the C version on amd64 (Sun version with some FreeBSD
   optimizations).  This is fast and is currently a little less accurate than
   it should be.  Use the asm version on i386.  This is slow since it switches
   the rounding precision (see the sketch after this list).  It needs the 11
   extra bits of precision to barely deliver a double precision result to
   within 1 ulp.
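
For concreteness, a sketch (in C with inline asm; not the e_exp.S code, and
the function names are made up) of what switching the rounding precision
amounts to: force the x87 precision-control field to 64 bits for the
computation and restore the old control word afterwards.

#include <stdint.h>

/* Set the x87 precision-control field (bits 8-9) to 11b = 64-bit and
 * return the previous control word so the caller can restore it. */
static uint16_t
x87_set_64bit_precision(void)
{
        uint16_t cw, newcw;

        __asm__ __volatile__("fnstcw %0" : "=m" (cw));
        newcw = cw | 0x0300;
        __asm__ __volatile__("fldcw %0" : : "m" (newcw));
        return (cw);
}

static void
x87_restore_precision(uint16_t cw)
{
        __asm__ __volatile__("fldcw %0" : : "m" (cw));
}

The save/modify/restore pair is part of why the i386 asm version is slow.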

> @@ -45,7 +45,25 @@
>         movl    8(%esp),%eax
> -        andl    $0x7fffffff,%eax
> -        cmpl    $0x7ff00000,%eax
> -        jae     x_Inf_or_NaN
> +        leal    (%eax+%eax),%edx
> +        cmpl    $0xffe00000,%edx

This removes 1 instruction and 1 dependency, not a branch. Seems reasonable.
I would try to do it all in %eax.  Check what compilers do for the C version
of finite() where this check is clearer and easier to optimize (see below).
All of this can be written in C with about 1 line of inline asm, and then
compilers can generate better code.
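
A minimal C sketch of that check (using a union instead of msun's access
macros; the helper name is made up).  Written like this, the compiler is free
to choose between the andl/cmpl form and the addl/cmpl form itself:

#include <stdint.h>

static int
exp_arg_is_special(double x)
{
        union { double d; uint64_t u; } bits = { .d = x };

        /* sign-masked high word >= 0x7ff00000: x is +-Inf or a NaN */
        return (((uint32_t)(bits.u >> 32) & 0x7fffffff) >= 0x7ff00000);
}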

> +        jb      finite

This seems to pessimize the branch logic in all cases (as would be done in
C by getting __predict_mumble() backwards).

The branches were carefully optimized (hopefully not backwards) for the i386
and i486 and this happens to be best for later CPUs too.  Taken branches are
slower on old CPUs, so the code was arranged to not branch in the usual
(finite) case.  Newer CPUs only use static branch prediction for the first
branch, so the branch organization rarely matters except in large code (not
like here) where moving the unusual case far away is good for caching.  The
static prediction is usually that the first forward branch is not taken
while the first backward branch is taken.  So the forward branch to the
non-finite case was accidentally correct.
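
A sketch of that arrangement in C, reusing exp_arg_is_special() from the
sketch above and FreeBSD's __predict_false() from <sys/cdefs.h>; the wrapper
name is made up and the finite-case kernel is only a stand-in.  The finite
case falls straight through and the rare case is a forward branch:

#include <sys/cdefs.h>
#include <math.h>

static double
exp_layout_sketch(double x)
{
        if (__predict_false(exp_arg_is_special(x))) {
                /* -Inf -> +0; +Inf -> +Inf; NaN -> quieted NaN */
                return (x == -INFINITY ? 0.0 : x + x);
        }
        return (exp(x));        /* stand-in for the real finite-case kernel */
}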

>
> +        /*
> +         * Return 0 if x is -Inf.  Otherwise just return x; when x is Inf
> +         * this gives Inf, and when x is a NaN this gives the same result
> +         * as (x + x) (x quieted).
> +         */
> +        cmpl    4(%esp),$0
> +        sbbl    $0xfff00000,%eax
> +        je      minus_inf
> +
> +nan:
>         fldl    4(%esp)
> +        ret
>
> +minus_inf:
> +        fldz
> +        ret
> +
> +finite:
> +        fldl    4(%esp)
> +
> @@ -80,19 +98,3 @@
>         ret
> -
> -x_Inf_or_NaN:
> -        /*
> -         * Return 0 if x is -Inf.  Otherwise just return x; when x is Inf
> -         * this gives Inf, and when x is a NaN this gives the same result
> -         * as (x + x) (x quieted).
> -         */
> -        cmpl    $0xfff00000,8(%esp)
> -        jne     x_not_minus_Inf
> -        cmpl    $0,4(%esp)
> -        jne     x_not_minus_Inf
> -        fldz
> -        ret
> -
> -x_not_minus_Inf:
> -        fldl    4(%esp)
> -        ret

Details not checked.  Space/time efficiency doesn't matter in the non-finite
case.  But see s_expl.c where the magic expression (-1 / x) is used for the
return value to optimize for space (it avoids branches but the division is
slow).
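
For the record, why (-1 / x) works as a branch-free return value there (a
sketch of the arithmetic, not the s_expl.c code; the helper name is made up):
for x = -Inf the quotient is +0, the correct limit of exp, and for x a NaN it
is a quiet NaN, the required propagation, so one expression covers both at
the cost of one slow division on a rare path.

static double
exp_negative_special(double x)  /* x is assumed to be -Inf or a NaN */
{
        return (-1 / x);
}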

> END(exp)
>
> --- -/releng/12.0/lib/msun/i387/s_finite.S
> +++ +/releng/12.0/lib/msun/i387/s_finite.S

This function has several layers of reasons to not exist.  It seems to be
only a Sun extension to C90.  It is not declared in <math.h>, but exists
in libm as namespace pollution to support old ABIs.  C99 has the better
API isfinite() which is type-generic.  I thought that this was usually
inlined.  Actually, it seems to be implemented by calling __isfinite(),
and not this finite().  libm also has finite() in C.  Not inlining this
and/or having no way to know if it is efficiently inlined makes it unusable
in optimized code.

> @@ -39,8 +39,8 @@
> ENTRY(finite)
>         movl    8(%esp),%eax
> -        andl    $0x7ff00000, %eax
> -        cmpl    $0x7ff00000, %eax
> +        addl    %eax, %eax
> +        cmpl    $0xffe00000, %eax

This doesn't reduce the number of instructions or dependencies, so it is
less worth doing than similar changes above.

>         setneb  %al

This is now broken, since setneb is only correct after masking out the
unimportant bits: after the addl, a NaN with nonzero high mantissa bits
compares unequal to 0xffe00000 and is misclassified as finite.

> -        andl    $0x000000ff, %eax
> +        movzbl  %al, %eax
>         ret

Old bug: the extra instructions to avoid the branch might be a pessimization
on all CPUs:
- perhaps cmov is best on newer CPUs, but it is unportable
- the extra instructions and possibly movz instead of and are slower on
   old CPUs, while branch prediction is fast for the usual case on newer
   CPUs.

> END(finite)

Check what compilers generate for the C versions of finite() and
__isfinite() with -fomit-frame-pointer -march=mumble (especially i386)
and __predict_mumble().  The best code (better than the above) is for
finite().  Oops, it is only gcc-4.2.1 that generates very bad code for
__isfinite().  s_finite.c uses masks and compilers don't reorganize
this much.  s_isfinite.c uses hard-coded bit-fields which some compilers
don't optimize very well.  Neither does the above, or the standard
access macros using bit-fields -- they tend to produce store-to-load
mismatches.
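
A sketch of the two C styles being compared (not copied from msun; the
bit-field layout shown is an assumption for little-endian i386): the mask
form used by s_finite.c-style code, which compilers lower to sequences like
the andl/cmpl or addl/cmpl above, and the bit-field form used by
s_isfinite.c-style code.

#include <stdint.h>

static int
finite_mask(double x)           /* s_finite.c style: masks */
{
        union { double d; uint64_t u; } bits = { .d = x };

        return (((uint32_t)(bits.u >> 32) & 0x7ff00000) != 0x7ff00000);
}

struct dbl_bits {               /* assumed little-endian i386 layout */
        uint64_t man : 52;
        uint64_t exp : 11;
        uint64_t sign : 1;
};

static int
finite_bitfield(double x)       /* s_isfinite.c style: bit-fields */
{
        union { double d; struct dbl_bits b; } u = { .d = x };

        return (u.b.exp != 0x7ff);
}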

Well, I finally found where this is inlined.  Use __builtin_isfinite()
instead of isfinite().  Then gcc generates a libcall to __builtin_isfinite(),
while clang generates inline code which is much larger and often slower
than any of the above, but it at least avoids store-to-load mismatches
and doesn't misclassify long doubles in unsupported formats as finite when
they are actually NaNs.  It also generates exceptions for signaling NaNs in
some cases, which is arguably wrong.
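
Usage sketch of that spelling (the wrapper name is made up; what each
compiler does with it is as described above, not something the sketch can
guarantee):

static int
isfinite_via_builtin(double x)
{
        return (__builtin_isfinite(x));         /* no <math.h> needed */
}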

The fpclassify and isfinite, etc., macros in <math.h> are already too
complicated but not nearly complicated enough to decide if the corresponding
builtins should be used.

Bruce

