New math library from ARM

Mon Jan 14 14:35:58 UTC 2019

On Mon, 31 Dec 2018, Steve Kargl wrote:

> On Mon, Dec 31, 2018 at 07:19:04AM -0800, Steve Kargl wrote:
>> On Mon, Dec 31, 2018 at 10:06:57AM -0500, Pedro Giffuni wrote:
>>> Hi;
>>>
>>> Just noted a recent initiative from musl-libc:
>>>
>>> https://www.openwall.com/lists/musl/2018/12/08/1
>>>
>>> It appears they plan to replace their (FreeBSD) math code with a new ARM
>>> implementation:
>>>
>>> https://github.com/ARM-software/optimized-routines
>>
>> The Copyright on this code is unclear.  For example, in
>> single/e_rem_pio2.c  lines 1-6:
>>
>> /*
>>  * e_rem_pio2.c
>>  *
>>  * Copyright (c) 1999-2018, Arm Limited.
>>  * SPDX-License-Identifier: MIT
>>  */
>>
>> Then lines 16-18:
>>
>>     /*
>>      * Simple cases: all nicked from the fdlibm version for speed.
>>      */
>>
>> The original fdlibm licenses applies ot the nicked lines.

Really?  The fdlibm version is quite slow, especially for simple cases.  In
FreeBSD, most of the optimizations were written by me, especially for
"simple" cases:
- arrange not to call here for |arg| < Pi/4
- special optiimizations for |arg| < 9*PI/4 (1 case for every pair of
   quadrants.  These optimizations are fairly generic.
- special optimizations for "medium" args.  These optimizations are mostly
   to to hide the high latency of float to integer conversions on older
   i386 x86's.  Lots of functions use similar code with many ifdefs.  My
   current version has the details for this in inline functions in
   math_private.g

It is possible to do better for arches with extra precision, but this
is not done and is still too hard for i386 on FreeBSD since extra
precision is unavailable by default, and gcc and clang have different
bugs in their support for extra precision.  But all of these optimizations
work very well for newer x86's with newer compilers.  This is because newer
x86's and compilers have lower latency and/or better scheduling.  I don't
know what is best for arm.

> There is also some interesting uses of float literal constants
> with 80 digits when at most 9 are relevant.

That might be to inspire (negative) confidence :-).

Why does the arm e_rem_pio2.c even have any float constants?  The
fdlibm version has tables of small hex constants representing
high-precision FP constants.  FreeBSD moved these tables to
k_rem_pio2.c.  IIRC, these constants are loaded as FP values and the
can probably be optimized better using not float constants more
directly, or better double constants more directly.  These tables are
only used for large args, and the fdlibm code is quite slow for large
args, but FreeBSD doesn't optimize this case.  I used to think that
x86 hardware handled this case much better than the slow fdlibm code,
but on newer x86 the fdlibm code is sometimes faster than the hardware.
On older x86, the hardware was only a few times faster.

Versions optimized for extended precision would use some long double
constants in e_rem_pio2.c.  e_rem_pio2f.c already uses extra precision
by using double precision for almost everything.  However, on newer x86,
this class of optimizations is barely worth doing, since the more
generic code in e_rem_pio2.c is fast enough (the throughput is 1 cos()
per 20-30 cycles on modern x86).

Bruce