i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs

Sun Feb 20 14:52:07 PST 2005

DONE:
- Remove i387 float trig functions.
- Remove i387 acos() and asin().

TODO:
- Figure out what to do about logf(), logbf(), and log10f().
- Figure out what to do about atan(), atan2(), and atan2f().
- Figure out what to do with sin(), cos(), and tan().
  (I suggest removing the i387 double trig functions now and worrying
  about the speed of fdlibm routines later.)

On Sun, Feb 20, 2005, Bruce Evans wrote:
> I would adjust the following due to these results:
>   Think about deleting the exp and
>   log i387 float functions.

I didn't add NetBSD's e_expf.S in the first place because my tests
showed that it was slower.  :-P  As for log{,b,10}f, your tests show
that the asm versions are faster on my Pentium 4:

asmlogf: nsec per call:  40 41 40 40 40 40 40 40 40 40 40 40 40 40 40 40
fdllogf: nsec per call:  76 77 77 78 76 78 78 78 77 75 78 78 78 78 78 78
asmlogbf: nsec per call:  12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
fdllogbf: nsec per call:  18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
asmlog10f: nsec per call:  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
fdllog10f: nsec per call:  80 80 71 88 71 71 95 84 71 71 71 71 72 112 96 71

> - delete all inverse trig i387 functions.

This is a clear win for asin() and acos().  It's not so clear for
atan() or atan2():

asmatan: nsec per call:  68 68 68 68 68 68 68 69 69 68 68 68 68 68 68 68
fdlatan: nsec per call:  92 92 92 92 92 94 97 70 70 97 95 92 92 92 92 92
fdlatanf: nsec per call:  70 70 70 70 70 71 72 58 58 72 70 69 70 75 71 69

This is for the same Pentium 4 as above.  Do you get different
results for a saner processor, like an Athlon?  IIRC, atan2f() was
faster in assembly according to my naive tests, or I wouldn't have
imported it.  I don't remember what inputs I tried or why I left out
atanf().

>   Think about optimizing the fdlibm versions.
>   They could use double or extended precision, and then they might not
>   need range reduction for a much larger range (hopefully [-2pi, 2pi])
>   and/or might not need a correction term for a much larger range.
[...]
> - think about optimizing the trig fdlibm double functions until they are
>   faster than the trig i387 double functions on a larger range than
>   [-pi/4, pi/4].  They could use extended precision, but only only on
>   some arches so there would be a negative reduction in complications
>   for replacing the MD functions by "MI" ones optimized in this way.
>   Does the polynomial approximation for sin start to fail between pi/4
>   and pi/2, or are there other technical rasons to reduce to the current
>   ranges?

Yeah, the float versions of just about everything would benefit
from storing the extra bits in a double.  Replacing two split
doubles with a long double is harder, as you mention; it works
(a) never, for 64-bit long doubles, (b) sometimes, for 80-bit long
doubles, and (c) always, for 128-bit long doubles.  Some of the
lookup tables for the float routines are a bit braindead, too.

It's impossible to use a polynomial approximation on [0,pi/2] or a
larger range for tan(), since tan() grows faster than any
polynomial as it approaches pi/2.  There may be a rational
approximation that works well, but I doubt it.  It is possible to
find polynomial approximations for sin() and cos() on [0,pi/2],
but assuming that Dr. Ng knows what he's doing (a reasonable
assumption), the degree of any such polynomial would likely be
very large.

By the way, Dr. Ng's paper on argument reduction mentions that the
not-so-freely-distributable libm uses table lookup for medium size
arguments that would otherwise require reduction:
http://www.ucbtest.org/arg.ps

So in summary, there's a lot of low-hanging fruit with the float
routines, but I think that doing a better job for double requires
multiple versions that use the appropriate kind of extended
precision, table lookup, or consulting a numerical analyst.

By the way, the CEPHES library (netlib/cephes or
http://www.moshier.net/) has different versions of many of these
routines.  The trig functions are also approximated on [0,pi/4],
but accurate argument reduction is not used.  I have the licensing
issues worked out with the author and core@ if we want to use any
of these.  However, my experience with exp2() and expl() from
CEPHES showed that there are some significant inaccuracies, places
where the approximating polynomial can overflow, etc.