i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs

Sun Feb 20 03:03:00 PST 2005

On Mon, 14 Feb 2005, Bruce Evans wrote:

> It seems that the hardware trig functions aren't worth using.  I want
> to test them on a 486 and consider the ranges more before discarding
> them.  This may take a while.

I did this as threatened.

My test program now covers most cases of interest for i386's and
amd64's.  It divides the the range being tested into a number of
regions.  I mostly used 16, which is good for dividing the range
[-2pi, 2pi] into quadrants.  The output clearly shows that both
the hardware and fdlibm do arg reduction related to pi/4, with
different tradeoffs.  E.g., for sin(), fdlibm does very well for
the range [-pi/4, pi/4], but not so well for args outside this
range, while hardware sin does poorly for [-pi/4, pi/4] but very
well for [pi/4, 3pi/4], and similarly for all translations of these
ranges by pi.  OTOH, the hardware cos's best range is [-pi/4, pi/4]
but fdlibm cos's best range is the same as for fdlibm sin.

Some results:
- the hardware trig functions were by far the best on a 486DX2-66 for
  all ranges except near 0 where fdlibm is competetive or a little
  faster.  The relative advantage of the hardware functions decreases
  with each CPU generation.
- the hardware inverse trig functions (all based on fpatan) were by far
  the worst for all ranges, except on old machines where they are
  competitive or a little faster.
- fdlibm is quite good for exp and log too.
- an Athlon64 (cpuid says 3400 but not that fast in marketed version)
  running at 1994 MHz with slow memory has almost exactly the same
  speed for the fdlibm part of the benchmark as an AthlonXP Barton
  2600 overclocked to 2293 MHz with not-so-slow memory.  Using SSE2
  apparently makes just enough difference to compensate for the
  1994/2223 clock speed ratio.  There is apparently no similar
  difference for using the A64 FPU so using it is further away from
  being an optimization on A64's.

Test control script:

%%%
set -e
arch=`uname -p`
if [ $arch = i386 ]; then arch=i387; fi
msun=/usr/src/lib/msun
CFLAGS="-march=athlon-xp -fomit-frame-pointer -O2 -g -I$msun/src"
LDFLAGS="-lm -static"
niter=100000000.0
stdfiles="$msun/src/e_rem_pio2.c $msun/src/k_cos.c $msun/src/k_sin.c \
    $msun/src/k_tan.c"

while :
do
	read functype func llimit limit nregion basename ename
	if [ -z $func ]; then exit 0; fi
	cfile=$msun/src/$basename.c
	sfile=$msun/$arch/$basename.S
	if [ -f $sfile -o $func = cos -o $func = cosf \
	    -o $func = sin -o $func = sinf ]
	then
		myfunc=asm$func
		COPTS="-DFUNCTYPE=$functype -DFUNC=$myfunc \
		    -DLLIMIT=$llimit -DLIMIT=$limit -DNREGION=$nregion \
		    -DNITER=$niter"
		if [ -f $sfile ]
		then
			sed -e "s/^ENTRY($func)/ENTRY($myfunc)/" \
			    -e "s/^ENTRY($ename)/ENTRY($myfunc)/" <$sfile >x.S
			cc $CFLAGS $COPTS -o z z.c x.S $stdfiles $LDFLAGS
		else
			cc $CFLAGS $COPTS -o z z.c $LDFLAGS
		fi
		time ./z
	fi
	myfunc=fdl$func
	COPTS="-DFUNCTYPE=$functype -DFUNC=$myfunc \
	    -DLLIMIT=$llimit -DLIMIT=$limit -DNREGION=$nregion \
	    -DNITER=$niter"
	sed -e s/^$ename/$myfunc/ \
	    -e s/__generic_$ename/$myfunc/ <$cfile >x.c
	cc $CFLAGS $COPTS -o z z.c x.c $stdfiles $LDFLAGS
	time ./z
done
%%%

Run this using something like "sh t <tdata >to.machine".  (First edit
the CFLAGS line to change -march to something appropriate, and the
niter line to make the test run fast enough -- I used 1e8 on Athlons
down to 1e6 on 486DX2-66 so that each function takes about 10 seconds.)

Test data:

%%%
double acos -1.0 1.0 16 e_acos __ieee754_acos
double asin -1.0 1.0 16 e_asin __ieee754_asin
double atan -8.0 8.0 16 s_atan atan
double cos -6.28 6.28 16 s_cos cos
double exp -8.0 8.0 16 e_exp __ieee754_exp
double log 0.0 1e32 16 e_log __ieee754_log
double logb 0.0 1e32 16 s_logb logb
double log10 0.0 1e32 16 e_log10 __ieee754_log10
double sin -6.28 6.28 16 s_sin sin
double tan -6.28 6.28 16 s_tan tan
float cosf -6.28 6.28 16 s_cosf cosf
float logf 0.0 1e32 16 e_logf __ieee754_logf
float logbf 0.0 1e32 16 s_logbf logbf
float log10f 0.0 1e32 16 e_log10f __ieee754_log10f
float sinf -6.28 6.28 16 s_sinf sinf
float tanf -6.28 6.28 16 s_tanf tanf

# The following aren't actually implemented in asm (but e_atan2f is; it is
# probably just as bad as e_atanf would be, but is harder to test).
float acosf -1.0 1.0 16 e_acosf __ieee754_acosf
float asinf -1.0 1.0 16 e_asinf __ieee754_asinf
float atanf -8.0 8.0 16 s_atanf atanf
float expf -8.0 8.0 16 e_expf __ieee754_expf
%%%

Test program:

%%%
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

#ifdef HAVE_RDTSC
#include <machine/cpufunc.h>
#endif

#include <math.h>
#include <stdio.h>

#ifndef FUNC
#define	FUNC	sin
#endif
#define	FUNCNAME __XSTRING(FUNC)
#ifndef FUNCTYPE
#define	FUNCTYPE double
#endif
#ifndef LIMIT
#define	LIMIT	(3.14159 / 4)
#endif
#ifndef LLIMIT
#define	LLIMIT	0.0
#endif
#ifndef NITER
#define NITER	10000000.0
#endif
#ifndef NREGION
#define NREGION	16
#endif

#ifdef __amd64__
/* Generate some asm versions since there are none in libm. */

double asmcos(double);
double asmsin(double);
float asmcosf(float);
float asmsinf(float);

asm("asmcos: movsd %xmm0, -8(%rsp); fldl -8(%rsp); fcos; "
    "fstpl -8(%rsp); movsd -8(%rsp),%xmm0; ret");
asm("asmsin: movsd %xmm0, -8(%rsp); fldl -8(%rsp); fsin; "
    "fstpl -8(%rsp); movsd -8(%rsp),%xmm0; ret");
asm("asmcosf: movss %xmm0, -4(%rsp); flds -4(%rsp); fcos; "
    "fstps -4(%rsp); movss -4(%rsp),%xmm0; ret");
asm("asmsinf: movss %xmm0, -4(%rsp); flds -4(%rsp); fsin; "
    "fstps -4(%rsp); movss -4(%rsp),%xmm0; ret");
#endif /* __amd64__ */

FUNCTYPE FUNC(FUNCTYPE);

int
main(void)
{
	struct rusage finish, start;
	double d, limit, llimit, step;
	long long tot[NREGION], usec[NREGION], x, y;
	int i;

	step = (LIMIT - LLIMIT) / NITER;
	for (i = 0; i < NREGION; i++) {
		llimit = LLIMIT + i * (LIMIT - LLIMIT) / NREGION;
		limit = LLIMIT + (i + 1) * (LIMIT - LLIMIT) / NREGION;
		tot[i] = 0;
		getrusage(RUSAGE_SELF, &start);
#ifdef HAVE_RDTSC
		x = rdtsc();
#endif
		for (d = llimit; d < limit; d += step)
			FUNC(d);
#ifdef HAVE_RDTSC
		y = rdtsc();
#endif
		getrusage(RUSAGE_SELF, &finish);
		usec[i] = 1000000 *
		    (finish.ru_utime.tv_sec - start.ru_utime.tv_sec +
		    finish.ru_stime.tv_sec - start.ru_stime.tv_sec) +
		    finish.ru_utime.tv_usec - start.ru_utime.tv_usec +
		    finish.ru_stime.tv_usec - start.ru_stime.tv_usec;
		tot[i] = y - x;
	}
	printf("%s:", FUNCNAME);
#ifdef HAVE_RDTSC
	printf(" cycles:nsec per call:");
	for (i = 0; i < NREGION; i++)
		printf(" %lld:%lld",
		    tot[i] / (long long)(NITER / NREGION),
		    1000 * usec[i] / (long long)(NITER / NREGION));
#else
	printf(" nsec per call: ");
	for (i = 0; i < NREGION; i++)
		printf(" %lld", 1000 * usec[i] / (long long)(NITER / NREGION));
#endif
	printf("\n");
	return (0);
}
%%%

Sample output:

%%%
to.486dx2-66:
asmasin: nsec per call:  7569 7650 7742 7752 7665 7662 7584 7721 7409 7573 7645 7651 7723 7775 7686 7603
fdlasin: nsec per call:  13452 13910 13967 13979 8000 7997 7995 8178 7872 7873 7994 7995 13853 13844 13808 13323
asmsin: nsec per call:  5697 5788 5821 5642 5684 5774 5534 5557 5417 5788 5763 5558 5650 5828 5782 5701
fdlsin: nsec per call:  11888 11906 11902 11736 11866 9405 9414 5377 5372 9250 9247 11320 11303 11435 11417 11326

to.k6-233:
asmasin: nsec per call:  1051 1076 1073 1072 1054 1041 1042 1042 1038 1037 1037 1050 1067 1069 1072 1046
fdlasin: nsec per call:  1518 1594 1595 1594 819 819 818 818 818 819 819 819 1591 1590 1593 1514
asmsin: nsec per call:  513 553 543 527 518 540 521 464 464 521 540 518 527 543 553 513
fdlsin: nsec per call:  1466 1495 1493 1475 1474 1034 1034 647 647 1017 1017 1418 1419 1459 1459 1410

to.axpb-2223:
asmasin: nsec per call:  96 96 94 93 93 93 93 93 93 93 93 93 93 94 96 96
fdlasin: nsec per call:  77 81 81 81 36 35 34 34 35 35 35 35 81 81 81 77
asmsin: nsec per call:  60 32 33 59 59 32 32 57 56 32 32 58 58 33 32 59
fdlsin: nsec per call:  84 91 91 85 85 69 69 32 32 68 67 79 79 86 86 79

to.a64-1994:
asmsin: nsec per call:  61 37 37 60 61 36 37 58 58 37 36 61 60 37 37 61
fdlsin: nsec per call:  73 75 75 75 75 52 52 21 21 51 51 71 71 70 70 67
%%%

I would adjust the following due to these results:
- delete all trig i387 float functions.  Think about deleting the exp and
  log i387 float functions.  Think about optimizing the fdlibm versions.
  They could use double or extended precision, and then they might not
  need range reduction for a much larger range (hopefully [-2pi, 2pi])
  and/or might not need a correction term for a much larger range.
- delete all inverse trig i387 functions.
- think about optimizing the trig fdlibm double functions until they are
  faster than the trig i387 double functions on a larger range than
  [-pi/4, pi/4].  They could use extended precision, but only only on
  some arches so there would be a negative reduction in complications
  for replacing the MD functions by "MI" ones optimized in this way.
  Does the polynomial approximation for sin start to fail between pi/4
  and pi/2, or are there other technical rasons to reduce to the current
  ranges?

Bruce