i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs

Wed Feb 9 07:20:27 PST 2005

The following reply was made to PR i386/67469; it has been noted by GNATS.

From: Bruce Evans <bde at zeta.org.au>
To: David Schultz <das at freebsd.org>
Cc: FreeBSD-gnats-submit at freebsd.org, freebsd-i386 at freebsd.org,
	bde at freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
Date: Thu, 10 Feb 2005 02:17:45 +1100 (EST)

 On Wed, 9 Feb 2005, David Schultz wrote:

 > I ran some careful performance comparisons between the version of
 > i387 tan() I posted earlier and the fdlibm tan().  Executive
 > summary: the fdlibm tan() is faster for virtually all inputs on a
 > Pentium 4.  Pentium 3s seem to have lower-latency FPUs, but fdlibm
 > still beats the fptan instruction for the important cases where
 > fptan actually gets the right answer.

 I did some not-so-careful comparisons and found:
 - hardware sin is about twice as fast as fdlibm sin on athlonxp
 - hardware sin is about the same speed as fdlibm sin on athlon64.  The
   absolute speed is about the same as on athlonxp with a similar CPU
   clock (athlon64 apparently speeds up fdlibm but not hardware sin)
 - using float precision didn't make much difference (it was slightly
   slower IIRC).
 I used a uniform distribution with ranges [0..10] and [0..1000], and
 e_rem_pio2f.c was fixed to use the double version on athlonxp but not
 on athlon64.

 I think newer CPUs are more likely to optimize simple instructions better
 relative to transcendental functions.  SSE2 doesn't help for fsin, and
 using fsin on athlonxp is slower than ever because the registers have to
 be moved from xmm to i387 via memory.  But perhaps there are separate
 ALUs that help more in real applications.  fdlibm probably works better
 in benchmarks than in real applications because its code and tables stay
 cached.

 > I used the following sets
 > of inputs:
 >
 > tbl1: small numbers
 > ...
 > tbl2: numbers on [-8pi,8pi] greater in magnitude than 2^-18
 > ...
 > tbl3: large numbers
 > ...
 > tbl4: special cases

 This data may be too unusual.  Maybe the NaNs are slower.  Denormals
 would probably be slower.

 > The results below are divided into four columns.  The first is the
 > average number of clock cycles taken by the fdlibm tan() for the
 > corresponding table input above on a Pentium 4, the second is the
 > clock cycles for the assembly tan(), the third is the difference,
 > and the fourth is the percentage difference relative to column 1.
 >
 > das at VARK:/home/t/freebsd> paste perf1 perf1md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 1259.000000     1697.000000     438.000000      +35%
 > ...
 > das at VARK:/home/t/freebsd> paste perf2 perf2md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 2018.000000     1985.000000     -33.000000      -2%
 > ...
 > das at VARK:/home/t/freebsd> paste perf3 perf3md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 5737.000000     6078.000000     341.000000      +6%
 > ...
 > das at VARK:/home/t/freebsd> paste perf4 perf4md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 4726.000000     3234.000000     -1492.000000    -32%
 > ...
 >
 > (P.S.: Oops, forgot to compile s_sin.c with -O.)

 I get the following for the range [0..10] step 0.0000001 on athlonxp:

     257 fdlibm sin(double) (msun src)
     128 fsin(double) (libc obj)
     107 sinf(double) (inline asm src)
     151 ftan(double) (libc obj)

 In case I messed up the scaling, this translates to 50-120 nsec/call
 (TSC freq 2223MHz).  The execution latency for fsin is 96-192 cycles
 according to the athlon32 optimization manual, so 107-128 seems about
 right.
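
 The scaling from cycle counts to nsec/call can be checked with a few
 lines of C.  This is only a sketch: the helper name cycles_to_nsec is
 made up for illustration, and it assumes the TSC ticks at a constant
 rate equal to the quoted frequency.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the original mail): convert a TSC cycle
 * count to nanoseconds, assuming a constant TSC frequency given in MHz.
 * One cycle at f MHz lasts 1000/f nanoseconds.
 */
static double
cycles_to_nsec(double cycles, double tsc_mhz)
{
	assert(tsc_mhz > 0.0);
	return (cycles * 1000.0 / tsc_mhz);
}
```

 With the 2223MHz figure above, cycles_to_nsec(107, 2223.0) is about 48
 and cycles_to_nsec(257, 2223.0) is about 116, which is consistent with
 the rough 50-120 nsec/call range.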

 > I also ran the first three tests on freefall (Pentium III, using
 > the old reduction code), and got results that aren't as favorable
 > for the fdlibm version:
 >
 > das at freefall:~> paste perf1 perf1md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 1384.000000     442.000000      -942.000000     -68%
 > 584.000000      440.000000      -144.000000     -25%
 > ...
 > das at freefall:~> paste perf2 perf2md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 639.000000      656.000000      17.000000       +3%
 > ...
 > das at freefall:~> paste perf3 perf3md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 5751.000000     1918.000000     -3833.000000    -67%
 > ...

 Freefall is surprisingly underpowered :-).  I get similar cycle counts on it:

     232 fdlibm sin(double) (msun src)
     121 fsin(double) (libc obj)
     112 sinf(double) (inline asm src)
     178 tan(double) (libc obj (fdlibm))

 My test loop (1/10 as long as this for freefall):

 %%%
 	double d;
 	...
 	x = rdtsc();
 	for (d = 0; d < 10.0; d += 0.0000001)
 		tan(d);
 	y = rdtsc();
 %%%

 > Here, fdlibm usually wins for tbl2, which is the most important
 > class of inputs.  It is slower for the two inputs in tbl2 that are
 > close to multiples of 2pi and for large inputs, but in all
 > fairness, the i387 gets the wrong answer in those cases---hence,
 > this PR.  The i387 legitimately beats fdlibm for the small inputs,
 > for which tan(x) == x, so a special case for those earlier in
 > fdlibm would probably be beneficial.

 Special inputs take much longer according to your tests, but I hope
 thousands of cycles is not the usual case.

 > Conclusion: We should toss out the assembly versions of tan() and
 > tanf(), and possibly special-case small inputs in fdlibm tan().

 > The above data was generated using the program below, executed as
 > follows:
 > 	./a.out < tblN | grep avg | awk '{print $2}' > perfN
 > When compiling the program, it is necessary to add
 > -Dfunc=tan or -Dfunc=itan.

 For me, this gives numbers in between yours and mine.  I only tried
 hardware tan on athlonxp, and the numbers were about 2000 for most of
 tbl3, one 1000 in the middle of tbl3, and 400 for everything else.

 > #define	rdtsc(rv)	__asm __volatile("xor %%ax,%%ax\n\tcpuid\n\trdtsc" \
 > 					 : "=A" (*(rv)) : : "ebx", "ecx")

 The synchronising cpuid here is responsible for a factor of 3 difference
 for me.  Moving the rdtsc out of the loop gives the following changes
 in cycle counts:

     2000 -> [944..1420]
     1000 -> 431
     400  -> 132

 Each rdtsc() in the loop costs 75 cycles for tbl1, and actually using
 the results costs another 120 cycles.

 I think the cpuid is disturbing the timings too much.

 Bruce