i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs

Bruce Evans bde at zeta.org.au
Wed Feb 9 07:17:51 PST 2005


On Wed, 9 Feb 2005, David Schultz wrote:

> I ran some careful performance comparisons between the version of
> i387 tan() I posted earlier and the fdlibm tan().  Executive
> summary: the fdlibm tan() is faster for virtually all inputs on a
> Pentium 4.  Pentium 3s seem to have lower-latency FPUs, but fdlibm
> still beats the fptan instruction for the important cases where
> fptan actually gets the right answer.

I did some not-so-careful comparisons and found:
- hardware sin is about twice as fast as fdlibm sin on athlonxp
- hardware sin is about the same speed as fdlibm sin on athlon64.  The
  absolute speed is about the same as on athlonxp with a similar CPU
  clock (athlon64 apparently speeds up fdlibm but not hardware sin)
- using float precision didn't make much difference (it was slightly
  slower IIRC).
I used a uniform distribution with ranges [0..10] and [0..1000], and
e_rem_pio2f.c was fixed to use the double version on athlonxp but not
on athlon64.

I think newer CPUs are more likely to optimize simple instructions better
relative to transcendental functions.  SSE2 doesn't help for fsin, and
using fsin on athlonxp is slower than ever because the registers have to
be moved from xmm to i387 via memory.  But perhaps there are separate
ALUs that help more in real applications.  fdlibm probably works better
in benchmarks than in real applications because its code and tables stay
cached.

> I used the following sets
> of inputs:
>
> tbl1: small numbers
> ...
> tbl2: numbers on [-8pi,8pi] greater in magnitude than 2^-18
> ...
> tbl3: large numbers
> ...
> tbl4: special cases

This data may be too unusual.  Maybe the NaNs are slower.  Denormals
would probably be slower.

> The results below are divided into four columns.  The first is the
> average number of clock cycles taken by the fdlibm tan() for the
> corresponding table input above on a Pentium 4, the second is the
> clock cycles for the assembly tan(), the third is the difference,
> and the fourth is the percentage difference relative to column 1.
>
> das at VARK:/home/t/freebsd> paste perf1 perf1md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
> 1259.000000     1697.000000     438.000000      +35%
> ...
> das at VARK:/home/t/freebsd> paste perf2 perf2md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
> 2018.000000     1985.000000     -33.000000      -2%
> ...
> das at VARK:/home/t/freebsd> paste perf3 perf3md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
> 5737.000000     6078.000000     341.000000      +6%
> ...
> das at VARK:/home/t/freebsd> paste perf4 perf4md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
> 4726.000000     3234.000000     -1492.000000    -32%
> ...
>
> (P.S.: Oops, forgot to compile s_sin.c with -O.)

I get the following for the range [0..10] step 0.0000001 on athlonxp:

    257 fdlibm sin(double) (msun src)
    128 fsin(double) (libc obj)
    107 sinf(double) (inline asm src)
    151 ftan(double) (libc obj)

In case I messed up the scaling, this translates to 50-120 nsec/call
(TSC freq 2223MHz).  The execution latency for fsin is 96-192 cycles
according to the athlon32 optimization manual, so 107-128 seems about
right.
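
As a rough check of that scaling, here is a minimal sketch of the
cycles-to-nanoseconds conversion.  The function names are hypothetical
and not part of the original test program; the only inputs taken from
the text are the four cycle counts and the 2223MHz TSC frequency.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a cycle count to nanoseconds for a TSC running at tsc_mhz MHz. */
static double
cycles_to_nsec(double cycles, double tsc_mhz)
{
	assert(tsc_mhz > 0.0);
	/* One cycle lasts 1000/tsc_mhz nanoseconds. */
	return (cycles * 1000.0 / tsc_mhz);
}

/* Print the conversion for the four cycle counts listed above. */
static void
print_nsec_per_call(void)
{
	static const double cycles[] = { 257.0, 128.0, 107.0, 151.0 };
	size_t i;

	for (i = 0; i < sizeof(cycles) / sizeof(cycles[0]); i++)
		printf("%4.0f cycles = %6.1f nsec\n", cycles[i],
		    cycles_to_nsec(cycles[i], 2223.0));
}
```

At 2223MHz this works out to roughly 48 nsec for 107 cycles and 116 nsec
for 257 cycles, which is in line with the 50-120 nsec/call quoted above.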

> I also ran the first three tests on freefall (Pentium III, using
> the old reduction code), and got results that aren't as favorable
> for the fdlibm version:
>
> das at freefall:~> paste perf1 perf1md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
> 1384.000000     442.000000      -942.000000     -68%
> 584.000000      440.000000      -144.000000     -25%
> ...
> das at freefall:~> paste perf2 perf2md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
> 639.000000      656.000000      17.000000       +3%
> ...
> das at freefall:~> paste perf3 perf3md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
> 5751.000000     1918.000000     -3833.000000    -67%
> ...

Freefall is surprisingly underpowered :-).  I get similar cycle counts on it:

    232 fdlibm sin(double) (msun src)
    121 fsin(double) (libc obj)
    112 sinf(double) (inline asm src)
    178 tan(double) (libc obj (fdlibm))

My test loop (1/10 as long as this for freefall):

%%%
	double d;
	...
	x = rdtsc();
	for (d = 0; d < 10.0; d += 0.0000001)
		tan(d);
	y = rdtsc();
%%%
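
For anyone who wants to repeat this, a self-contained variant of the
loop might look like the sketch below.  It is not the original test
program: read_ticks(), ticks_per_call() and standin_func() are
hypothetical names, the result is stored to a volatile so that an
optimizing compiler cannot drop the call, and on non-x86 machines the
counter falls back to clock_gettime() (so the unit becomes nanoseconds
rather than cycles).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read a tick counter: the TSC on x86, nanoseconds elsewhere. */
static uint64_t
read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
	uint32_t lo, hi;

	/* Plain rdtsc with no serializing instruction in front of it. */
	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return (((uint64_t)hi << 32) | lo);
#else
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((uint64_t)ts.tv_sec * UINT64_C(1000000000) +
	    (uint64_t)ts.tv_nsec);
#endif
}

/* Hypothetical cheap stand-in so the sketch builds without libm. */
static double
standin_func(double x)
{
	return (x + x * x * x / 3.0);
}

/*
 * Average ticks per call of fn over [lo, hi) in steps of step.
 * The counter is read once before and once after the whole loop.
 */
static double
ticks_per_call(double (*fn)(double), double lo, double hi, double step)
{
	volatile double sink = 0.0;
	uint64_t start, end, calls = 0;
	double d;

	assert(step > 0.0 && hi > lo);
	start = read_ticks();
	for (d = lo; d < hi; d += step) {
		sink = fn(d);	/* keep the result live */
		calls++;
	}
	end = read_ticks();
	(void)sink;
	return ((double)(end - start) / (double)calls);
}
```

Passing tan instead of standin_func (and linking with -lm) gives the
equivalent of the loop above, e.g. ticks_per_call(tan, 0.0, 10.0,
0.0000001).  The average includes the loop and store overhead.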

> Here, fdlibm usually wins for tbl2, which is the most important
> class of inputs.  It is slower for the two inputs in tbl2 that are
> close to multiples of 2pi and for large inputs, but in all
> fairness, the i387 gets the wrong answer in those cases---hence,
> this PR.  The i387 legitimately beats fdlibm for the small inputs,
> for which tan(x) == x, so a special case for those earlier in
> fdlibm would probably be beneficial.

Special inputs take much longer according to your tests, but I hope
thousands of cycles is not the usual case.

> Conclusion: We should toss out the assembly versions of tan() and
> tanf(), and possibly special-case small inputs in fdlibm tan().

> The above data was generated using the program below, executed as
> follows:
> 	./a.out < tblN | grep avg | awk '{print $2}' > perfN
> When compiling the program, it is necessary to add
> -Dfunc=tan or -Dfunc=itan.

For me, this gives numbers in between yours and mine.  I only tried
hardware tan on athlonxp, and the numbers were about 2000 for most of
tbl3, one 1000 in the middle of tbl3, and 400 for everything else.

> #define	rdtsc(rv)	__asm __volatile("xor %%ax,%%ax\n\tcpuid\n\trdtsc" \
> 					 : "=A" (*(rv)) : : "ebx", "ecx")

The synchronising cpuid here is responsible for a factor of 3 difference
for me.  Moving the rdtsc out of the loop gives the following changes
in cycle counts:

    2000 -> [944..1420]
    1000 -> 431
    400  -> 132

Each rdtsc() in the loop costs 75 cycles for tbl1, and actually using
the results costs another 120 cycles.

I think the cpuid is disturbing the timings too much.
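
One way to see how much the counter reads themselves contribute is to
time a loop that does nothing else.  The sketch below is an assumption
about how such a check could be written, not the code used for the
numbers above: read_ticks() and timer_read_cost() are hypothetical
names, the x86 path uses a bare rdtsc without the cpuid, and other
machines fall back to clock_gettime() with the result in nanoseconds.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read a tick counter: the TSC on x86, nanoseconds elsewhere. */
static uint64_t
read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return (((uint64_t)hi << 32) | lo);
#else
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((uint64_t)ts.tv_sec * UINT64_C(1000000000) +
	    (uint64_t)ts.tv_nsec);
#endif
}

/* Average cost in ticks of one counter read, over n back-to-back reads. */
static double
timer_read_cost(unsigned n)
{
	volatile uint64_t sink = 0;
	uint64_t start, end;
	unsigned i;

	assert(n > 0);
	start = read_ticks();
	for (i = 0; i < n; i++)
		sink = read_ticks();	/* store so the read is not dropped */
	end = read_ticks();
	(void)sink;
	return ((double)(end - start) / (double)n);
}
```

If the figure from timer_read_cost() is a large fraction of the
per-call time being measured, reading the counter once around the whole
loop is the safer choice.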

Bruce

