i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs

Bruce Evans bde at zeta.org.au
Sun Feb 13 06:40:15 PST 2005


The following reply was made to PR i386/67469; it has been noted by GNATS.

From: Bruce Evans <bde at zeta.org.au>
To: David Schultz <das at freebsd.org>
Cc: FreeBSD-gnats-submit at freebsd.org, freebsd-i386 at freebsd.org,
	bde at freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
Date: Mon, 14 Feb 2005 01:31:50 +1100 (EST)

 On Thu, 10 Feb 2005, David Schultz wrote:
 
 > On Thu, Feb 10, 2005, Bruce Evans wrote:
 
 > > > [tbl*]
 > >
 > > This data may be too unusual.  Maybe the NaNs are slower.  Denormals
 > > would probably be slower.
 >
 > The data in tbl2 are pretty usual, I think, and I measured all of
 > the data points independently.  But yes, NaNs are slower, as the
 > results for tbl4 indicate.
 
 It is actually the large inputs that require a lot of argument reduction
 that are slower (tbl3).
 
 > Looking back, though, I did notice that very few of my inputs in
 > tbl2 require argument reduction.  In your tests on [0..10], on the
 > other hand, 92% of the inputs require argument reduction in
 > fdlibm.  It would be interesting to see for which of your tests
 > fdlibm is faster, and for which it is slower.  One possibility is
 > that fdlibm is slower most of the time; another is that it is far
 > slower for the close-to-pi/2 cases that the i387 gets wrong, and
 > that messes up the averages.
 
 More testing of sin() on an athlon-xp shows:
 - fdlibm is faster on the range [0,pi/4-eps].  fdlibm can even be made
   almost 3 times faster than fsin on this range by inlining __kernel_sin
   and using lots of options in CFLAGS (24 nsec vs 63 nsec for inline fsin
   and 72 nsec for libc fsin, at 2.23GHz).  fdlibm doesn't need to do any
   arg reduction in this range, and the polynomial for sin() is very
   efficient (it takes less time than the function calls and logic).
 - in the range [pi/4-eps,pi/2], fdlibm does arg reduction (to convert to
   cos()) and becomes about twice as slow.  OTOH, fsin is almost twice as
   fast in this range as it is in the previous range!  Perhaps this is
   because fsin knows that its arg reduction is broken even above pi/2, so
   it can do sloppier calculations without losing significantly more
   accuracy.
 - for the ranges corresponding to larger multiples of pi/2, fsin slows
   down slowly and fdlibm slows down relatively rapidly.  This is because
   fdlibm actually does correct arg reduction for large values.
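 
 For reference, the shape of the loop I'm using is roughly the following.
 This is only a sketch, not the actual test program: the i387_sin()
 wrapper, the array setup, NITER and the [0,pi/4] range are all invented
 here, and the rdtsc's bracket the whole loop instead of each call, as
 discussed below.
 
 #include <math.h>
 #include <stdint.h>
 #include <stdio.h>
 
 #define NITER 100000
 
 static double args[NITER];
 static volatile double sink;   /* keep the calls from being optimized away */
 
 /* i386-only: read the TSC into edx:eax. */
 static inline uint64_t
 rdtsc(void)
 {
     uint64_t tsc;
 
     __asm__ __volatile__("rdtsc" : "=A" (tsc));
     return (tsc);
 }
 
 /* Raw hardware sine; no arg reduction beyond what fsin itself does. */
 static inline double
 i387_sin(double x)
 {
     double y;
 
     __asm__("fsin" : "=t" (y) : "0" (x));
     return (y);
 }
 
 int
 main(void)
 {
     uint64_t t0, t1;
     int i;
 
     /* Spread the args evenly over the range of interest. */
     for (i = 0; i < NITER; i++)
         args[i] = (M_PI / 4) * i / NITER;
 
     t0 = rdtsc();
     for (i = 0; i < NITER; i++)
         sink = sin(args[i]);           /* library sin() */
     t1 = rdtsc();
     printf("sin():  %.1f cycles/call\n", (double)(t1 - t0) / NITER);
 
     t0 = rdtsc();
     for (i = 0; i < NITER; i++)
         sink = i387_sin(args[i]);      /* inline fsin */
     t1 = rdtsc();
     printf("fsin:   %.1f cycles/call\n", (double)(t1 - t0) / NITER);
     return (0);
 }
 
 The wrapper only times the raw instruction; it makes no attempt to handle
 the arguments for which fsin's own reduction gives wrong results (which
 is what the PR is about).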
 
 > > The synchronising cpuid here is responsible for a factor of 3 difference
 > > for me.  Moving the rdtsc out of the loop gives the following changes
 > > in cycle counts:
 > >
 > >     2000 -> [944..1420]
 > >     1000 -> 431
 > >     400  -> 132
 > >
 > > Each rdtsc() in the loop costs 75 cycles for tbl1, and actually using
 > > the results costs another 120 cycles.
 > >
 > > I think the cpuid is disturbing the timings too much.
 >
 > I don't care so much about the rdtsc overhead since I'm only
 > measuring relative performance.  A null function is measured as
 > taking 388 cycles on my Pentium 4, but some of that is due to gcc
 > getting confused by the volatile variable and generating extra
 > code at -O0.
 
 The rdtsc() overhead (cpuid + rdtsc) needs to be subtracted to get
 relative performance figures that can be compared as a ratio.  On an
 athlon-xp I get the following minimum avg cycle counts for various null
 operations:
 
 2 rdtsc's alone:                                 22
 2 rdtsc's around null function:                  31
 2 cpuid+rdtsc pairs alone:                      128
 2 cpuid+rdtsc pairs around null function:       138
 2 xor+cpuid+rdtsc triples alone:                128
 2 xor+cpuid+rdtsc triples around null function: 140
 previous with -O0 (others with -O):             140
 
 Apparently:
 - the rdtsc overhead of 12 cycles is paid for not quite every rdtsc
 - the cpuid overhead of 62 cycles is paid for not quite every cpuid
 - -O0 doesn't cost much
 - the P4 pipeline is about 388 - 140 = 248 cycles longer than the
   athlon-xp's.
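 
 In C, the "xor+cpuid+rdtsc triples" and the overhead subtraction look
 roughly like this.  It is only a sketch: the helper name is invented, a
 real run takes the minimum over many repetitions as in the table above,
 and with -fPIC on i386 the ebx clobber needs more care.
 
 #include <math.h>
 #include <stdint.h>
 #include <stdio.h>
 
 static volatile double sink;
 
 /* i386-only: zero eax, serialize with cpuid, then read the TSC. */
 static inline uint64_t
 serialized_rdtsc(void)
 {
     uint64_t tsc;
 
     __asm__ __volatile__("xorl %%eax, %%eax; cpuid; rdtsc"
         : "=A" (tsc) : : "ebx", "ecx");
     return (tsc);
 }
 
 int
 main(void)
 {
     uint64_t overhead, t0, t1;
 
     /* Cost of two back-to-back triples with nothing in between. */
     t0 = serialized_rdtsc();
     t1 = serialized_rdtsc();
     overhead = t1 - t0;
 
     /* Timed region; subtract the overhead before comparing anything. */
     t0 = serialized_rdtsc();
     sink = sin(0.5);
     t1 = serialized_rdtsc();
     printf("sin(0.5): %llu cycles (overhead %llu subtracted)\n",
         (unsigned long long)(t1 - t0 - overhead),
         (unsigned long long)overhead);
     return (0);
 }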
 
 > However, it is true that I am basically measuring latency and not
 > throughput.  Ordinarily, it is possible to execute FPU and CPU
 > instructions simultaneously, and the FPU may even have more than
 > one FU available for executing fptan.  The cpuid instructions
 > clear out the pipeline and destroy any parallelism that might have
 > been possible.  Your version does a better job of measuring
 > throughput.  You're also right that fdlibm tan() blows out about
 > 512 bytes of instruction cache.
 
 I couldn't see much evidence of parallelism in a simple benchmark.
 
 The main problem with using cpuid is that we don't really want
 to measure latency.  We know that the hardware math functions have
 large latency, so benchmarks that test latency are sure to show
 them not doing so well.
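 
 To separate the two, the loop can be written both ways: one version
 chains each result into the next call (pure latency), the other keeps
 the calls independent (closer to throughput).  Again only a sketch, with
 invented names and counts:
 
 #include <math.h>
 #include <stdint.h>
 #include <stdio.h>
 
 #define NITER 100000
 
 /* i386-only TSC read, as in the earlier sketch. */
 static inline uint64_t
 rdtsc(void)
 {
     uint64_t tsc;
 
     __asm__ __volatile__("rdtsc" : "=A" (tsc));
     return (tsc);
 }
 
 int
 main(void)
 {
     uint64_t t0, t1;
     double sum, x;
     int i;
 
     /* Latency: every call depends on the previous result. */
     x = 0.5;
     t0 = rdtsc();
     for (i = 0; i < NITER; i++)
         x = sin(x);
     t1 = rdtsc();
     printf("latency:    %.1f cycles/call (x = %g)\n",
         (double)(t1 - t0) / NITER, x);
 
     /* Throughput: the calls are independent of each other. */
     sum = 0;
     t0 = rdtsc();
     for (i = 0; i < NITER; i++)
         sum += sin(0.5 + i * 1e-6);
     t1 = rdtsc();
     printf("throughput: %.1f cycles/call (sum = %g)\n",
         (double)(t1 - t0) / NITER, sum);
     return (0);
 }
 
 Putting cpuid between every call turns both loops into the latency case,
 which is the problem described above.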
 
 > Anyway, I unfortunately don't have time for all this.  Do you want
 > the assembly versions of these to stay or not?  If so, it would be
 > great if you could fix them and make sure that the result isn't
 > obviously slower than fdlibm.  If not, I'll be happy to spend two
 > minutes making all those pesky bugs in them go away.  ;-)
 
 It seems that the hardware trig functions aren't worth using.  I want
 to test them on a 486 and consider the ranges more before discarding
 them.  This may take a while.
 
 I did a quick test of some other functions:
 - hardware sqrt is much faster
 - hardware exp is slightly faster on the range [1,100]
 - hardware atan is slower on the range [0,1.5]
 - hardware acos is much slower (139 nsec vs 57 nsec!) on the range [0,1.0].
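 
 To plug the corresponding hardware instructions into a timing loop like
 the first sketch, wrappers along these lines would do.  The names are
 invented and this is not how libm calls them; exp is omitted because the
 i387 has no single exp instruction (it takes an f2xm1/fscale sequence),
 and acos likewise is built from fsqrt and fpatan in the .S version.
 
 /* i386-only sketches; pair these with the timing loop above. */
 
 static inline double
 i387_sqrt(double x)
 {
     double y;
 
     __asm__("fsqrt" : "=t" (y) : "0" (x));
     return (y);
 }
 
 static inline double
 i387_atan(double x)
 {
     double y;
 
     /* fpatan gives atan(st(1)/st(0)) and pops, so atan(x) = atan(x/1). */
     __asm__("fld1; fpatan" : "=t" (y) : "0" (x));
     return (y);
 }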
 
 Bruce

