i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs

Wed Feb 9 23:30:19 PST 2005

The following reply was made to PR i386/67469; it has been noted by GNATS.

From: David Schultz <das at freebsd.org>
To: Bruce Evans <bde at zeta.org.au>
Cc: FreeBSD-gnats-submit at freebsd.org, freebsd-i386 at freebsd.org,
	bde at freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
Date: Thu, 10 Feb 2005 02:23:14 -0500

 In-Reply-To: <20050209232758.F3249 at epsplex.bde.org>

 On Thu, Feb 10, 2005, Bruce Evans wrote:
 > > I used the following sets
 > > of inputs:
 > >
 > > tbl1: small numbers
 > > ...
 > > tbl2: numbers on [-8pi,8pi] greater in magnitude than 2^-18
 > > ...
 > > tbl3: large numbers
 > > ...
 > > tbl4: special cases
 > 
 > This data may be too unusual.  Maybe the NaNs are slower.  Denormals
 > would probably be slower.

 The data in tbl2 are pretty usual, I think, and I measured all of
 the data points independently.  But yes, NaNs are slower, as the
 results for tbl4 indicate.

 Looking back, though, I did notice that very few of my inputs in
 tbl2 require argument reduction.  In your tests on [0..10], on the
 other hand, 92% of the inputs require argument reduction in
 fdlibm.  It would be interesting to see for which of your tests
 fdlibm is faster, and for which it is slower.  One possibility is
 that fdlibm is slower most of the time; another is that it is far
 slower for the close-to-pi/2 cases that the i387 gets wrong, and
 that messes up the averages.
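
 As a rough check on the 92% figure, here is a minimal sketch.  It
 assumes the test inputs are spread evenly over the interval (the
 actual distribution is not stated above) and relies on fdlibm's
 tan() skipping argument reduction only for |x| up to about pi/4.
 The helper name frac_needing_reduction is made up for illustration.

```c
#include <math.h>
#include <stddef.h>

/* pi/4: fdlibm's tan() calls its kernel directly at or below roughly
 * this magnitude and performs argument reduction above it. */
#define QUARTER_PI 0.78539816339744830962

/* Fraction of n evenly spaced sample points on [lo, hi] with |x| > pi/4.
 * Hypothetical helper for illustration; not part of fdlibm.
 * Assumption: inputs are uniformly distributed over the interval. */
static double frac_needing_reduction(double lo, double hi, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        /* Midpoint of the i-th of n equal subintervals. */
        double x = lo + (hi - lo) * ((double)i + 0.5) / (double)n;

        if (fabs(x) > QUARTER_PI)
            count++;
    }
    return (double)count / (double)n;
}
```

 Under that assumption, frac_needing_reduction(0.0, 10.0, 100000)
 comes out at about 0.92, i.e. 1 - (pi/4)/10.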

 > The synchronising cpuid here is responsible for a factor of 3 difference
 > for me.  Moving the rdtsc out of the loop gives the following changes
 > in cycle counts:
 > 
 >     2000 -> [944..1420]
 >     1000 -> 431
 >     400  -> 132
 > 
 > Each rdtsc() in the loop costs 75 cycles for tbl1, and actually using
 > the results costs another 120 cycles.
 > 
 > I think the cpuid is disturbing the timings too much.

 I don't care so much about the rdtsc overhead since I'm only
 measuring relative performance.  A null function is measured as
 taking 388 cycles on my Pentium 4, but some of that is due to gcc
 getting confused by the volatile variable and generating extra
 code at -O0.

 However, it is true that I am basically measuring latency and not
 throughput.  Ordinarily, it is possible to execute FPU and CPU
 instructions simultaneously, and the FPU may even have more than
 one functional unit available for executing fptan.  The cpuid instructions
 clear out the pipeline and destroy any parallelism that might have
 been possible.  Your version does a better job of measuring
 throughput.  You're also right that fdlibm tan() blows out about
 512 bytes of instruction cache.
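
 The difference between the two measurement styles can be sketched
 as follows.  This is not the benchmark code from either of us: the
 timer is passed in as a hypothetical hook (tick_fn), and fake_now
 and fake_work are stand-ins that advance a simulated counter so the
 effect is deterministic.  The 75-tick cost per timer read is taken
 from the rdtsc figure quoted above; the 100-tick cost per call is
 an arbitrary assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timer hook; in a real benchmark this might wrap rdtsc. */
typedef uint64_t (*tick_fn)(void);

/* Simulated cycle counter used by the stand-ins below. */
static uint64_t fake_ticks;

/* Stand-in timer: each read itself costs 75 ticks (illustrative value). */
static uint64_t fake_now(void)
{
    fake_ticks += 75;
    return fake_ticks;
}

/* Stand-in for the function under test: costs 100 ticks (assumed value). */
static double fake_work(double x)
{
    fake_ticks += 100;
    return x;
}

/* Latency style: read the timer around every single call.
 * Returns the average ticks per call, timer overhead included each time. */
static double time_per_call(tick_fn now, double (*f)(double),
                            const double *in, size_t n, volatile double *sink)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t t0 = now();
        *sink = f(in[i]);
        total += now() - t0;
    }
    return (double)total / (double)n;
}

/* Throughput style: read the timer once before and once after the loop,
 * so its overhead is paid once and spread over all n calls. */
static double time_batched(tick_fn now, double (*f)(double),
                           const double *in, size_t n, volatile double *sink)
{
    uint64_t t0 = now();

    for (size_t i = 0; i < n; i++)
        *sink = f(in[i]);
    return (double)(now() - t0) / (double)n;
}
```

 With these stand-ins and four inputs, time_per_call() reports 175
 ticks per call and time_batched() reports 118.75, against a
 simulated cost of 100; the batched figure approaches 100 as n
 grows.  The simulation only models timer overhead, not the lost
 overlap between FPU and integer instructions.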

 Anyway, I unfortunately don't have time for all this.  Do you want
 the assembly versions of these to stay or not?  If so, it would be
 great if you could fix them and make sure that the result isn't
 obviously slower than fdlibm.  If not, I'll be happy to spend two
 minutes making all those pesky bugs in them go away.  ;-)