i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs

Wed Feb 9 23:23:34 PST 2005

On Thu, Feb 10, 2005, Bruce Evans wrote:
> > I used the following sets of inputs:
> >
> > tbl1: small numbers
> > ...
> > tbl2: numbers on [-8pi,8pi] greater in magnitude than 2^-18
> > ...
> > tbl3: large numbers
> > ...
> > tbl4: special cases
> 
> This data may be too unusual.  Maybe the NaNs are slower.  Denormals
> would probably be slower.

The data in tbl2 are pretty usual, I think, and I measured all of
the data points independently.  But yes, NaNs are slower, as the
results for tbl4 indicate.

Looking back, though, I did notice that very few of my inputs in
tbl2 require argument reduction.  In your tests on [0..10], on the
other hand, 92% of the inputs require argument reduction in
fdlibm.  It would be interesting to see for which of your tests
fdlibm is faster, and for which it is slower.  One possibility is
that fdlibm is slower most of the time; another is that it is far
slower for the close-to-pi/2 cases that the i387 gets wrong, and
that messes up the averages.
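
As a rough check on the 92% figure, here is a minimal sketch.  It
assumes the inputs are spread evenly over [0, 10] (the thread does not
say how they were chosen) and that fdlibm's tan() skips argument
reduction when |x| is at most about pi/4.  The function name and the
cutoff macro are made up for illustration.

```c
#include <stddef.h>

/*
 * Approximate cutoff: fdlibm's tan() skips argument reduction when
 * |x| is at most about pi/4.
 */
#define REDUCTION_CUTOFF 0.78539816339744831

/*
 * Fraction of n evenly spaced points on [lo, hi] whose magnitude exceeds
 * the cutoff.  The even spacing is an assumption made for illustration.
 */
double
fraction_needing_reduction(double lo, double hi, size_t n)
{
	size_t i, count = 0;

	if (n < 2)
		return (0.0);
	for (i = 0; i < n; i++) {
		double x = lo + (hi - lo) * (double)i / (double)(n - 1);
		double ax = x < 0 ? -x : x;

		if (ax > REDUCTION_CUTOFF)
			count++;
	}
	return ((double)count / (double)n);
}
```

With lo = 0 and hi = 10 this comes out a little over 0.92, that is,
about 1 - (pi/4)/10, which is consistent with the 92% above.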

> The synchronising cpuid here is responsible for a factor of 3 difference
> for me.  Moving the rdtsc out of the loop gives the following changes
> in cycle counts:
> 
>     2000 -> [944..1420]
>     1000 -> 431
>     400  -> 132
> 
> Each rdtsc() in the loop costs 75 cycles for tbl1, and actually using
> the results costs another 120 cycles.
> 
> I think the cpuid is disturbing the timings too much.

I don't care so much about the rdtsc overhead since I'm only
measuring relative performance.  A null function is measured as
taking 388 cycles on my Pentium 4, but some of that is due to gcc
getting confused by the volatile variable and generating extra
code at -O0.

However, it is true that I am basically measuring latency and not
throughput.  Ordinarily, it is possible to execute FPU and CPU
instructions simultaneously, and the FPU may even have more than one
functional unit available for executing fptan.  The cpuid instructions
clear out the pipeline and destroy any parallelism that might have
been possible.  Your version does a better job of measuring
throughput.  You're also right that fdlibm tan() blows out about
512 bytes of instruction cache.
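
To make the latency/throughput distinction concrete, here is a minimal
sketch of the two measurement styles.  It is not the harness used by
either of us; every name in it is hypothetical.  It assumes GCC or
clang on x86 for the cpuid and rdtsc intrinsics, and falls back to the
much coarser clock() elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#include <x86intrin.h>

/* Serialise with cpuid, then read the time-stamp counter. */
uint64_t
read_timer(void)
{
	unsigned int a, b, c, d;

	__cpuid(0, a, b, c, d);
	(void)a; (void)b; (void)c; (void)d;
	return (__rdtsc());
}
#else
#include <time.h>

/* Non-x86 fallback: coarse CPU time from clock(), no serialisation. */
uint64_t
read_timer(void)
{
	return ((uint64_t)clock());
}
#endif

/* Baseline: timing this estimates the cost of the harness itself. */
double
null_func(double x)
{
	return (x);
}

/* Latency style: serialise and read the timer around every call. */
uint64_t
ticks_per_call(double (*f)(double), const double *in, size_t n, double *sink)
{
	uint64_t t0, total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		t0 = read_timer();
		*sink += f(in[i]);
		total += read_timer() - t0;
	}
	return (n ? total / n : 0);
}

/* Throughput style: read the timer once around the whole loop. */
uint64_t
ticks_whole_loop(double (*f)(double), const double *in, size_t n, double *sink)
{
	uint64_t t0, total;
	size_t i;

	t0 = read_timer();
	for (i = 0; i < n; i++)
		*sink += f(in[i]);
	total = read_timer() - t0;
	return (n ? total / n : 0);
}
```

Passing tan from <math.h> and then null_func to the same routine, and
comparing the two results, gives a relative figure.  The absolute
numbers include timer overhead in both styles, but far more of it in
the per-call one.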

Anyway, I unfortunately don't have time for all this.  Do you want
the assembly versions of these to stay or not?  If so, it would be
great if you could fix them and make sure that the result isn't
obviously slower than fdlibm.  If not, I'll be happy to spend two
minutes making all those pesky bugs in them go away.  ;-)