long double broken on i386?

Wed Oct 10 13:47:02 PDT 2007

On Wed, Oct 03, 2007 at 11:13:38AM +1000, Bruce Evans wrote:
> On Tue, 2 Oct 2007, Steve Kargl wrote:
> 
> >On Tue, Oct 02, 2007 at 01:23:18PM -0400, David Schultz wrote:
> >>...
> >>Anyway, my point is that if you have something that works
> >>reasonably well and doesn't have egregious errors, my suggestion
> >>is to just commit it and not kill yourself over whether the
> >>argument reduction is correct in the last ulp.
> 
> >There are a few problems: 1) I don't have commit access.  2) I
> >need to do style(9) clean-up pass.  3) I've only tested these
> >on i386 and amd64, and I know these fail for sparc64. 4) Most
> >importantly, I don't know how the project wants to handle the
> >53 vs 64 bit default fpu setting on i386.
> 
> (3) is most important.  The C versions will never even be used for
> i386 or amd64, since these arches have sin and cos in hardware and (I
> believe) it is impossible to beat the hardware on these arches.  The
> asm versions have always been trivial to implement (change 2 bytes in
> s_sin.S (compare e_sqrt.S with e_sqrtf.S; the long double changes go
> the other way).  amd64 actually uses the C versions for double precision
> becase the C versions are competitive with the hardware for speed and
> beat it for accuracy on large args and on small args aear a multiple
> of pi/2.

I've given up on i386.  All of my attempts to make my C implementations
work on both amd64 and i386 have failed.  The code when run on i386 
results in ULP's that exceed 10^5.  Yes, that is 1E5.  Results for amd64
are reported below.  As far as assembly, someone else will need to write
those routines because I do not know assembly.

> (4): These functions are useless unless the default is changed.  Users
> have always been able to get equivalent _in_accuracy on i386 by just
> using the double precision versions.  Hacks like #define sinl sin
> almost work for just getting things to compile.

AFAIK, C99 says the prototype in math.h is "long double sinl(long double);".
The #define hack simply won't cut it.

> In fact, on i386, due
> to the bugfeatures outlined in my previous mail, users can even get
> full long double accuracy using hacks like that, even if the rounding
> precision is 53 bits:

(code removed)

> On i386, this stores the full long double result for sinl(1) in x, due
> to the bugfeatures -- sin(1) returns extra precision in a register, and
> gcc doesn't understand the extra precision so it just stores all bits
> of the register in x.  The behaviour would be similar if sin(1) were used
> in an expression.

Given GCC's history, relying on the accidental full long double
accuracy because FP registers are 80 bits seems to be folly.  It
also isn't clear to me what sin(x) will do with x > DBL_MAX.  Will
1e400L be interpreted as +Inf or the long double value 1e400?  Yes,
I'm fully aware that such a large argument may not be the greatest
idea due to argument reduction.

> >PS: There is also the minor possibility of giving bde either
> >a heartache or death via laughter when he finally sees what I
> >have done. :)
> 
> I would be unhappy with any versions that aren't either:
> - as identical as possible in algorithm and style to the double precision
>   versions, in the same way that the float versions are as identical as
>   possible, or
> - much better in algorithm and style and efficiency and accuracy than the
>   double precision versions.  I don't want to maintain different versions.

I'll go with your second option see comments below and attached
code.  BTW, if I add

#ifdef sinl
#undef sinl
#endif
#define sinl sin

to the top of my sin_test.c code, I get ULPs that exceed 10^9.

>   I tried to turn your original efforts for logl() into the latter for
>   logf(), log() and logl(), but so far they are only a little better.

I lost my copy of logl() due to fat fingering a rm command and
various hard drive/file system issues.  It probably is contained
on one of my several back up tapes, but I've been too lazy to 
do the restore.

Back to sinl, cosl, tanl.  The attached diff contains everything
that one needs for these functions on an architecture with 64-bit
significand long double.  Note that portions of the diff for Makefile
and math.h contain irrelevant changes that are only relevant for
my ccosh, sqrtl, and cbrtl changes.  Also note that the Makefile only
includes my long double routines when LDBL_PREC == 64.  My testing on
FreeBSD-amd64 shows

troutmask:sgk[219] ./sin_test 4000000 10 400000000
      Number of test loops to run: 10
Number of PRN in [0,4.000000e+08] per run: 4000000
  ULP
1.0000 (5)
1.0000 (4)
1.0000 (6)
1.0000 (4)
1.0000 (3)
1.0000 (7)
1.0000 (3)
1.0000 (7)
1.0000 (5)
1.0000 (5)
       ---
        49

The number in parenthesis is the count of values that give a sinl(x)
that exceeds 0.5 ULP.  On the reduced range of [0,pi/4] where
__kernel_cosl and __kernel_sinl are used, in all my testing no value
of x in [0,pi/4] generated a sinl(x) that exceeded 0.5 ULP.  Thus, I
suspect (but haven't tried to confirm), the 1.0 ULP observed above is
due to the argument reduction.

My timings on the basic routines using gprof are:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
  0.6     804.42     5.67 40000000     0.00     0.00  sinl [18]
  0.1     934.41     0.74 19998751     0.00     0.00  __kernel_sinl [95]
  0.1     935.82     0.66 20001249     0.00     0.00  __kernel_cosl [96]

where the system contains a dual-cpu opteron tyan motherboard running
at 2.2 GHz.

The results for cosl() are

troutmask:sgk[224] ./cos_test 4000000 10 400000000
      Number of test loops to run: 10
Number of PRN in [0,4.000000e+08] per run: 4000000
  ULP
1.0000 (3)
1.0000 (4)
1.0000 (2)
1.0000 (1)
1.0000 (7)
1.0000 (5)
0.5000 (0)
1.0000 (3)
1.0000 (9)
1.0000 (5)
       ---
        39

Note, the 40 million PRN are not the same as those used in the
the testing of sinl().

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
  0.6     829.38     5.73 40000000     0.00     0.00  cosl [18]
  0.1     936.85     0.70 20000178     0.00     0.00  __kernel_sinl [80]
  0.1     938.20     0.67 19999822     0.00     0.00  __kernel_cosl [81]

The testing of tanl() is a little more interesting.

troutmask:sgk[227] ./tan_test 4000000 10 400000000
      Number of test loops to run: 10
Number of PRN in [0,4.000000e+08] per run: 4000000
  ULP
1.0000 (24788)
1.0000 (24633)
1.5000 (24626)
1.0000 (24820)
1.0000 (24652)
1.5000 (24526)
1.0000 (24749)
1.0000 (24873)
1.0000 (24656)
1.0000 (24701)
       -------
        247024

  0.5     911.88     5.80 40000000     0.00     0.00  tanl [16]
  0.5     933.18     5.01 40000000     0.00     0.00  __kernel_tanl [53]

At first glance, the 1.5 ULP and the total count of values of x
that generate a result with a ULP that exceeds 0.5 seems disconcerting.
However, testing of __kernel_tanl() shows that for all tested values of x
in [0,pi/4] the maximum ULP did exceed 0.5.  This suggests that tanl() is
more sensitive to errors in the argument reduction than sinl() and cosl().

Do with the code as you see fit.  I'll be moving on to hypothl and
other functions as I have time.

-- 
Steve
-------------- next part --------------
A non-text attachment was scrubbed...
Name: msun.diff
Type: text/x-diff
Size: 29740 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-standards/attachments/20071010/554db907/msun.bin