long double broken on i386?

Bruce Evans brde at optusnet.com.au
Sat Nov 3 15:55:21 PDT 2007


This thread is old, so I quoted too much.

On Wed, 10 Oct 2007, Steve Kargl wrote:

> On Wed, Oct 03, 2007 at 11:13:38AM +1000, Bruce Evans wrote:
>> On Tue, 2 Oct 2007, Steve Kargl wrote:
>>
>>> On Tue, Oct 02, 2007 at 01:23:18PM -0400, David Schultz wrote:
>>>> ...
>>>> Anyway, my point is that if you have something that works
>>>> reasonably well and doesn't have egregious errors, my suggestion
>>>> is to just commit it and not kill yourself over whether the
>>>> argument reduction is correct in the last ulp.
>>
>>> There are a few problems: 1) I don't have commit access.  2) I
>>> need to do style(9) clean-up pass.  3) I've only tested these
>>> on i386 and amd64, and I know these fail for sparc64. 4) Most
>>> importantly, I don't know how the project wants to handle the
>>> 53 vs 64 bit default fpu setting on i386.
>>
>> (3) is most important.  The C versions will never even be used for
>> i386 or amd64, since these arches have sin and cos in hardware and (I
>> believe) it is impossible to beat the hardware on these arches.  The
>> asm versions have always been trivial to implement (change 2 bytes in
>> s_sin.S; compare e_sqrt.S with e_sqrtf.S, except the long double changes
>> go the other way).  amd64 actually uses the C versions for double precision
>> because the C versions are competitive with the hardware for speed and
>> beat it for accuracy on large args and on small args near a multiple
>> of pi/2.
>
> I've given up on i386.  All of my attempts to make my C implementations
> work on both amd64 and i386 have failed.  The code when run on i386
> results in ULPs that exceed 10^5.  Yes, that is 1E5.  Results for amd64
> are reported below.  As far as assembly, someone else will need to write
> those routines because I do not know assembly.

Bad errors are expressed in GULPs (Giga-ULPs) :-).  Maybe you are hitting
the compiler bugs that break double-precision arithmetic on i386's when
the rounding precision is changed to long double.  The rounding precision
must be changed for long doubles to work at all.  Then the compiler bugs
probably affect even long doubles, since you use k_rem_pio2.c and it uses
long doubles and is believed to be affected by the bugs.  k_rem_pio2f.c
_was_ affected by the corresponding bugs from float precision being smaller
than the default double precision, and has some fixes, but the fixes may
be incomplete and have not been merged to k_rem_pio2.c.
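
For reference, a minimal sketch of how the rounding precision gets switched
on FreeBSD/i386 (assuming the fpsetprec()/FP_PE interface in <ieeefp.h>;
exact usage may differ):

	#include <ieeefp.h>

	static void
	use_extended_precision(void)
	{
		/* sketch: switch the i387 to 64-bit (extended) precision */
		fp_prec_t oprec = fpsetprec(FP_PE);

		/* ... long double arithmetic now gets full precision ... */

		fpsetprec(oprec);	/* restore the previous (53-bit) setting */
	}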

>> (4): These functions are useless unless the default is changed.  Users
>> have always been able to get equivalent _in_accuracy on i386 by just
>> using the double precision versions.  Hacks like #define sinl sin
>> almost work for just getting things to compile.
>
> AFAIK, C99 says the prototype in math.h is "long double sinl(long double);".
> The #define hack simply won't cut it.

I only said that it will almost work.  It will give almost
"long double sinl(double);", since "double sin(double);" gives almost
"long double sin(double);".

>> In fact, on i386, due
>> to the bugfeatures outlined in my previous mail, users can even get
>> full long double accuracy using hacks like that, even if the rounding
>> precision is 53 bits:
>
> (code removed)
>
>> On i386, this stores the full long double result for sinl(1) in x, due
>> to the bugfeatures -- sin(1) returns extra precision in a register, and
>> gcc doesn't understand the extra precision so it just stores all bits
>> of the register in x.  The behaviour would be similar if sin(1) were used
>> in an expression.
>
> Given GCC's history, relying on the accidental full long double
> accuracy because FP registers are 80 bits seems to be folly.  It

Isn't 20 years of history of doing it like this (slightly wrong) enough?
I only said it is a hack.  However, the library can depend on the
current compiler behaviour since it is tied to the compiler.
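
A minimal sketch of the behaviour being described (my illustration, not the
code that was trimmed from the quote above):

	#include <math.h>

	long double
	extra_precision_demo(void)
	{
		/*
		 * On i386 with 64-bit rounding precision, sin() returns its
		 * result in ST(0) with extra precision; gcc stores all bits
		 * of that register when the target is long double.
		 */
		long double x = sin(1);

		return (x);
	}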

> also isn't clear to me what sin(x) will do with x > DBL_MAX.  Will
> 1e400L be interpreted as +Inf or the long double value 1e400?  Yes,
> I'm fully aware that such a large argument may not be the greatest
> idea due to argument reduction.

sin(x) (and sinl(x) if sinl is #defined as sin) must pass x as a double
like the prototype says.  Since the i386 ABI requires passing doubles
on the stack, any extra precision and argument range will be lost.
In particular, values > DBL_MAX will be converted to +Inf (except I'm
not sure if rounding down of a value just slightly larger than DBL_MAX
can give DBL_MAX).
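
A sketch of what that means for the 1e400L question (assuming the usual
round-to-nearest conversion):

	#include <math.h>
	#include <stdio.h>

	int
	main(void)
	{
		long double big = 1e400L;   /* finite as long double, > DBL_MAX */

		/*
		 * big is narrowed to double at the call: it becomes +Inf,
		 * and sin(+Inf) is NaN (with an invalid exception).
		 */
		printf("%g\n", sin(big));
		return (0);
	}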

>>> PS: There is also the minor possibility of giving bde either
>>> a heartache or death via laughter when he finally sees what I
>>> have done. :)
>>
>> I would be unhappy with any versions that aren't either:
>> - as identical as possible in algorithm and style to the double precision
>>   versions, in the same way that the float versions are as identical as
>>   possible, or
>> - much better in algorithm and style and efficiency and accuracy than the
>>   double precision versions.  I don't want to maintain different versions.
>
> I'll go with your second option; see comments below and attached
> code.  BTW, if I add

So I'm unhappy that the first point hasn't been addressed at all.  This
gives quality < the current float and double precision versions.

> #ifdef sinl
> #undef sinl
> #endif
> #define sinl sin
>
> to the top of my sin_test.c code, I get ULPs that exceed 10^9.

I guess this is from loss of exponent range for args < DBL_MIN.
Gradual underflow (denormals) only reduces this loss a little.  After
underflow to 0, I think the loss is about 2^<max difference in exponents>
~= 2^(16384-1024) ULPs = many GULPs.  Between DBL_MIN and DBL_MAX,
the loss is about 2^<difference in precision> = 2^11.  Between DBL_MAX
and LDBL_MAX, the loss is infinity ULPs.
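
A sketch of the underflow-to-0 case (illustration only):

	#include <math.h>

	long double
	tiny_arg_demo(void)
	{
		long double x = 0x1p-10000L;	/* fine as a long double */

		/*
		 * With the #define hack, x is narrowed to double and
		 * underflows to +0.0, so this returns 0 instead of the
		 * correct sinl(x) ~= x; every bit of the result is lost.
		 */
		return (sin(x));
	}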

> Back to sinl, cosl, tanl.  The attached diff contains everything
> that one needs for these functions on an architecture with 64-bit
> significand long double.  Note that portions of the diff for Makefile
> and math.h contain irrelevant changes that are only relevant for
> my ccosh, sqrtl, and cbrtl changes.  Also note that the Makefile only
> includes my long double routines when LDBL_PREC == 64.  My testing on
> FreeBSD-amd64 shows
>
> troutmask:sgk[219] ./sin_test 4000000 10 400000000
>      Number of test loops to run: 10
> Number of PRN in [0,4.000000e+08] per run: 4000000
>  ULP
> 1.0000 (5)
> 1.0000 (4)
> 1.0000 (6)
> 1.0000 (4)
> 1.0000 (3)
> 1.0000 (7)
> 1.0000 (3)
> 1.0000 (7)
> 1.0000 (5)
> 1.0000 (5)
>       ---
>        49
>
> The number in parentheses is the count of values that give a sinl(x)
> that exceeds 0.5 ULP.  On the reduced range of [0,pi/4] where
> __kernel_cosl and __kernel_sinl are used, in all my testing no value
> of x in [0,pi/4] generated a sinl(x) that exceeded 0.5 ULP.  Thus, I
> suspect (but haven't tried to confirm), the 1.0 ULP observed above is
> due to the argument reduction.

The argument reduction seems to begin correctly, but then the extra
precision which is carefully provided by k_rem_pio2.c is thrown away,
unlike in the float and double versions which pass it to the __kernel
functions.  I would expect a loss of up to a full ULP from this, giving
a maximum error of 1.5 ULPs (0.5 from the __kernel and 1 from the arg
reduction).
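
For comparison, the double version (s_sin.c, from memory, so details may
differ slightly) keeps the reduced argument as a head/tail pair and hands
both words to the __kernel:

	n = __ieee754_rem_pio2(x, y);	/* y[0] + y[1] = reduced arg */
	switch (n & 3) {
	case 0: return ( __kernel_sin(y[0], y[1], 1));
	case 1: return ( __kernel_cos(y[0], y[1]));
	case 2: return (-__kernel_sin(y[0], y[1], 1));
	default:
		return (-__kernel_cos(y[0], y[1]));
	}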

> My timings on the basic routines using gprof are:
>
>  %   cumulative   self              self     total
> time   seconds   seconds    calls  ms/call  ms/call  name
>  0.6     804.42     5.67 40000000     0.00     0.00  sinl [18]
>  0.1     934.41     0.74 19998751     0.00     0.00  __kernel_sinl [95]
>  0.1     935.82     0.66 20001249     0.00     0.00  __kernel_cosl [96]
>
> where the system contains a dual-cpu opteron tyan motherboard running
> at 2.2 GHz.

That seems to be about 175 ns/call.  Not too bad.  (My current times on
a 2.2 GHz A64*1 in i386 mode on small args ([-2Pi,2Pi]) are 70 ns for
the C double version, 43 ns for the asm double version, and 15 ns for
the C float version).

> ...
> The testing of tanl() is a little more interesting.
>
> troutmask:sgk[227] ./tan_test 4000000 10 400000000
>      Number of test loops to run: 10
> Number of PRN in [0,4.000000e+08] per run: 4000000
>  ULP
> 1.0000 (24788)
> 1.0000 (24633)
> 1.5000 (24626)
> 1.0000 (24820)
> 1.0000 (24652)
> 1.5000 (24526)
> 1.0000 (24749)
> 1.0000 (24873)
> 1.0000 (24656)
> 1.0000 (24701)
>       -------
>        247024
>
>  0.5     911.88     5.80 40000000     0.00     0.00  tanl [16]
>  0.5     933.18     5.01 40000000     0.00     0.00  __kernel_tanl [53]
>
> At first glance, the 1.5 ULP and the total count of values of x
> that generate a result with a ULP that exceeds 0.5 seem disconcerting.
> However, testing of __kernel_tanl() shows that for all tested values of x
> in [0,pi/4] the maximum ULP did not exceed 0.5.  This suggests that tanl() is
> more sensitive to errors in the argument reduction than sinl() and cosl().

This is because tan() has a larger derivative than sin() and cos().
E.g., tan'(Pi/4) = 2; this doubles any error in the arg reduction when
the reduced arg is near Pi/4.  The fdlibm __kernels take extra care near
Pi/4 to reduce these errors.  Also, between Pi/4 and Pi/2, the division
in tan(x) = -1 / tan(x - Pi/2) gives an extra error of up to 0.5 ULPs.
The fdlibm __kernels take extra care to reduce this error.  With 3 or 4
sources of errors in the final steps, with each error of up to 0.5 ULPs,
it takes lots of extra care to keep the final error below 1 ULP.

> Do with the code as you see fit.  I'll be moving on to hypotl and
> other functions as I have time.

I wouldn't use it directly since it doesn't copy fdlibm to get the final
steps right.  Its earlier steps are better, but I would prefer those to
copy fdlibm too, so that there aren't many different algorithms to maintain.
Note that the full long double precision versions require about the same
amount of code as the float and double precision versions combined, since
there are 2 very different long double precision formats on supported
machines (80 active bits for most, but 128 bits for sparc64).
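
A sketch of the compile-time split that implies (the usual <float.h> test;
how the files are organized is a separate question):

	#include <float.h>

	#if LDBL_MANT_DIG == 64
	/* Intel extended format (i386/amd64/ia64): one set of coefficients */
	#elif LDBL_MANT_DIG == 113
	/* IEEE quad format (sparc64): a longer, different set */
	#else
	#error "unsupported long double format"
	#endif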

% Index: msun/src/k_cosl.c
% ===================================================================
% RCS file: msun/src/k_cosl.c
% diff -N msun/src/k_cosl.c
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ msun/src/k_cosl.c	9 Oct 2007 21:52:32 -0000
% @@ -0,0 +1,62 @@
% +/*
% + * Compute cos(x) on the interval [-1:1] with a polynomial approximation.

Interval is larger than necessary.  This might result in unnecessary
coefficients.  10 of them seem to be too many for 64 bits of mantissa
but not enough for 113 bits (sparc64).

% + * The soefficients are determined from a Chebyshev interpolation scheme,

Spelling.

% + * In practice, this routine is called only for x in the interval [0:pi/4].
% + * Limited testing on pseudorandom numbers drawn within [0:pi/4] shows
% + * an accuracy of <= 0.5 ULP.
% + */

I think testing is neither necessary nor sufficient.  One just finds a
polynomial that has enough accuracy in infinite precision (the poly
should have about 8 more bits of precision than the number of mantissa
bits), then checks that there is no magic in the coefficients that causes
problems (a large second coefficient will cause problems, especially
if it is not exactly representable.  Here 0.5 is quite large, but is
exactly representable, and for sin*(), 0.333 is just small enough not
to cause problems despite it not being exactly representable).

% +#define C0  (long double)  0x1.000000000000000p0L
% +#define C1  (long double) -0x8.000000000000000p-4L

Various style bugs.  The casts have no effect.

% +long double
% +__kernel_cosl(long double x)
% +{
% +	long double ys, fs;
% +
% +	ys = x * x;
% +
% +	fs = C0 + ys * (C1 + ys * (C2 + ys * (C3 + ys * (C4 +
% +	    ys * (C5 + ys * (C6 + ys * (C7 + ys * (C8 + ys * C9))))))));

Better scheduling is possible.
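
For example (sketch only, reusing the C0..C9 defines and the ys/fs locals
from the patch above): splitting the single Horner chain into two shorter
chains in ys*ys roughly halves the dependency depth:

	long double ys2, even, odd;

	ys2 = ys * ys;
	even = C0 + ys2 * (C2 + ys2 * (C4 + ys2 * (C6 + ys2 * C8)));
	odd  = C1 + ys2 * (C3 + ys2 * (C5 + ys2 * (C7 + ys2 * C9)));
	fs = even + ys * odd;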

The C1 term does in fact cause problems due to it being large.  I
think writing the expression as (C0 + ys * C1 + ...) gives more
accuracy by taking advantage of C1 being exact, but that is not enough.
fdlibm __kernel cos takes extra care with the C1 term.
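
Roughly the idea (a sketch with plain Taylor coefficients for illustration,
not the minimax coefficients or the exact fdlibm code): evaluate the tail
separately and recover the rounding error of the 1 - z/2 subtraction before
adding the tail.

	static double
	cos_kernel_sketch(double x)
	{
		double z, r, hz, w;

		z = x * x;
		/* tail of the series, starting at the x^4 term */
		r = z * z * (1.0 / 24 - z * (1.0 / 720 - z * (1.0 / 40320)));
		hz = 0.5 * z;
		w = 1.0 - hz;		/* rounded 1 - z/2 */
		/* (1.0 - w) - hz is the rounding error of that subtraction */
		return (w + ((1.0 - w) - hz + r));
	}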

% Index: msun/src/k_sinl.c

Similarly.

% Index: msun/src/k_tanl.c
% ===================================================================
% RCS file: msun/src/k_tanl.c
% diff -N msun/src/k_tanl.c
% ...
% +#define C33  (long double)  0x1.8f939ed1883f595p-32L

This has far too many terms for a 64-bit mantissa, but not enough for a
113-bit mantissa.  IIRC, the former requires about 25 terms.

% +long double
% +__kernel_tanl(long double x)
% +{
% +	long double t, xs;
% + 
% +	xs = x * x;
% +
% +	t = C0 + xs * (C1 + xs * (C2 + xs * (C3 + xs * (C4 + xs *
% +	    (C5  + xs * (C6  + xs * (C7  + xs * (C8  + xs * (C9 + xs *
% +	    (C10 + xs * (C11 + xs * (C12 + xs * (C13 + xs * (C14 + xs *
% +	    (C15 + xs * (C16 + xs * (C17 + xs * (C18 + xs * (C19 + xs *
% +	    (C20 + xs * (C21 + xs * (C22 + xs * (C23 + xs * (C24 + xs *
% +	    (C25 + xs * (C26 + xs * (C27 + xs * (C28 + xs * (C29 + xs *
% +	    (C30 + xs * (C31 + xs *
% +	    (C32 + xs * C33))))))))))))))))))))))))))))))));

With so many terms, better scheduling is more needed.

Much more extra care is needed for an accuracy of < 1 ULP for tan*() than
for sin*() and cos*().

% Index: msun/src/s_cosl.c
% ...
% +/*
% + * Table of constants for 2/pi, 396 Hex digits (476 decimal) of 2/pi.
% + * The table was taken from e_rem_pio2.c.
% + */
% +static const int32_t two_over_pi[] = {
% +0xA2F983, 0x6E4E44, 0x1529FC, 0x2757D1, 0xF534DD, 0xC0DB62, 
% +0x95993C, 0x439041, 0xFE5163, 0xABDEBB, 0xC561B7, 0x246E3A,

This gives far too many copies of this table (instead of just too many).
The float and double versions avoid having a copy for each of cos/sin/tan
by doing arg reduction in e_rem_pio2[f].c.  The extra layer for this file
is a pessimization but seems to be worth it to avoid the duplication.

Are 396 hex digits enough?  It can't be right for all of the float, double
and long double cases.  I hope it is enough for 128-bit long doubles, and
the float case is just sloppy in using the same table as the double case.
If 396 hex digits is enough for all cases, then the table should be in
k_rem_pio2.c.

% +long double
% +cosl(long double x)
% +{
% +	union IEEEl2bits z;

I want to avoid using this union...

% ...
% +	z.e = x;
% +	z.bits.sign = 0;

... since, while code like this is easy to write, it hard-codes the
assumption that things are laid out in bit-fields.  Access macros
are more flexible and don't require so much compiler intelligence
to optimize away the memory accesses that are logically required for
bit-fields.  gcc does a mediocre job with this optimization,
especially for 80-bit long doubles on amd64 and i386 since the bits
don't fit evenly in registers.
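
Something along these lines (macro names and the byte offset are my
assumptions, modelled on the GET_HIGH_WORD/SET_HIGH_WORD style in
math_private.h): only the macro knows where the sign/exponent word lives.

	#include <stdint.h>
	#include <string.h>

	/* assumption: little-endian, 64-bit mantissa first, then expsign */
	#define	LDBL_EXPSIGN_OFFSET	8

	#define	GET_LDBL_EXPSIGN(es, x) do {				\
		long double __x = (x);					\
		memcpy(&(es), (const char *)&__x + LDBL_EXPSIGN_OFFSET, 2); \
	} while (0)

	#define	SET_LDBL_EXPSIGN(x, es) do {				\
		uint16_t __es = (es);					\
		memcpy((char *)&(x) + LDBL_EXPSIGN_OFFSET, &__es, 2);	\
	} while (0)

	/*
	 * e.g., |x| without touching the mantissa:
	 *	uint16_t es; GET_LDBL_EXPSIGN(es, x);
	 *	SET_LDBL_EXPSIGN(x, es & 0x7fff);
	 */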

I see parametrizing the bit-field sizes and other magic numbers (not
quoted) for various formats of long doubles as the main difficulty
here (trig functions are easy except for this and k_rem_pio2.c).

% ...
% +	for (z.e = 0, i = nx - 1; i >= 0; i--)
% +		z.e += (long double) yd[i];

Here the extra precision that is provided by k_rem_pio2.c is thrown away.

The cast has no effect.

Casts are not followed by a space in KNF.

% +
% +	switch (e0 & 3) {
% +	case 0:
% +	    z.e = __kernel_cosl(z.e);
% +	    break;

fdlibm __kernels pass an extra arg here to avoid loss of precision (except
I changed the float kernels to pass the extra precision in a double).

% Index: msun/src/s_sinl.c

Similarly.

% Index: msun/src/s_tanl.c

Similarly.

% ...
% +	switch (e0 & 3) {
% + ...
% +	case 1:
% +	case 3:
% +	    z.e = - 1.L / __kernel_tanl(z.e);
% +	    break;

The fdlibm code has to do this division in the __kernel since doing it carefully
takes too much code to inline here (there are two cases, one for z.e near
Pi/4 where everything is done differently, and the general case where
things are simple except for the division).
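
Roughly what that careful division looks like (a sketch loosely following
the double-precision k_tan.c from memory; the names and details here are
mine): the denominator and the rounded reciprocal are each split into
high/low halves and a correction term is added, instead of trusting a
plain division.

	#include <stdint.h>
	#include <string.h>

	/* zero the low 32 mantissa bits (stand-in for fdlibm's SET_LOW_WORD) */
	static double
	hi_half(double x)
	{
		uint64_t u;

		memcpy(&u, &x, sizeof(u));
		u &= 0xffffffff00000000ULL;
		memcpy(&x, &u, sizeof(u));
		return (x);
	}

	/* accurate -1/(x + r): x is the reduced arg, r the polynomial tail */
	static double
	neg_recip(double x, double r)
	{
		double w, z, v, a, t, s;

		w = x + r;		/* rounded tan(y) */
		z = hi_half(w);
		v = r - (z - x);	/* z + v reconstructs x + r */
		t = a = -1.0 / w;	/* rounded reciprocal */
		t = hi_half(t);
		s = 1.0 + t * z;	/* tiny: t*z is close to -1 */
		return (t + a * (s + t * v));
	}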

Note that the fdlibm __kernel code for trig functions is generic except for:

- my optimizations in the float case
- the magic number for the threshold near Pi/4 for tan()
- the number and values of terms in the polynomial, and the scheduling
   (best scheduling depends on the number of terms, and we don't have this
   automated).

Thus it is easier to clone the correct code than to rewrite it incorrectly.

We should investigate whether bit fiddling for the threshold and other
things in __kernel tan are needed.  I think the threshold calculation
would be faster without bit fiddling, but bit fiddling is used for it
because the bits are needed anyway.

Bruce

