long double broken on i386?

Steve Kargl sgk at troutmask.apl.washington.edu
Sat Nov 3 14:31:46 PDT 2007


On Sat, Nov 03, 2007 at 03:53:19PM +1100, Bruce Evans wrote:
> On Wed, 10 Oct 2007, Steve Kargl wrote:
> >I've given up on i386.  All of my attempts to make my C implementations
> >work on both amd64 and i386 have failed.  The code when run on i386
> >results in ULP's that exceed 10^5.  Yes, that is 1E5.  Results for amd64
> >are reported below.  As far as assembly, someone else will need to write
> >those routines because I do not know assembly.
> 
> Bad errors are expressed in GULPs (Giga-ULPs) :-).  Maybe you are hitting
> the compiler bugs that break double-precision arithmetic on i386's when
> the rounding precision is changed to long double.  The rounding precision
> must be changed for long doubles to work at all.  Then the compiler bugs
> probably affect even long doubles, since you use k_rem_pio2.c and it uses
> long doubles and is believed to be affected by the bugs.  k_rem_pio2f.c
> _was_ affected by the corresponding bugs from float precision being smaller
> than the default double precision, and has some fixes, but the fixes may
> be incomplete and have not been merged to k_rem_pio2.c.

I did some limited testing of the output from k_rem_pio2.c
against argument reduction using MPFR.  I found no long
double values that gave results which did not agree.  I'm fairly
certain that the GULPs are due to improper polynomial approximations
for the 53-bit precision long doubles on i386.

> >>(4): These functions are useless unless the default is changed.  Users
> >>have always been able to get equivalent _in_accuracy on i386 by just
> >>using the double precision versions.  Hacks like #define sinl sin
> >>almost work for just getting things to compile.
> >
> >AFAIK, C99 says the prototype in math.h is "long double sinl(long 
> >double);".
> >The #define hack simply won't cut it.
> 
> I only said that it will almost work.  It will give almost
> "long double sinl(double);", since "double sin(double);" gives almost
> "long double sin(double);".

Of course, if a programmer adds the #define to her code, then I
can agree with you.  My point is that the #define is unacceptable
for /usr/include/math.h.  C99 is explicit about the prototype.

> >>In fact, on i386, due
> >>to the bugfeatures outlined in my previous mail, users can even get
> >>full long double accuracy using hacks like that, even if the rounding
> >>precision is 53 bits:
> >
> >(code removed)
> >
> >>On i386, this stores the full long double result for sinl(1) in x, due
> >>to the bugfeatures -- sin(1) returns extra precision in a register, and
> >>gcc doesn't understand the extra precision so it just stores all bits
> >>of the register in x.  The behaviour would be similar if sin(1) were used
> >>in an expression.
> >
> >Given GCC's history, relying on the accidental full long double
> >accuracy because FP registers are 80 bits seems to be folly.  It
> 
> Isn't 20 years of history of doing it like this (slightly wrong) enough?

Probably.  Unfortunately, I'm also aware of the last 15 years of 
GCC history where things can suddenly change :)

> I only said it is a hack.  However, the library can depend on the
> current compiler behaviour since it is tied to the compiler.
> 
> >also isn't clear to me what sin(x) will do with x > DBL_MAX.  Will
> >1e400L be interpreted as +Inf or the long double value 1e400?  Yes,
> >I'm fully aware that such a large argument may not be the greatest
> >idea due to argument reduction.
> 
> sin(x) (and sinl(x) if sinl is #defined as sin) must pass x as a double
> like the prototype says.  Since the i386 ABI requires passing doubles
> on the stack, any extra precision and argument range will be lost.
> In particular, values > DBL_MAX will be converted to +Inf (except I'm
> not sure if rounding down of a value just slightly larger than DBL_MAX
> can give DBL_MAX).
> 
> >>>PS: There is also the minor possibility of giving bde either
> >>>a heartache or death via laughter when he finally sees what I
> >>>have done. :)
> >>
> >>I would be unhappy with any versions that aren't either:
> >>- as identical as possible in algorithm and style to the double precision
> >>  versions, in the same way that the float versions are as identical as
> >>  possible, or
> >>- much better in algorithm and style and efficiency and accuracy than the
> >>  double precision versions.  I don't want to maintain different versions.
> >
> >I'll go with your second option; see comments below and attached
> >code.  BTW, if I add
> 
> So I'm unhappy that the first point hasn't been addressed at all.  This
> gives quality < the current float and double precision versions.

You're moving the goalposts.  You gave me an "either/or" option.  I took
the "or", so the first point is irrelevant.  Well, your observation
on quality is subjective because I'd claim my (almost) KNF commented
code is much better than the fdlibm code.

> >#ifdef sinl
> >#undef sinl
> >#endif
> >#define sinl sin
> >
> >to the top of my sin_test.c code, I get ULPs that exceed 10^9.
> 
> I guess this is from loss of exponent range for args < DBL_EPSILON.
> Gradual underflow (denormals) only reduces this loss a little.  After
> underflow to 0, I think the loss is about 2^<max difference in exponents>
> ~= 2^(16384-1024) ULPs = many GULPs.  Between DBL_EPSILON and DBL_MAX,
> the loss is about 2^<difference in precision> = 2^11.  Between DBL_MAX
> and LDBL_MAX, the loss is infinity ULPs.

I admit that I haven't studied the issue too carefully.

> >Back to sinl, cosl, tanl.  The attached diff contains everything
> >that one needs for these functions on an architecture with 64-bit
> >significand long double.  Note that portions of the diff for Makefile
> >and math.h contain irrelevant changes that are only relevant for
> >my ccosh, sqrtl, and cbrtl changes.  Also note that the Makefile only
> >includes my long double routines when LDBL_PREC == 64.  My testing on
> >FreeBSD-amd64 shows
> >
> >troutmask:sgk[219] ./sin_test 4000000 10 400000000
> >     Number of test loops to run: 10
> >Number of PRN in [0,4.000000e+08] per run: 4000000
> > ULP
> >1.0000 (5)
> >1.0000 (4)
> >1.0000 (6)
> >1.0000 (4)
> >1.0000 (3)
> >1.0000 (7)
> >1.0000 (3)
> >1.0000 (7)
> >1.0000 (5)
> >1.0000 (5)
> >      ---
> >       49
> >
> >The number in parenthesis is the count of values that give a sinl(x)
> >that exceeds 0.5 ULP.  On the reduced range of [0,pi/4] where
> >__kernel_cosl and __kernel_sinl are used, in all my testing no value
> >of x in [0,pi/4] generated a sinl(x) that exceeded 0.5 ULP.  Thus, I
> >suspect (but haven't tried to confirm), the 1.0 ULP observed above is
> >due to the argument reduction.
> 
> The argument reduction seems to begin correctly, but then the extra
> precision which is carefully provided by k_rem_pio2.c is thrown away,
> unlike in the float and double versions which pass it to the __kernel
> functions.  I would expect a loss of up to a full ULP from this, giving
> a maximum error of 1.5 ULPs (0.5 from the __kernel and 1 from the arg
> reduction).

I had __kernel_cosl and __kernel_sinl that retained both a high
and low portion of the reduced arguments, but in the range tested
I did not find any advantage to doing the extra precision arithmetic.
These may have been discarded during my i386 GULP experience, so
I revisited the issue on amd64.

> >My timings on the basic routines using gprof are:
> >
> > %   cumulative   self              self     total
> >time   seconds   seconds    calls  ms/call  ms/call  name
> > 0.6     804.42     5.67 40000000     0.00     0.00  sinl [18]
> > 0.1     934.41     0.74 19998751     0.00     0.00  __kernel_sinl [95]
> > 0.1     935.82     0.66 20001249     0.00     0.00  __kernel_cosl [96]
> >
> >where the system contains a dual-cpu opteron tyan motherboard running
> >at 2.2 GHz.
> 
> That seems to be about 175 nS/call.  Not too bad.  (My current times on
> a 2.2 GHz A64*1 in i386 mode on small args ([-2Pi,2Pi]) are 70 nS for
> the C double version, 43 nS for the asm double version, and 15 nS for
> the C float version).

Well, I've been more concerned with correctness than speed.  I'd
need to read the gprof output again, but I recall that argument
reduction was the major time consumer.

> >...
> >The testing of tanl() is a little more interesting.
> >
> >troutmask:sgk[227] ./tan_test 4000000 10 400000000
> >     Number of test loops to run: 10
> >Number of PRN in [0,4.000000e+08] per run: 4000000
> > ULP
> >1.0000 (24788)
> >1.0000 (24633)
> >1.5000 (24626)
> >1.0000 (24820)
> >1.0000 (24652)
> >1.5000 (24526)
> >1.0000 (24749)
> >1.0000 (24873)
> >1.0000 (24656)
> >1.0000 (24701)
> >      -------
> >       247024
> >
> > 0.5     911.88     5.80 40000000     0.00     0.00  tanl [16]
> > 0.5     933.18     5.01 40000000     0.00     0.00  __kernel_tanl [53]
> >
> >At first glance, the 1.5 ULP and the total count of values of x
> >that generate a result with a ULP that exceeds 0.5 seems disconcerting.
> >However, testing of __kernel_tanl() shows that for all tested values of x
> >in [0,pi/4] the maximum ULP did exceed 0.5.  This suggests that tanl() is
> >more sensitive to errors in the argument reduction than sinl() and cosl().
> 
> This is because tan() has a larger derivative than sin() and cos().
> E.g., tan'(Pi/4) = 2; this doubles any errors in arg reduction near
> Pi/2.  The fdlibm __kernels take extra care near Pi/4 to reduce these
> errors.  Also, between Pi/4 and Pi/2, the division in tan(x) = -1 /
> tan(Pi/2 - x) gives an extra error of up to 0.5 ULPs.  The fdlibm __kernels
> take extra care to reduce this error.  With 3 or 4 sources of errors
> in the final steps, with each error of up to 0.5 ULPs, it takes lots
> of extra care to keep the final error below 1 ULP.

OK.  I'll revisit fdlibm's algorithms to see if I can adapt them
to my code.

> >Do with the code as you see fit.  I'll be moving on to hypotl and
> >other functions as I have time.
> 
> I wouldn't use it directly since it doesn't copy fdlibm to get the final
> steps right.  Its earlier steps are better, but I would prefer those to
> copy fdlibm too, so that there aren't many different algorithms to maintain.
> Note that the full long double precision versions require about the same
> amount of code as the float and double precision versions combined, since
> there are 2 very different long double precision formats on supported
> machines (80 active bits for most, but 128 bits for sparc64).
> 
> % Index: msun/src/k_cosl.c
> % ===================================================================
> % RCS file: msun/src/k_cosl.c
> % diff -N msun/src/k_cosl.c
> % --- /dev/null	1 Jan 1970 00:00:00 -0000
> % +++ msun/src/k_cosl.c	9 Oct 2007 21:52:32 -0000
> % @@ -0,0 +1,62 @@
> % +/*
> % + * Compute cos(x) on the interval [-1:1] with a polynomial approximation.
> 
> Interval is larger than necessary.  This might result in unnecessary
> coefficients.  10 of them seem to be too many for 64 bits of mantissa
> but not enough for 113 bits (sparc64).

Yes, I know.  My original Chebyshev interpolation code assumed a function
f(x) with x in [-1,1], which of course makes things easy.  The code has been
updated to allow x in [a,b].  I plan to update the coefficients for
the restricted range.
> 
> % + * The soefficients are determined from a Chebyshev interpolation scheme,
> 
> Spelling.
> 
> % + * In practice, this routine is called only for x in the interval 
> [0:pi/4].
> % + * Limited testing on pseudorandom numbers drawn within [0:pi/4] shows
> % + * an accuracy of <= 0.5 ULP.
> % + */
> 
> I think testing is neither necessary nor sufficient.  One just finds a
> polynomial that has enough accuracy in infinite precision (the poly
> should have about 8 more bits of precision than the number of mantissa
> bits), then checks that there is no magic in coefficients that causes
> problems (a large second coefficient will cause problems, especially
> if it is not exactly representable.  Here 0.5 is quite large, but is
> exactly representable, and for sin*(), 0.333 is just small enough not
> to cause problems despite it not being exactly representable).
> 
> % +#define C0  (long double)  0x1.000000000000000p0L
> % +#define C1  (long double) -0x8.000000000000000p-4L
> 
> Various style bugs.  The casts have no effect.

Whoops, sorry about that.  This is a holdover from my attempts
at double-double precision arithmetic.  

> 
> % Index: msun/src/k_tanl.c
> % ===================================================================
> % RCS file: msun/src/k_tanl.c
> % diff -N msun/src/k_tanl.c
> % ...
> % +#define C33  (long double)  0x1.8f939ed1883f595p-32L
> 
> This has far too many terms for a 64-bit mantissa, but not enough for a
> 113-bit mantissa.  IIRC, the former requires about 25 terms.

This is partly due to the range being larger than required.

> % +long double
> % +__kernel_tanl(long double x)
> % +{
> % +	long double t, xs;
> % + 
> % +	xs = x * x;
> % +
> % +	t = C0 + xs * (C1 + xs * (C2 + xs * (C3 + xs * (C4 + xs *
> % +	    (C5  + xs * (C6  + xs * (C7  + xs * (C8  + xs * (C9 + xs *
> % +	    (C10 + xs * (C11 + xs * (C12 + xs * (C13 + xs * (C14 + xs *
> % +	    (C15 + xs * (C16 + xs * (C17 + xs * (C18 + xs * (C19 + xs *
> % +	    (C20 + xs * (C21 + xs * (C22 + xs * (C23 + xs * (C24 + xs *
> % +	    (C25 + xs * (C26 + xs * (C27 + xs * (C28 + xs * (C29 + xs *
> % +	    (C30 + xs * (C31 + xs *
> % +	    (C32 + xs * C33))))))))))))))))))))))))))))))));
> 
> With so many terms, better scheduling is more needed.
> 
> Much more extra care is needed for an accuracy of < 1 ULP for tan*() than
> for sin*() and cos*().

Well, I did code both steps 3 and 4 from k_tan.c, but I saw no
statistical improvement in my testing.

> % Index: msun/src/s_cosl.c
> % ...
> % +/*
> % + * Table of constants for 2/pi, 396 Hex digits (476 decimal) of 2/pi.
> % + * The table was taken from e_rem_pio2.c.
> % + */
> % +static const int32_t two_over_pi[] = {
> % +0xA2F983, 0x6E4E44, 0x1529FC, 0x2757D1, 0xF534DD, 0xC0DB62, 
> % +0x95993C, 0x439041, 0xFE5163, 0xABDEBB, 0xC561B7, 0x246E3A,
> 
> This gives far too many copies of this table (instead of just too many).
> The float and double versions avoid having a copy for each of cos/sin/tan
> by doing arg reduction in e_rem_pio2[f].c.  The extra layer for this file
> is a pessimization but seems to be worth it to avoid the duplication.
> 
> Are 396 hex digits enough?  It can't be right for all of the float, double
> and long double cases.  I hope it is enough for 128-bit long doubles, and
> the float case is just sloppy in using the same table as the double case.
> If 396 hex digits is enough for all cases, then the table should be in
> k_rem_pio2.c.

In my testing, 396 hex digits appear to be sufficient for 64-bit 
long doubles.  Without access to 128-bit long double, I won't be
investigating the issue.

> % +long double
> % +cosl(long double x)
> % +{
> % +	union IEEEl2bits z;
> 
> I want to avoid using this union...
>
> % +	z.e = x;
> % +	z.bits.sign = 0;
> 
> ... since, while code like this is easy to write, it hard-codes the
> assumption that things are layed out in bit-fields.  Access macros
> are more flexible and don't require so much compiler intelligence
> to optimize away the memory accesses that are logically required for
> bit-fields.  gcc does a mediocre job with this optimization,
> especially for 80-bit long doubles on amd64 and i386 since the bits
> don't fit evenly in registers.

math_private.h doesn't have GET_* or SET_* macros for use with
64-bit long double (not to mention 128-bit long double), and
the macros in math_private.h use similar unions.  AFAICT, all
supported archs have an _fpmath.h header.

> I see parametrizing the bit-field sizes and other magic numbers (not
> quoted) for various formats of long doubles as the main difficulty
> here (trig functions are easy except for this and k_rem_pio2.c).

Which explains why GET_* and SET_* don't exist.  Although these
might be present in bdeBSD. :)

> % Index: msun/src/s_tanl.c
> 
> Similarly.
> 
> % ...
> +	switch (e0 & 3) {
> + ...
> +	case 1:
> +	case 3:
> +	    z.e = - 1.L / __kernel_tanl(z.e);
> +	    break;
> 
> The fdlibm has to do this division in the __kernel since doing it carefully
> takes too much code to inline here (there are two cases, one for z.e near
> Pi/4 where everything is done differently, and the general case where
> things are simple except for the division).
> 
> Note that the fdlibm __kernel code for trig functions is generic except for:
> 
> - my optimizations in the float case
> - the magic number for the threshold near Pi/4 for tan()
> - the number and values of terms in the polynomial, and the scheduling
>   (best scheduling depends on the number of terms, and we don't have this
>   automated).
> 
> Thus it is easier to clone the correct code than to rewrite it incorrectly.

It may be easier to clone fdlibm, but I'm not sure I'd call it correct
code due to
  - magic numbers
  - questionable source formatting (it's some of the worst code I've read)
  - lack of comments
-- 
Steve


More information about the freebsd-standards mailing list