[PATCH] hypotl, cabsl, and code removal in cabs

Bruce Evans brde at optusnet.com.au
Mon Nov 5 04:35:23 PST 2007


On Sat, 3 Nov 2007, Steve Kargl wrote:

> On Sat, Nov 03, 2007 at 04:15:27PM +1100, Bruce Evans wrote:
>>
>> On Fri, 12 Oct 2007, Steve Kargl wrote:
>>
>>> The attached patch implements hypotl(), cabsl(), and removes
>>> z_abs() from w_cabs.c.  z_abs() is undocumented namespace
>>> pollution in libm, and no header file under /usr/include
>>> provides a prototype.  The documentation has been updated,
>>> and in particular an incorrect paragraph in cabs.3 has been
>>> removed.
>>
>> I fear that that paragraph is correct again, since the long double
>> version seems to take enough care only with the overflow cases.

I verified that the > 1.2 ULP error is back, using the same algorithm
for a float-precision version.  Bugs were also easier to find in
lower precision.

> Ah, no.  Underflow is prevented by the if() statement.
>
> 		a.e *= a.e;
> 		m -= n;
> 		if (m > - PREC) {
> 			b.bits.exp += m;
> 			a.e += b.e * b.e;
> 		}
> 		a.e = sqrtl(a.e);
> 		a.bits.exp += n;
> 		return (a.e);
>
> If m < -PREC, then |b|/|a| < 2**(-32), so it skips the "a.e += b.e * b.e;"
> "a.e. *= a.e" can't underflow because a.e is in [0.5,1), and
> "a.e = sqrtl(a.e);" can't underflow because, here, a.e is in [0.25,2).

The problem has nothing to do with underflow (I don't even see underflow
as a problem; avoiding underflow is just an optimization).  The
problem is to limit roundoff error in intermediate calculations so
that the final error is < 1 ULP.  More below.

BTW, PREC = 32 seems a little low.  I would have expected PREC = 33
to be needed to give 2 guard bits.  However, 32 for 64-bit long double
precision corresponds to 12 for 24-bit float precision, but even 6
seems to work for float precision.

>> % Index: man/hypot.3
>> % ...
>> % @@ -82,11 +90,6 @@ Consequently
>> %  exactly;
>> %  in general, hypot and cabs return an integer whenever an
>> %  integer might be expected.
>> % -.Pp
>> % -The same cannot be said for the shorter and faster version of hypot
>> % -and cabs that is provided in the comments in cabs.c; its error can
>> % -exceed 1.2
>> % -.Em ulps .
>
> This paragraph is useless.  There is no cabs.c in the source tree,
> and in particular, there is no cabs.c in an installed system.  Surely,
> this paragraph isn't referring to src/contrib/libf2c/libF77/cabs.c.
> It has one comment with a total of one word.

cabs.c is only in cvs history (libm/ieee/cabs.c).  It is instructive
since the ifdefed-out version (not exactly the commented-out version)
gets the 1.2 ULP error in much the same way as your version, while the
primary version (by Ng) has to work hard to limit the error to 0.959
ULPs.  The version in fdlibm works slightly harder to reduce the error
a bit more.

Here is the main part of the ifdefed-out version:

% 	if(finite(x))
% 	    if(finite(y))
% 	    {
% 		x=copysign(x,one);
% 		y=copysign(y,one);
% 		if(y > x)
% 		    { temp=x; x=y; y=temp; }
% 		if(x == zero) return(zero);
% 		if(y == zero) return(x);
% 		exp= logb(x);
% 		x=scalb(x,-exp);
% 		if(exp-(int)logb(y) > ibig )
% 			/* raise inexact flag and return |x| */
% 		   { one+small; return(scalb(x,exp)); }
% 		else y=scalb(y,-exp);
% 		return(scalb(sqrt(x*x+y*y),exp));
% 	    }

It does the necessary range reduction, which takes the bulk of the code,
but then just returns sqrt(x*x+y*y) (scaled).  This expression gives a
max error of 1.2+ ULPs for almost all implementations of sqrt() and
FP arithmetic.

>> It's silly that on i386's, naive double precision code would work if
>> the default precision were extended, but the double precision code
>> isn't naive; OTOH, long double precision code can never depend on extra
>> precision, but is naive.
>
> I don't understand to what the above comment is directed.

With extra precision in hardware, the naive expression sqrt(x*x+y*y)
has an error of < 1 ULP, but you propose to use the naive expression
in precisely the case where there is no extra precision in hardware.
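
To make the first half concrete, here is a minimal sketch (hypothetical
name, not from any patch) of the float analogue: with double
intermediates, x*x and y*y are exact, and even the range reduction
becomes unnecessary.

% #include <math.h>
%
% /*
%  * Sketch only: float hypot via double arithmetic.  dx*dx and dy*dy
%  * are exact (24-bit operands, at most 48-bit results), the sum and
%  * sqrt() each round in double (worth about 2**-29 float ULPs), and
%  * the final conversion to float rounds once more, for a total just
%  * over 0.5 float ULPs.  No range reduction is needed: FLT_MAX
%  * squared is about 2**256, well inside double's range.
%  */
% static float
% hypotf_via_double(float x, float y)
% {
% 	double dx = x, dy = y;
%
% 	return ((float)sqrt(dx * dx + dy * dy));
% }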

>> % Index: src/e_hypotl.c
>> % ...
>> % +	if (n >= m) {
>> % +		a.e *= a.e;
>> % +		m -= n;
>> % +		if (m > - PREC) {
>> % +			b.bits.exp += m;
>> % +			a.e += b.e * b.e;
>> % +		}
>> % +		a.e = sqrtl(a.e);
>> % +		a.bits.exp += n;
>> % +		return (a.e);
>>
>> The usual case reduces to just sqrtl(x*x + y*y).
>
> See my comment above.  The usual case isn't the important case.
> It's the unusual case that is of concern.  The above code handles
> the unusual case, yet provides acceptable performance in the usual case.

Well, even the ifdefed-out version handled the unusual cases.

>> fdlibm is more careful in this case (but similar in others):
>>
>> %  *	So, compute sqrt(x*x+y*y) with some care as
>> %  *	follows to get the error below 1 ulp:
>> %  *
>> %  *	Assume x>y>0;
>> %  *	(if possible, set rounding to round-to-nearest)
>> %  *	1. if x > 2y  use
>> %  *		x1*x1+(y*y+(x2*(x+x1))) for x*x+y*y
>> %  *	where x1 = x with lower 32 bits cleared, x2 = x-x1; else
>> %  *	2. if x <= 2y use
>> %  *		t1*y1+((x-y)*(x-y)+(t1*y2+t2*y))
>> %  *	where t1 = 2x with lower 32 bits cleared, t2 = 2x-t1,
>> %  *	y1= y with lower 32 bits chopped, y2 = y-y1.
>>
>> I don't understand the details of this, and couldn't find any cases
>> where this makes a difference.  Testing of 2-arg functions is hard
>> since exhaustive testing is impossible even for floats, and the
>> cases where this matters for hypot() seem to be sparse.
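
(For reference, the splitting identity in step 1 is exact.  With
x = x1 + x2 split exactly:

 	x*x = x1*x1 + x2*(x + x1)

and x1*x1 is itself exact, since x1 has at most 21 significant bits,
so all of the roundoff in computing x*x is pushed into the small
correction term, which is added to y*y first, where it cannot be
lost.)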
>
> The above seems more complicated than needed when one considers
> that x and y can be easily split into |x|=a*2**n and |y|=b*2**m
> where a,b are in [0.5,1).   One then has the comment that is
> found in the code:

Now I understand the details.  I left out the most important part of
the comment:

%  * Method : 
%  *	If (assume round-to-nearest) z=x*x+y*y 
%  *	has error less than sqrt(2)/2 ulp, then 
%  *	sqrt(z) has error less than 1 ulp (exercise).
%  *

Do this exercise.  Hints:
o reduce to x in [1, 4) and sqrt(x) in [1, 2)
o all estimates can be made in infinite precision
o the relative and absolute sizes of 1 ULP depend on x and sqrt(x)
o the worst case is x slightly larger than 2
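
Here is one way to do it (a sketch, not fdlibm's proof; p denotes the
precision in bits):

 	\text{Reduce to } z \in [1,4),\ \sqrt{z} \in [1,2).
 	\text{By the mean value theorem, } |\Delta\sqrt{z}| =
 	    \frac{|\Delta z|}{2\sqrt{\xi}},\ \xi \text{ between } z
 	    \text{ and } z + \Delta z.
 	\text{Worst case } z = 2^{+}:\ \mathrm{ulp}(z) = 2^{2-p},\
 	    \mathrm{ulp}(\sqrt{z}) = 2^{1-p},\ \xi \approx 2, \text{ so}
 	|\Delta\sqrt{z}| < \frac{(\sqrt{2}/2) \cdot 2^{2-p}}{2\sqrt{2}}
 	    = 2^{-p} = \tfrac{1}{2}\,\mathrm{ulp}(\sqrt{z}).

The final rounding of sqrtl() adds at most another 0.5 ULPs, for a
total < 1 ULP; for z in [1,2), ulp(z) is half as large while the
derivative bound at most doubles, so z slightly above 2 is indeed the
worst case.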

If we compute x*x+y*y naively, then we should expect a max error of
1.5 ULPs (0.5 for each operation).  The actual max error seems to
be 1.0 ULPs (I don't understand why yet).  Even 1 ULP is too much,
since it is larger than the sqrt(2)/2 allowed by the exercise.
The final error is:

 	0.5 + N * (sqrt(2)/2)

where 0.5 ULPs of error result from rounding of sqrt(x*x+y*y), and N ULPs
of error in x*x+y*y contribute less than N ULPs of error to the final
result (according to the scale factor of sqrt(2)/2, provided N is
not very large -- note that linear scaling here and in the exercise
gives an upper bound for the error, essentially via the linear term
in a Taylor approximation -- we can simply drop the higher terms without
breaking the estimate since sqrt() is concave).

N = 1 gives the magic number 1.2:

 	0.5 + 1 * (sqrt(2)/2) ~= 0.5 + 0.7071... = 1.2071...

I observed an error of 1.202... for float precision in practice.
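
A brute-force test along these lines finds it (a sketch only; it
assumes FLT_EVAL_METHOD == 0, since extra precision in the float
expression would hide the error):

% #include <math.h>
% #include <stdio.h>
% #include <stdlib.h>
%
% /*
%  * Sketch of a brute-force error test: compare the naive float
%  * expression against hypot() in double, which is effectively exact
%  * at float precision.  x and y are drawn from [1, 2) so that
%  * x*x+y*y covers the worst-case region just above 2.
%  */
% int
% main(void)
% {
% 	double err, maxerr, ref, ulp;
% 	float naive, x, y;
% 	int i;
%
% 	maxerr = 0;
% 	for (i = 0; i < 10000000; i++) {
% 		x = 1 + (float)(random() & 0x7fffff) * 0x1p-23F;
% 		y = 1 + (float)(random() & 0x7fffff) * 0x1p-23F;
% 		naive = sqrtf(x * x + y * y);
% 		ref = hypot((double)x, (double)y);
% 		ulp = ldexp(1.0, ilogb(ref) - 23);	/* 1 float ULP */
% 		err = fabs((double)naive - ref) / ulp;
% 		if (err > maxerr) {
% 			maxerr = err;
% 			printf("%.4f ULPs at x = %a, y = %a\n",
% 			    maxerr, x, y);
% 		}
% 	}
% 	return (0);
% }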

The cases with a large error are sparse, so it would be hard to find
them by testing in long double precision.  For a final error of > 1
ULP, x and y must be such that:
- x*x+y*y has an error of > sqrt(2)/2 ULPs
- x*x+y*y must be fairly near the problem value of 2+ (else the relative
   size of an ULP shrinks in the domain and/or range, so sqrt() compresses
   errors more and there is no problem)
- x*x+y*y must be very near a half-way case for sqrt().

This analysis also explains why missing guard bits for x*x+y*y don't
seem to be a problem.  We only ignore y*y (or x*x) if this term is <=
2**-(2*PREC) relative to the other term.  Then we may have an error
of more than 0.5 ULPs for the addition, but we only have a tiny error
for the ignored term so the combined error in x*x+y*y is < 1 ULP,
which is no worse than other worst cases.

I found only one serious bug in your hypotl(): adding n or m back to the
exponent gives garbage if the resulting exponent doesn't fit.  Fix for
the float precision case:

% 	if (n >= m) {
% 		a.e *= a.e;
% 		m -= n;
% 		if (m > - PREC) {
% 			b.bits.exp += m;
% 			a.e += b.e * b.e;
% 		}
% 		a.e = sqrtf(a.e);
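% 		/* if adding n over/underflows the exponent field, scale in halves: */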
% 		if (a.bits.exp + n <= 0 || a.bits.exp + n >= 0xff) {
% 			a.bits.exp += n - n / 2;
% 			SET_FLOAT_WORD(t, 0x3f800000 + ((n / 2) << 23));
% 			a.e *= t;
% 		} else
% 			a.bits.exp += n;
% 		return (a.e);
% 	} else {
% 		b.e *= b.e;
% 		n -= m;
% 		if (n > - PREC) {
% 			a.bits.exp += n;
% 			b.e += a.e * a.e;
% 		}
% 		b.e = sqrtf(b.e);
% 		if (b.bits.exp + m <= 0 || b.bits.exp + m >= 0xff) {
% 			b.bits.exp += m - m / 2;
% 			SET_FLOAT_WORD(t, 0x3f800000 + ((m / 2) << 23));
% 			b.e *= t;
% 		} else
% 			b.bits.exp += m;
% 		return (b.e);
% 	}

Simpler and slower code can use a multiplier of powf(2.0F, n), etc.
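
For comparison, here is a self-contained sketch of the whole float
version in that simpler style (hypothetical code, not the committed
version; frexpf() and scalbnf() are the standard spellings of the
decomposition and of multiplication by 2**n, and scalbnf() rounds at
most once, so the halving of n is not needed):

% #include <math.h>
%
% #define	PREC	12		/* float analogue of 32; see above */
%
% /*
%  * Sketch only: the same decomposition |x| = a*2**n, |y| = b*2**m
%  * with a, b in [0.5, 1), written with frexpf()/scalbnf() instead
%  * of direct exponent patching.  Special cases (Inf, NaN) are
%  * omitted.
%  */
% static float
% hypotf_sketch(float x, float y)
% {
% 	float a, b, t;
% 	int k, m, n;
%
% 	a = frexpf(fabsf(x), &n);
% 	b = frexpf(fabsf(y), &m);
% 	if (n < m) {			/* swap so that n >= m */
% 		t = a; a = b; b = t;
% 		k = n; n = m; m = k;
% 	}
% 	a *= a;
% 	m -= n;				/* exponent difference, <= 0 */
% 	if (m > -PREC) {		/* else b*b cannot affect a*a */
% 		b = scalbnf(b, m);	/* exact: b stays normal */
% 		a += b * b;
% 	}
% 	return (scalbnf(sqrtf(a), n));	/* handles any n, one rounding */
% }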

fdlibm hypot() and hypotf() use dynamically constructed multipliers
like the SET_FLOAT_WORD()s in the above, but require less code than
the above for various reasons:
- a and b are swapped to make a >= b so as to reduce duplication
- scaling is different.  In the above, n is decomposed into halves since
   the full n sometimes doesn't fit in the exponent either.  fdlibm
   always ends up with a medium-sized n (spelled k) so that rescaling
   is easier.  The different scaling may be responsible for fdlibm
   being significantly faster (though not in the above -- the no-fit
   case is rare).

Bruce

