Complex arg-trig functions

Mon Aug 13 21:40:58 UTC 2012

On 08/13/2012 04:14 PM, Bruce Evans wrote:
> On Mon, 13 Aug 2012, Stephen Montgomery-Smith wrote:

>> The float versions really are much harder than the double and
>> long-double versions.  And it just doesn't seem worth the effort,
>> because who uses them when the double versions are available?
>
> I doubt that they are really harder.

Yes, they are.  In fact, the paper by Hull et al (whose algorithms I 
copied) talked about "bad floating point."  IEEE single precision 
doesn't actually fit their criteria for being bad, but then again I 
didn't exactly follow their algorithm.

> 2. Division tends to be slow, and is slow on x86.  On Phenom, division
>     has a latency of 20 cycles and a throughput of 1 per 15 cycles in
>     scalar double precision (17 in vector[2] double precision).  Maybe
>     the compiler can optimize division by a power of 2 to a multiplication
>     though.

Does the time taken to perform certain floating point operations depend 
upon what the numbers are?  I know that when I do long multiplication by 
hand that (1) it is much faster than long division, and (2) certain 
numbers (like 1000) are very quick to multiply and divide.

Similarly with the computer, is it possible that division by 4 is much 
faster than division by (say) 4.4424242389, and that division by 4 is 
just as fast as multiplication by 0.25?  (And multiplication by 0.25 is 
faster than multiplication than 0.235341212412?)

> So all of those '* 0.5[F]'s can be turned back into '/ 2's (no need to
> depend on the optimization by writing 2.0).

Well, I have already done it.

> This was tracked to a known annoying difference between SSE and i387
> on NaNs.  x+y clears the sign buit on i387 but not on SSE.  SSE is
> correct.

Was it your code that was wrong, or mine?  And if it is mine, what is 
the fix?

> i386 is generally more accurate.  *atan* now below 4 ulps.

I have tested it for very large numbers of random inputs (several 
millions).  It does creep up to about 3.5 ulp.

>
> % rccos: max_er = 0xb7f10439f3111686 24688206287.5958, avg_er =
> 1415.000, #>=1:0.5 = 3881364:4046644

> i387 hardware trig functions are very inaccurate.

But also, what about the problem of when the input is close to one of 
the non-trivial roots of sin, cos, etc?  As a mathematician, I wouldn't 
be shocked if sin(M_PI) was 1e-15 or such like.

> Everything seemed to be passing on sparc64, but it is 300 to 1000 times
> slower on long doubles, so not many tests completed.

Also, as you said earlier, the values of LDBL_MIN and LDBL_EPSILON will 
be different.

>
> Bruce