Complex arg-trig functions

Tue Sep 18 06:19:26 UTC 2012

On Mon, 17 Sep 2012, Stephen Montgomery-Smith wrote:

> OK, I am struggling a bit with the latest suggestions.
>
> First, I have completely removed all the code related to when |z| is small. 
> I have just lost it all.  So I didn't perform any changes related to that 
> code.  If you want me to put it back with appropriate "#if 0", can you email 
> those code segments back to me?

I have not :-).  It is also quoted in the mail archives.  Will sent it in
my next patch.

> On 09/17/2012 12:07 PM, Bruce Evans wrote:
>
>> @ @@ -321,30 +334,28 @@
>> @      }
>> @ @ -    if (isinf(x) || isinf(y))
>> @ -        return (cpackf(copysignf(0, x), copysignf(m_pi_2 + tiny, y)));
>> @ +    /* Raise inexact unless z == 0; return for z == 0 as a side
>> effect. */
>> @ +    if ((x == 0 && y == 0) || (int)(1 + tiny) != 1)
>> @ +        return (z);
>
> I'm not too sure where this code is meant to be.  It looks like it should be 
> part of testing |z| small, but it seems to be placed where |z| is large. 
> When |z| is large, z=0 will never happen.

As its comment says, this raises inexact [up front] unless z == 0, and
[so that the test is not optimized away] returns [z] for z == 0 as a side
effect.

This is for any z that has not been previously classified (mainly ones
with NaNs).  Its operation is:
- z == 0: find x == 0 and y == 0 and return z
- z != 0: find !(x == 0 and y == 0); evaluate (int)(1 + tiny) != 1 and
   find it to be false while raising inexact; don't return z, but continue
   with inexact set.

>> cacos*() and casin*() should benefit even more from an up-front raising
>> of inexact, since do_hard_work() has 7 magic statements to raise inexact
>> where sum_squares has only 1.
>
> Where is the code that raises inexact up-front?

As quoted above.

Later I tried removing all the 7 magic statements in do_hard_work(), without
adding code like the above.  This made very little difference.  OTOH, the
above code costs a cycle or 2, and removing the additions in all magic
expressions (m_pi_2 + tiny) gave a small improvement.  I think I can
explain this, and it shows that we should be using fenv (optimized) and
not "optimizing" using magic constext-sensitive expressions.  The point
is that the code that sets inexact can run in parallel so that the main
path can run faster because it doesn't involve an operation like
(m_pi_2 + tiny).

Good ways for raising exceptions:

FE_INEXACT:
Your if '((int)(1 + tiny) == 1) return (foo);' works well.  This depends
on the branch being predictable.  But returning is inconvenient.  I
hope if '((int)(1 + tiny) == 1) volatile_variable = 0;' works similarly.
This could be in feraiseexcept(FE_INEXACT) or in a more primitive
raise_inexact() (the latter is less verbose and easier to optimize).
Then if you actually want to return, the code would be something like
{ raise_inexact(); return (m_pi_2); }.  Better, the branch can be
avoided using something like `volatile_variable = (int)(1 + tiny);'.
Better still, write this in asm and just do `(int)0.5;' (use asm only
to avoid the optimizer removing this).  Possibly better still, use a
purer FP operation since conversion to int can be slow.

In the above, we don't really want a special case for z == 0; we need
branches to classify this case but should skip the return since returns
use branch resources too.  The code becomes:

 	if (x != 0 || y != 0)
 		raise_inexact();	/* No comment. */
#if !THIS_CODE_INTENTIONALLY_LEFT_OUT
 	else
 		return (z);		/* No comment. */
#endif

FE_OVERFLOW:
Instead of evaluating huge*huge and returning it, use something like
`volatile_variable = huge*huge; return (INFINITY);'.  This is more
natural than the above, so it takes at most 1 more instruction
(assignment to variable with no dependents) and thus loses little
even without parallelism.  The version written in asm can also avoid
the assignment (just evaluate huge*huge) and lose nothing.

FE_UNDERFLOW:
Instead of evaluating tiny*tiny and returning it, use something like
`volatile_variable = tiny*tiny; return (0);'.  I hope there is a
variation on this that raises underflow at full speed (underflowing
cases are very slow on core2 although not on Athlon64; hopefully they
are not so slow if the result of tiny*tiny is not used).

The last 2 raisings will also fix the i386 bug that huge*huge and
tiny*tiny don't actually raise overflow or underflow or return infinity
or 0, since they are evaluated in extra exponent range.  It takes
conversion to double or float to trigger the exception and to give
the correct value.  When we try to raise exceptions in a parallel
code path, we are hoping for related asynchronicities in the setting
of the exception flags so that the usual case where the exception
flags are not tested soon proceeds at full speed.  It is unclear
how compilers and CPUs produce the ordering of operations required
by the abstract C machines -- I think a strict interpretation of
`volatile' would require synchronizing everything for every access
to a volatile variable, but that would be too slow and I've never
seen compilers doing much synchronization.

>> @      } else
>> @          rx = log1pf(4*ax / sum_squares(ax-1, ay)) / 4;
>> @ @ -    if (ax == 1) {
>> @ -        if (ay==0)
>> @ -            ry = 0;
>> @ -        else
>> @ -            ry = atan2f(2, -ay) / 2;
>> @ -    } else if (ay < FOUR_SQRT_MIN) {
>> @ -        if ((int)ay==0)
>> @ -            ry = atan2f(2*ay, (1-ax)*(1+ax)) / 2;
>> @ -    } else
>> @ +    if (ax == 1)
>> @ +        ry = atan2f(2, ay) / 2;
>> @ +    else if (ay < FOUR_SQRT_MIN)
>> @ +        ry = atan2f(2*ay, (1-ax)*(1+ax)) / 2;
>> @ +    else
>> @          ry = atan2f(2*ay, (1-ax)*(1+ax) - ay*ay) / 2;
>> @
>> 
>> Remove the special case for (ax == 1, ay == 0).  The general case gives
>> the same result.
>
> I don't think your code works.  It should be ry = atan2f(2, -ay) / 2, not ry 
> = atan2f(2, ay) / 2.

Only logically.  As I explained, the negation makes no difference to the
result, but of course takes longer, so I removed it.

> In your tests, you should include cases where x or y is equal or close to 1. 
> These are important special cases that I think your test code is very 
> unlikely to hit.  These are difficult edge cases for all the arc-trig 
> functions.

Hmm, I only did this carefully for clog().  I happen to have been testing
lots of cases ctanh(1 + tiny, tiny') where tiny* is really tiny (denormal)
with either sign, but not so many cases ctanh(1, tiny') and no (?) cases
of ctanh(1 + tiny, +-0).

>> Remove negation of ay for ax == 1.  The sign will be copied into the result
>> later for all cases, so it doesn't matter in the arg.  I didn't check the
>> branch cut details for this, but runtime tests passed.
>
> See above.

I might have missed this.  But if the sign matters, why do you set ry = +0
for catanh on both sides of 1 + I*(+-0)?

Bruce