Complex arg-trig functions

Mon Aug 13 21:14:33 UTC 2012

On Mon, 13 Aug 2012, Stephen Montgomery-Smith wrote:

> On 08/13/2012 11:57 AM, Bruce Evans wrote:
>
>> I finally tested a version of this.  I only did simple comparisons (float
>> vs double and double vs long double).  The results look promising after
>> fixing a few bugs:
>
> Thank you very much for doing the testing, and for fixing the bugs.
>
>> 
>> % amd64 float prec, on 2**12 * 2**12 args:
>
>> % icacosh:max_er = 0x3690000000 436.5000, avg_er = 0.317, #>=1:0.5 =
>> 29104:255732
>
>> There are negative reasons to have the float versions unless they are not
>> wrappers.  The reasons to have non-wrappers are to test the algorithm and
>> run faster.
>
> That large max-err for the imaginary part of icacosh for float bothers me. 
> It means that I haven't thought it through properly.  Could you send me the 
> input values that created this error(s)?

It was just a bug in my test program, for scaling denormals.  Some cases
have an infinite-precision result that is very close to the smallest
denormal.  When this is rounded imperfectly to 0, implementation details
give a special case.  I had just fixed this for +0, but forget to mask
out the sign bit for testing -0.  Somehow this didn't turn up for other
functions.

> The float versions really are much harder than the double and long-double 
> versions.  And it just doesn't seem worth the effort, because who uses them 
> when the double versions are available?

I doubt that they are really harder.

> In my case, not for speed.  Because on my machine the float versions are 
> slightly slower than the double version.

The double constants might do that on amd64.  'r *= 0.5F;' would compile
to a single multiplication, where r and 0.5F are already in registers.
'r *= 0.5;' would compile to: convert r to double; multiply; convert
back to float.  Conversions are slower than multiplications.  On Phenom,
double <-> float conversion has a latency of 7 (?) cycles, while
multiplication has a latency of 4 cycles; both have a throughput of
1 instruction/cycle.  Sometimes latencies can be hidden, but here there
are 3 dependent operations.

> Also, you made the comment that in the float version, all the 0.5 should 
> become 0.5F.  Two questions:
> 1.  Doesn't the compiler do this conversion for me?
> 2.  What is wrong with using x/2 instead of 0.5*x?  You told me in a far 
> earlier email to use 0.5*x.  (Similarly in one place I have a 0.25.)

1. It can't, since x * 0.5F is quite different from x * 0.5 when x is
    float and float expressions are evaluated in float precision.  The
    former is a float expression, and for example overflows when x =
    FLT_MAX.  The latter is a double expression, so it doesn't overflow
    when x = FLT_MAX.  Maybe the compiler could optimize it to a float
    expression if its result is immediately assigned to a float variable
    or cast to float, but this is not a very easy optimization -- the
    compiler would have to prove that it has the right side effects for
    all possible values of x.  The side effect of overflow happens on
    assignment.

2. Division tends to be slow, and is slow on x86.  On Phenom, division
    has a latency of 20 cycles and a throughput of 1 per 15 cycles in
    scalar double precision (17 in vector[2] double precision).  Maybe
    the compiler can optimize division by a power of 2 to a multiplication
    though.

In fact, gcc on x86 does both of these optimizations.  With float x,
and double y, in a function returning float, 'return x/2.0;' gets
turned into float multiplication by 0.5F, but 'y = x/2.0; return y;'
only have the division optimization (it converts x to double, multiplies
doubles, stores to y, and converts to float).

So all of those '* 0.5[F]'s can be turned back into '/ 2's (no need to
depend on the optimization by writing 2.0).

Optimization of [long] double constants to float (or double) constants
is more routine.  When you write 'static const long double one = 1;',
gcc normally puts a long double at the address of `one' (this is visible
to debuggers), but never actually uses it it uses either fld1 on i387
or an unnamed float constant with the same value.  At least on i387;
loading a float constant automatically extends it and is much faster
than loading an (already-extended) long double constant; however, on
amd64 a float constant would have to be converted to double for use
in double expressions.

I added old complex functions to the simple test run.  See the attachment
for the full list.

% amd64 float prec, on 2**12 x 2**12 args:
% % ...
% rctan: max_er = 0x9708bb4b 4.7198, avg_er = 0.104, #>=1:0.5 = 177976:980356
% rctanh:max_er = 0x9c7f786c 4.8906, avg_er = 0.159, #>=1:0.5 = 635992:2189748
% ...
% ictan: max_er = 0x9c7f786c 4.8906, avg_er = 0.159, #>=1:0.5 = 635992:2189748
% ictanh:max_er = 0x9708bb4b 4.7198, avg_er = 0.104, #>=1:0.5 = 177976:980356

All (amd64 float prec) below 4 ulps except these.  Have to check my
denormal scaling again.

% amd64 double prec, on 2**12 x 2**12 args:
% ...
% rcatan:max_er = 0x8000000000000000 4503599627370496.0000, avg_er = 268435456.212, #>=1:0.5 = 4681:81365
% rcatanh:max_er = 0x8000000000000000 4503599627370496.0000, avg_er = 268435456.047, #>=1:0.5 = 428997:691341
% ...
% icatan:max_er = 0x8000000000000000 4503599627370496.0000, avg_er = 268435456.047, #>=1:0.5 = 428997:691341
% icatanh:max_er = 0x8000000000000000 4503599627370496.0000, avg_er = 268435456.212, #>=1:0.5 = 4681:81365
% ...
% icsqrt:max_er = 0x8000000000000000 4503599627370496.0000, avg_er = 268435456.476, #>=1:0.5 = 14279:7662849

This was tracked to a known annoying difference between SSE and i387
on NaNs.  x+y clears the sign buit on i387 but not on SSE.  SSE is
correct.

% i386 float prec, on 2**12 x 2**12 args:
% ...
% rcatan:max_er = 0x3c4078ec 1.8829, avg_er = 0.284, #>=1:0.5 = 13260:190836
% rcatanh:max_er = 0x2cef3171 1.4042, avg_er = 0.167, #>=1:0.5 = 4096:420916

i386 is generally more accurate.  *atan* now below 4 ulps.

% rccos: max_er = 0xb7f10439f3111686 24688206287.5958, avg_er = 1415.000, #>=1:0.5 = 3881364:4046644
% rccosh:max_er = 0xb7f10439f3111686 24688206287.5958, avg_er = 1415.000, #>=1:0.5 = 3881364:4046644
% rcexp: max_er = 0xb7e10439f3111686 24679817679.5958, avg_er = 312.908, #>=1:0.5 = 3884684:4078258
% rcsin: max_er = 0xb80bf11b402ee934 24702322906.0057, avg_er = 807.518, #>=1:0.5 = 3841652:4115524
% rcsinh:max_er = 0xb7f10439f3111686 24688206287.5958, avg_er = 1693.508, #>=1:0.5 = 3880876:4368000
% rctan: max_er = 0xb68d1b1c5151fcc0 24501606626.5413, avg_er = 1627.536, #>=1:0.5 = 3890932:4214696
% rctanh:max_er = 0x125ead9c2ea875d 154097358.0911, avg_er = 1416.635, #>=1:0.5 = 2897540:3523200
% ...
% iccos: max_er = 0xb80bf11b402ee934 24702322906.0057, avg_er = 1477.124, #>=1:0.5 = 3824832:4420052
% iccosh:max_er = 0xb80bf11b402ee934 24702322906.0057, avg_er = 1477.124, #>=1:0.5 = 3824832:4420052
% icexp: max_er = 0xb7fbf11b402ee934 24693934298.0057, avg_er = 1736.298, #>=1:0.5 = 3839256:4219864
% icsin: max_er = 0xb7f10439f3111686 24688206287.5958, avg_er = 1693.508, #>=1:0.5 = 3880876:4368000
% icsinh:max_er = 0xb80bf11b402ee934 24702322906.0057, avg_er = 807.518, #>=1:0.5 = 3841652:4115524
% ictan: max_er = 0x125ead9c2ea875d 154097358.0911, avg_er = 1416.635, #>=1:0.5 = 2897540:3523200
% ictanh:max_er = 0xb68d1b1c5151fcc0 24501606626.5413, avg_er = 1632.536, #>=1:0.5 = 3890988:4214752

i387 hardware trig functions are very inaccurate.

% sparc64 double prec, on 2**12 x 2**12 args:
% rcacos:max_er =     0x36b5 3.4192, avg_er = 0.228, #>=1:0.5 = 2394:125984
% rcacosh:max_er =     0x1f15 1.9426, avg_er = 0.258, #>=1:0.5 = 8464:2737832
% rcasin:max_er =     0x2b8c 2.7217, avg_er = 0.113, #>=1:0.5 = 33296:99148
% rcasinh:max_er =     0x1f15 1.9426, avg_er = 0.258, #>=1:0.5 = 8464:2737800
% rcatan:max_er =     0x2c46 2.7671, avg_er = 0.212, #>=1:0.5 = 4680:81364
% rcatanh:max_er =     0x2970 2.5898, avg_er = 0.047, #>=1:0.5 = 129452:691356
% rclog: max_er =      0xe09 0.8772, avg_er = 0.250, #>=1:0.5 = 0:18348

Everything seemed to be passing on sparc64, but it is 300 to 1000 times
slower on long doubles, so not many tests completed.

Bruce
-------------- next part --------------
amd64 float prec, on 2**12 x 2**12 args:
rcacos:max_er = 0x58460841 2.7585, avg_er = 0.317, #>=1:0.5 = 29084:255712
rcacosh:max_er = 0x5e1e45e6 2.9412, avg_er = 0.262, #>=1:0.5 = 85868:3413684
rcasin:max_er = 0x631b8183 3.0971, avg_er = 0.209, #>=1:0.5 = 38388:382508
rcasinh:max_er = 0x5e1e45e6 2.9412, avg_er = 0.262, #>=1:0.5 = 85868:3413684
rcatan:max_er = 0x51d7c47a 2.5576, avg_er = 0.290, #>=1:0.5 = 52984:318084
rcatanh:max_er = 0x73424d52 3.6018, avg_er = 0.205, #>=1:0.5 = 212424:1437580
rclog: max_er = 0x26dfae4d 1.2148, avg_er = 0.247, #>=1:0.5 = 184:92244
rcsqrt:max_er =  0xffffaf0 0.5000, avg_er = 0.292, #>=1:0.5 = 0:0
rccos: max_er = 0x533351ba 2.6000, avg_er = 0.088, #>=1:0.5 = 58932:312088
rccosh:max_er = 0x533351ba 2.6000, avg_er = 0.088, #>=1:0.5 = 58932:312088
rcexp: max_er = 0x42bf37e5 2.0858, avg_er = 0.096, #>=1:0.5 = 59076:443904
rcsin: max_er = 0x5220e04b 2.5665, avg_er = 0.095, #>=1:0.5 = 84564:468480
rcsinh:max_er = 0x45badce8 2.1791, avg_er = 0.107, #>=1:0.5 = 64220:1208464
rctan: max_er = 0x9708bb4b 4.7198, avg_er = 0.104, #>=1:0.5 = 177976:980356
rctanh:max_er = 0x9c7f786c 4.8906, avg_er = 0.159, #>=1:0.5 = 635992:2189748
icacos:max_er = 0x5e1e45e6 2.9412, avg_er = 0.262, #>=1:0.5 = 85868:3413684
icacosh:max_er = 0x58460841 2.7585, avg_er = 0.317, #>=1:0.5 = 29084:255712
icasin:max_er = 0x5e1e45e6 2.9412, avg_er = 0.262, #>=1:0.5 = 85868:3413684
icasinh:max_er = 0x631b8183 3.0971, avg_er = 0.209, #>=1:0.5 = 38388:382508
icatan:max_er = 0x73424d52 3.6018, avg_er = 0.205, #>=1:0.5 = 212424:1437580
icatanh:max_er = 0x51d7c47a 2.5576, avg_er = 0.290, #>=1:0.5 = 52984:318084
iclog: max_er = 0x1fc2b4f5 0.9925, avg_er = 0.302, #>=1:0.5 = 0:349830
icsqrt:max_er =  0xffffaf0 0.5000, avg_er = 0.292, #>=1:0.5 = 0:0
iccos: max_er = 0x46d56e6b 2.2136, avg_er = 0.139, #>=1:0.5 = 83600:1357204
iccosh:max_er = 0x46d56e6b 2.2136, avg_er = 0.139, #>=1:0.5 = 83600:1357204
icexp: max_er = 0x3fd1e8de 1.9944, avg_er = 0.104, #>=1:0.5 = 80738:650660
icsin: max_er = 0x45badce8 2.1791, avg_er = 0.107, #>=1:0.5 = 64220:1208464
icsinh:max_er = 0x5220e04b 2.5665, avg_er = 0.095, #>=1:0.5 = 84564:468480
ictan: max_er = 0x9c7f786c 4.8906, avg_er = 0.159, #>=1:0.5 = 635992:2189748
ictanh:max_er = 0x9708bb4b 4.7198, avg_er = 0.104, #>=1:0.5 = 177976:980356

amd64 double prec, on 2**12 x 2**12 args:
rcacos:max_er =     0x1b5a 3.4189, avg_er = 0.228, #>=1:0.5 = 2394:125988
rcacosh:max_er =      0xf7d 1.9360, avg_er = 0.257, #>=1:0.5 = 612:2741860
rcasin:max_er =     0x15c5 2.7212, avg_er = 0.113, #>=1:0.5 = 33296:99152
rcasinh:max_er =      0xf7d 1.9360, avg_er = 0.257, #>=1:0.5 = 612:2741796
rcatan:max_er = 0x8000000000000000 4503599627370496.0000, avg_er = 268435456.212, #>=1:0.5 = 4681:81365
rcatanh:max_er = 0x8000000000000000 4503599627370496.0000, avg_er = 268435456.047, #>=1:0.5 = 428997:691341
rclog: max_er =      0x704 0.8770, avg_er = 0.250, #>=1:0.5 = 0:20152
rcsqrt:max_er =    0x11594 34.6973, avg_er = 0.476, #>=1:0.5 = 12232:7660802
icacos:max_er =      0xf7d 1.9360, avg_er = 0.257, #>=1:0.5 = 612:2741860
icacosh:max_er =     0x1b5a 3.4189, avg_er = 0.228, #>=1:0.5 = 2394:125988
icasin:max_er =      0xf7d 1.9360, avg_er = 0.257, #>=1:0.5 = 612:2741796
icasinh:max_er =     0x15c5 2.7212, avg_er = 0.113, #>=1:0.5 = 33296:99152
icatan:max_er = 0x8000000000000000 4503599627370496.0000, avg_er = 268435456.047, #>=1:0.5 = 428997:691341
icatanh:max_er = 0x8000000000000000 4503599627370496.0000, avg_er = 268435456.212, #>=1:0.5 = 4681:81365
iclog: max_er =      0x6f4 0.8691, avg_er = 0.213, #>=1:0.5 = 0:181032
icsqrt:max_er = 0x8000000000000000 4503599627370496.0000, avg_er = 268435456.476, #>=1:0.5 = 14279:7662849

i386 float prec, on 2**12 x 2**12 args:
rcacos:max_er = 0x42cd19c6 2.0875, avg_er = 0.314, #>=1:0.5 = 3854:215116
rcacosh:max_er = 0x3170e232 1.5450, avg_er = 0.254, #>=1:0.5 = 23008:3245028
rcasin:max_er = 0x55adc0df 2.6775, avg_er = 0.208, #>=1:0.5 = 34304:353980
rcasinh:max_er = 0x3170e232 1.5450, avg_er = 0.254, #>=1:0.5 = 23008:3245028
rcatan:max_er = 0x3c4078ec 1.8829, avg_er = 0.284, #>=1:0.5 = 13260:190836
rcatanh:max_er = 0x2cef3171 1.4042, avg_er = 0.167, #>=1:0.5 = 4096:420916
rclog: max_er = 0x25830853 1.1722, avg_er = 0.246, #>=1:0.5 = 120:24892
rcsqrt:max_er =  0xffffaf0 0.5000, avg_er = 0.292, #>=1:0.5 = 0:0
rccos: max_er = 0xb7f10439f3111686 24688206287.5958, avg_er = 1415.000, #>=1:0.5 = 3881364:4046644
rccosh:max_er = 0xb7f10439f3111686 24688206287.5958, avg_er = 1415.000, #>=1:0.5 = 3881364:4046644
rcexp: max_er = 0xb7e10439f3111686 24679817679.5958, avg_er = 312.908, #>=1:0.5 = 3884684:4078258
rcsin: max_er = 0xb80bf11b402ee934 24702322906.0057, avg_er = 807.518, #>=1:0.5 = 3841652:4115524
rcsinh:max_er = 0xb7f10439f3111686 24688206287.5958, avg_er = 1693.508, #>=1:0.5 = 3880876:4368000
rctan: max_er = 0xb68d1b1c5151fcc0 24501606626.5413, avg_er = 1627.536, #>=1:0.5 = 3890932:4214696
rctanh:max_er = 0x125ead9c2ea875d 154097358.0911, avg_er = 1416.635, #>=1:0.5 = 2897540:3523200
icacos:max_er = 0x3170e232 1.5450, avg_er = 0.254, #>=1:0.5 = 23008:3245028
icacosh:max_er = 0x42cd19c6 2.0875, avg_er = 0.314, #>=1:0.5 = 3854:215116
icasin:max_er = 0x3170e232 1.5450, avg_er = 0.254, #>=1:0.5 = 23008:3245028
icasinh:max_er = 0x55adc0df 2.6775, avg_er = 0.208, #>=1:0.5 = 34304:353980
icatan:max_er = 0x2cef3171 1.4042, avg_er = 0.167, #>=1:0.5 = 4096:420916
icatanh:max_er = 0x3c4078ec 1.8829, avg_er = 0.284, #>=1:0.5 = 13260:190836
iclog: max_er = 0x1fc2b4f5 0.9925, avg_er = 0.302, #>=1:0.5 = 0:338712
icsqrt:max_er =  0xffffaf0 0.5000, avg_er = 0.292, #>=1:0.5 = 0:0
iccos: max_er = 0xb80bf11b402ee934 24702322906.0057, avg_er = 1477.124, #>=1:0.5 = 3824832:4420052
iccosh:max_er = 0xb80bf11b402ee934 24702322906.0057, avg_er = 1477.124, #>=1:0.5 = 3824832:4420052
icexp: max_er = 0xb7fbf11b402ee934 24693934298.0057, avg_er = 1736.298, #>=1:0.5 = 3839256:4219864
icsin: max_er = 0xb7f10439f3111686 24688206287.5958, avg_er = 1693.508, #>=1:0.5 = 3880876:4368000
icsinh:max_er = 0xb80bf11b402ee934 24702322906.0057, avg_er = 807.518, #>=1:0.5 = 3841652:4115524
ictan: max_er = 0x125ead9c2ea875d 154097358.0911, avg_er = 1416.635, #>=1:0.5 = 2897540:3523200
ictanh:max_er = 0xb68d1b1c5151fcc0 24501606626.5413, avg_er = 1632.536, #>=1:0.5 = 3890988:4214752

i386 double prec, on 2**12 x 2**12 args:
rcacos:max_er =     0x11e8 2.2383, avg_er = 0.165, #>=1:0.5 = 248:111850
rcacosh:max_er =      0xb02 1.3760, avg_er = 0.256, #>=1:0.5 = 104:2715312
rcasin:max_er =     0x13ce 2.4756, avg_er = 0.112, #>=1:0.5 = 5616:95060
rcasinh:max_er =      0xb02 1.3760, avg_er = 0.256, #>=1:0.5 = 104:2715312
rcatan:max_er =      0x9ed 1.2407, avg_er = 0.015, #>=1:0.5 = 4084:48920
rcatanh:max_er =      0xb17 1.3862, avg_er = 0.014, #>=1:0.5 = 56:77456
rclog: max_er =      0x704 0.8770, avg_er = 0.250, #>=1:0.5 = 0:20112
rcsqrt:max_er =      0x7aa 0.9580, avg_er = 0.399, #>=1:0.5 = 0:4104
icacos:max_er =      0xb02 1.3760, avg_er = 0.256, #>=1:0.5 = 104:2715312
icacosh:max_er =     0x11e8 2.2383, avg_er = 0.165, #>=1:0.5 = 248:111850
icasin:max_er =      0xb02 1.3760, avg_er = 0.256, #>=1:0.5 = 104:2715312
icasinh:max_er =     0x13ce 2.4756, avg_er = 0.112, #>=1:0.5 = 5616:95060
icatan:max_er =      0xb17 1.3862, avg_er = 0.014, #>=1:0.5 = 56:77456
icatanh:max_er =      0x9ed 1.2407, avg_er = 0.015, #>=1:0.5 = 4084:48920
iclog: max_er =      0x6f4 0.8691, avg_er = 0.213, #>=1:0.5 = 0:181032
icsqrt:max_er =      0x400 0.5000, avg_er = 0.399, #>=1:0.5 = 0:4026

sparc64 double prec, on 2**12 x 2**12 args:
rcacos:max_er =     0x36b5 3.4192, avg_er = 0.228, #>=1:0.5 = 2394:125984
rcacosh:max_er =     0x1f15 1.9426, avg_er = 0.258, #>=1:0.5 = 8464:2737832
rcasin:max_er =     0x2b8c 2.7217, avg_er = 0.113, #>=1:0.5 = 33296:99148
rcasinh:max_er =     0x1f15 1.9426, avg_er = 0.258, #>=1:0.5 = 8464:2737800
rcatan:max_er =     0x2c46 2.7671, avg_er = 0.212, #>=1:0.5 = 4680:81364
rcatanh:max_er =     0x2970 2.5898, avg_er = 0.047, #>=1:0.5 = 129452:691356
rclog: max_er =      0xe09 0.8772, avg_er = 0.250, #>=1:0.5 = 0:18348