[PATCH] improve accuracy and speed of tanh[f] for x in [0:1)

Sat Feb 26 18:30:45 UTC 2011

The patch that follows my signature improves both the
accuracy and speed of tanhf and tanh in the interval
[0,1).  The testing was done on an Intel Core2 Duo CPU
T7250 and everything is compiled with gcc.  The
improvements are accomplished by using a trancated 
continued fraction instead of relying on expm1[f] and
a rewritten definition of tanh(x).  The results can 
be summarized by

      |  fdlibm   | Con. frac.
-------------------------------
      | ulp  time | ulp  time
tanhf | 1.66 1.36 | 0.56 1.31
 tanh | 2.16 5.65 | 1.53 5.33
tanhl | 2.36 4.72 | 1.53 5.12

ulp in the above table is the maximum ulp observed for
2 million evenly distributed values with the interval.
The timing for tanhf is for 5 million calls and the
timing for tanh is for 20 million calls, where again the
values are evenly distributed across the interval.  The
timings are the total time in seconds for a tight loop.

For those paying attention, yes, the last entry is for
tanhl, which libm does not currently implement.  I have
an implementation that uses expm1l and fdlibm's simple
rewritten tanh(x) definition.  I, of course, have a 
truncated continued fraction implementation as well.

Now for those that are really paying attention, libm
does not contain an expm1l. :)

-- 
Steve

Index: src/s_tanh.c
===================================================================

--- src/s_tanh.c	(revision 219046)
+++ src/s_tanh.c	(working copy)
@@ -66,8 +66,33 @@
 		t = expm1(two*fabs(x));
 		z = one - two/(t+two);
 	    } else {
+#if 0
+		/*
+		 * In the interval, one has a max ulp of 2.156 for 2M
+		 * evenly distributed values.  For timing, 20M call
+		 * gives 5.65 s.
+		 */
 	        t = expm1(-two*fabs(x));
 	        z= -t/(t+two);
+#else
+		/*
+		 * This is a continued fraction representation that
+		 * was truncated at the 10th level and then refractored
+		 * to get rid of the divisions.  In the interval, one
+		 * has a max ulp of 1.532 for 2M evenly distributed
+		 * values.  For timing, 20M calls give 5.33 s.
+		 */
+		double x2, t1, t2, t3, t4, t5;
+		x2 = x * x;
+		t1 = 399 + 20 * x2;
+		t3 = 17 * t1 + (21 + x2) * x2;
+		t4 = 15 * t3 + t1 * x2;
+		t5 = 13 * t4 + t3 * x2;
+		t2 = 99 * t5 + (9 * t4 + t5) * x2;
+		t3 = 7 * t2 + (11 * t5 + t4 * x2) * x2;
+		z = x / (1 + x2 / (3 + t3 * x2 / (5 * t3 + t2 * x2)));
+		return (z);
+#endif
 	    }
     /* |x| >= 22, return +-1 */
 	} else {
Index: src/s_tanhf.c
===================================================================
--- src/s_tanhf.c	(revision 219046)
+++ src/s_tanhf.c	(working copy)
@@ -44,8 +44,29 @@
 		t = expm1f(two*fabsf(x));
 		z = one - two/(t+two);
 	    } else {
+#if 0
+		/*
+		 * In the interval, one has a max ulp of 1.66 for 2M
+		 * evenly distributed values.  For timing, 5M calls
+		 * give 1.36 s.
+		 */
 	        t = expm1f(-two*fabsf(x));
 	        z= -t/(t+two);
+#else
+		/*
+		 * This is a continued fraction representation that
+		 * was truncated at the 6th level and then refractored
+		 * to get rid of the divisions.  In the interval, one
+		 * has a max ulp of 0.559 for 2M evenly distributed
+		 * values.  For timing, 5M calls give 1.306 s.
+		 */
+		float x2, t1, t2, t3;
+		x2 = x * x;
+		t1 = 11 + x2;
+		t2 = 9 * t1 + x2;
+		t3 = 7 * t2 + t1 * x2;
+		z = x / (1 + x2 / (3 + t3 * x2 / (5 * t3 + t2 * x2 )));
+#endif
 	    }
     /* |x| >= 9, return +-1 */
 	} else {