  Rearranged the polynomial evaluation to reduce dependencies, as in
  k_tanf.c but with different details.
  The polynomial is odd with degree 13 for tanf() and odd with degree
  9 for sinf(), so the details are not very different for sinf() -- the
  term with the x**11 and x**13 coefficients goes away, and (mysteriously)
  it helps to do the evaluation of w = z*z early although moving it later
  was a key optimization for tanf().  The details are different but simpler
  for cosf() because the polynomial is even and of lower degree.
  On Athlons, for uniformly distributed args in [-2pi, 2pi], this saves
  about 4 cycles (10%) in most cases (13% for sinf()
  on AXP, but 0% for cosf() with gcc-3.3 -O1 on AXP).  The best case
  (sinf() with gcc-3.4 -O1 -fcaller-saves on A64) now takes 33-39 cycles
  (was 37-45 cycles).  Hardware sinf takes 74-129 cycles.  Despite
  being fine-tuned for Athlons, the optimization is even larger on
  some other arches (about 15% on ia64 (pluto2) and 20% on alpha (beast)
  with gcc -O2 -fomit-frame-pointer).
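
  The rearrangement described above can be sketched roughly as follows for
  a degree-9 odd polynomial.  This is only an illustration, not the
  committed code: the function names are hypothetical and the coefficients
  are plain Taylor coefficients, assumed here in place of the tuned ones
  the kernel actually uses.  The Horner form is one long chain of dependent
  multiply-adds; the split form computes w = z*z early so that the low and
  high halves of the polynomial can be evaluated independently.

```c
/* Taylor coefficients for sin(x); an assumption made for illustration only. */
static const double C1 = -1.0 / 6.0;		/* x**3 */
static const double C2 =  1.0 / 120.0;		/* x**5 */
static const double C3 = -1.0 / 5040.0;		/* x**7 */
static const double C4 =  1.0 / 362880.0;	/* x**9 */

/* Horner form: every step depends on the result of the previous one. */
static double
sin_poly_horner(double x)
{
	double z = x * x;

	return x + x * z * (C1 + z * (C2 + z * (C3 + z * C4)));
}

/*
 * Split form: w, r and s depend only on x and z, so they can be
 * computed in parallel with the low-order part (C1 + z * C2).
 */
static double
sin_poly_split(double x)
{
	double z = x * x;
	double w = z * z;		/* evaluated early */
	double r = C3 + z * C4;		/* high-order half */
	double s = z * x;

	return (x + s * (C1 + z * C2)) + s * w * r;
}
```

  Both functions evaluate the same polynomial, so their results should
  agree up to rounding; whether the split form is actually faster depends
  on the CPU and compiler, as the timings above suggest.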
