kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu

Sat Jun 17 07:10:20 UTC 2006

The following reply was made to PR kern/98460; it has been noted by GNATS.

From: Bruce Evans <bde at zeta.org.au>
To: Rostislav Krasny <rosti.bsd at gmail.com>
Cc: freebsd-gnats-submit at FreeBSD.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled
 for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Sat, 17 Jun 2006 17:01:27 +1000 (EST)

 On Fri, 16 Jun 2006, Rostislav Krasny wrote:

 > On Fri, 16 Jun 2006 22:50:01 +1000 (EST)
 > Bruce Evans <bde at zeta.org.au> wrote:
 >
 >> Why are we worrying about just this and not all the other branches on
 >> cpu_fxsr, not to mention all other branches in the kernel :-)?
 >
 > I think it is a matter of principle. AMD saved a few microcommands in
 > their incorrect implementation of two Pentium III instructions. And now
 > buyers of their processors are paying much more than those few
 > microcommands.

 No, the non-AMD users pay much less (unless the cost of branch prediction
 is very large).  When I tried to measure the overhead for the fix, I found
 that fxsave+fxrstor takes almost twice as long on a P4(Xeon) as on an
 Athlon(XP,64).  That's about 150 cycles longer IIRC.  The fix costs only
 14 cycles.

 These measurements were in microbenchmarks that loop (and in manuals
 that assume similar best-case setups).  The extra 150 cycles is free
 if it is done in parallel with integer operations.  npxdna() only does
 the fxrstor half and has limited parallelism, and I haven't measured
 how many of the extra 150/2 cycles are free (probably none).  14 cycles
 for the fix assumes no branch misprediction.

 14 cycles is a lot from one point of view, but from a practical point
 of view it is the same as 0.  Suppose that the kernel does 1000 context
 switches per second per CPU (too many for efficiency since it thrashes
 caches), and that an FPU switch occurs on all of these (it would
 normally be much less than that, since half of all context switches are
 often to kernel threads (and half back), and many threads don't use the
 FPU).  We then waste 14000 cycles per second, plus more for branch
 misprediction and other cache effects.  At 2GHz, 14000 cycles is a whole
 7us.

 > Why should buyers of processors from other manufacturers,
 > which implemented FXSAVE and FXRSTOR correctly, pay even a tiny bit of
 > their performance for nothing?

 Because they can't measure the difference?

 I think that unless you modify millions of branches, there is more to be
 gained from things like scheduling instructions so that high-latency
 instructions like fxrstor are started early, but the gains here are still
 relatively small and are better left to compilers and CPUs because the
 best scheduling is machine-dependent.

 Bruce