kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu

Sat Jun 17 21:40:21 UTC 2006

The following reply was made to PR kern/98460; it has been noted by GNATS.

From: Rostislav Krasny <rosti.bsd at gmail.com>
To: Bruce Evans <bde at zeta.org.au>
Cc: freebsd-gnats-submit at FreeBSD.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be
 disabled for not AMD processors, those are not vulnerable to
 FreeBSD-SA-06:14.fpu
Date: Sun, 18 Jun 2006 00:31:55 +0300

 On Sat, 17 Jun 2006 17:01:27 +1000 (EST)
 Bruce Evans <bde at zeta.org.au> wrote:

 > On Fri, 16 Jun 2006, Rostislav Krasny wrote:
 > 
 > > On Fri, 16 Jun 2006 22:50:01 +1000 (EST)
 > > Bruce Evans <bde at zeta.org.au> wrote:
 > >
 > >> Why are we worrying about just this and not all the other branches on
 > >> cpu_fxsr, not to mention all other branches in the kernel :-)?
 > >
 > > I think it is a matter of principle. AMD saved a few micro-ops in
 > > their incorrect implementation of two Pentium III instructions, and now
 > > buyers of their processors are paying much more than those few
 > > micro-ops.
 > 
 > No, the non-AMD users pay much less (unless the cost of branch prediction
 > is very large).  When I tried to measure the overhead for the fix, I found
 > that fxsave+fxrstor takes almost twice as long on a P4(Xeon) as on an
 > Athlon(XP,64).  That's about 150 cycles longer IIRC.  The fix costs only
 > 14 cycles.

 Yes, according to
 http://security.freebsd.org/advisories/FreeBSD-SA-06:14-amd.txt
 the "FXRSTOR-centric" method takes 14 cycles on an AMD Opteron processor.
 That is the minimum AMD users have to pay now. Non-AMD users have
 four options:

 1. execute the same instructions for nothing
 2. test a flag and branch around these instructions
 3. jump over these instructions
 4. disable these instructions in the kernel build configuration

 Now, how much each option will cost them:

 1. the same 14 cycles (?)
 2. at least 20 cycles on NetBurst or about 15 cycles on Pentium III
    for a mispredicted branch
    http://www.intel.com/cd/ids/developer/asmo-na/eng/44010.htm?prn=Y
    plus 1 or 2 micro-ops for the BT or TEST instruction.
 3. 1 micro-op for one direct JMP
 4. nothing

 The last option has the lowest performance cost, but kernel build options
 are inconvenient. The third option is simple to implement. Why not
 do it? Only one byte of code would be self-modified.
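
 As a rough illustration of option 2, here is a minimal user-space
 sketch in C. It is not the kernel code: fpu_clean_state_stub(),
 fpu_init() and fpu_restore_prologue() are hypothetical names, and
 checking only the CPUID vendor string "AuthenticAMD" is a simplifying
 assumption about which processors need the fix. Option 3 would replace
 the per-call test with a JMP patched in once at boot, which portable C
 cannot express.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for fpu_clean_state(); it only counts calls. */
static unsigned long clean_calls;

static void fpu_clean_state_stub(void)
{
    clean_calls++;
}

/* Assumption: only CPUs reporting the AMD vendor string need the fix. */
static bool cpu_needs_fpu_clean(const char *vendor)
{
    return strcmp(vendor, "AuthenticAMD") == 0;
}

/* Option 2: the flag is set once at boot and tested on every restore. */
static bool need_fpu_clean;

static void fpu_init(const char *vendor)
{
    need_fpu_clean = cpu_needs_fpu_clean(vendor);
}

/* Would run just before FXRSTOR; the branch is the per-call cost. */
static void fpu_restore_prologue(void)
{
    if (need_fpu_clean)
        fpu_clean_state_stub();
}
```

 With this shape, a non-AMD CPU pays for one test and one (normally
 well-predicted) branch per FPU restore instead of the clean-up itself.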

 > These measurements were in microbenchmarks that loop (and in manuals
 > that assume similar best-case setups).  The extra 150 cycles is free
 > if it is done in parallel with integer operations.  npxdna() only does
 > the fxrstor half and has limited parallelism, and I haven't measured
 > how many of the extra 150/2 cycles are free (probably none).  14 cycles
 > for the fix assumes no branch misprediction.
 > 
 > 14 cycles is a lot from one point of view, but from a practical point
 > of view it is the same as 0.  Suppose that the kernel does 1000 context
 > switches per second per CPU (too many for efficiency since it thrashes
 > caches), and that an FPU switch occurs on all of these (it would
 > normally be much less than that since half of all context switches are
 > often to kernel threads (and half back), and many threads don't use the
 > FPU).  We then waste 14000 cycles per second + more for branch misprediction
 > and other cache effects.  At 2GHz 14000 cycles is a whole 7uS.

 How many cycles does a context switch normally take? About 1000 cycles?
 Then 14 - 20 additional cycles add 1.4% - 2% to the context
 switch time. Why waste them?
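
 The two ways of looking at the same numbers can be put side by side
 in a small C sketch. The function names are hypothetical, and the
 1000-cycle context switch is only the guess made above, not a
 measurement.

```c
/* Extra cycles as a percentage of an assumed context-switch cost. */
static double overhead_percent(double extra_cycles, double switch_cycles)
{
    return 100.0 * extra_cycles / switch_cycles;
}

/* Microseconds lost per second of wall time, given the number of
 * FPU switches per second and the CPU clock in Hz. */
static double wasted_us_per_second(double extra_cycles,
                                   double switches_per_sec,
                                   double clock_hz)
{
    return extra_cycles * switches_per_sec / clock_hz * 1e6;
}
```

 For example, overhead_percent(14, 1000) gives about 1.4 and
 overhead_percent(20, 1000) gives about 2, while
 wasted_us_per_second(14, 1000, 2e9) gives about 7, matching the 7uS
 figure quoted above.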

 > > Why should buyers of processors from other manufacturers,
 > > which implemented FXSAVE and FXRSTOR correctly, pay even a tiny bit of
 > > their performance for nothing?
 > 
 > Because they can't measure the difference?

 From a practical point of view that waste may look minor, but as
 a matter of principle it is not.

 By the way, how many cycles would be saved by converting the
 fpu_clean_state() function to inline code?