kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled
for non-AMD processors, which are not vulnerable to FreeBSD-SA-06:14.fpu
Bruce Evans
bde at zeta.org.au
Sun Jun 18 03:30:30 UTC 2006
The following reply was made to PR kern/98460; it has been noted by GNATS.
From: Bruce Evans <bde at zeta.org.au>
To: Rostislav Krasny <rosti.bsd at gmail.com>
Cc: freebsd-gnats-submit at freebsd.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled
for non-AMD processors, which are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Sun, 18 Jun 2006 13:30:09 +1000 (EST)
On Sun, 18 Jun 2006, Rostislav Krasny wrote:
> On Sat, 17 Jun 2006 17:01:27 +1000 (EST)
> Bruce Evans <bde at zeta.org.au> wrote:
>
>> On Fri, 16 Jun 2006, Rostislav Krasny wrote:
>>> ...
>>> I think it is a matter of principle. AMD saved a few microcommands in
>>> their incorrect implementation of two Pentium III instructions. And now
>>> buyers of their processors are paying much more than those few
>>> microcommands.
>>
>> No, the non-AMD users pay much less (unless the cost of branch prediction
>> is very large). When I tried to measure the overhead for the fix, I found
>> that fxsave+fxrstor takes almost twice as long on a P4(Xeon) as on an
>> Athlon(XP,64). That's about 150 cycles longer IIRC. The fix costs only
>> 14 cycles.
>
> Yes, according to
> http://security.freebsd.org/advisories/FreeBSD-SA-06:14-amd.txt
> the "FXRSTOR-centric" method takes 14 cycles on AMD Opteron processor.
> That is the minimum which AMD users need to pay now. Non-AMD users have
> four options:
I confirmed the ~14 cycle value in a micro-benchmark but don't really
believe it. The difficulty of accounting for cache misses of various
types (perhaps mainly branch target cache misses here) is shown partly
by the AMD statement not even mentioning caches.
> 1. run the same instructions down the drain
> 2. test some flag
> 3. jump over these instructions
> 4. disable these instructions in the kernel build configuration
5. Replace these instructions by no-op instructions. (This can be done
at no cost for many bytes of instructions on CPUs with micro-ops, but
costs up to 2 (?) cycles per byte on old i386's.)
6. Change the pointer to Xdna in the IDT to a pointer to a version
without these instructions.
7. Change Xdna (and/or routines that it calls, preferably none) to a
version without these or hundreds of other instructions.
8. Do some of the above for all branches and/or routine in the kernel
to avoid hundreds of thousands of branches and other instructions.
9. Use another method to exploit parallelism better. fldl after fxsave
is probably better for parallelism.
> Now, how much it will cost them:
>
> 1. same 14 cycles (?)
> 2. minimum 20 cycles on NetBurst or about 15 cycles on Pentium III
> http://www.intel.com/cd/ids/developer/asmo-na/eng/44010.htm?prn=Y
> plus 1 or 2 microcommands for a BT or TEST instruction.
> 3. 1 microcommand for one direct JMP
> 4. nothing
1. Possibly 14, probably more, but possibly less due to parallelism.
2. Now at most 2 on modern CPUs under the same bad assumptions that
give 14 for (1).
3. Direct jumps sometimes take just as long as conditional jumps on
some CPUs (I think due to them not being cached), but if something
is sure to take only a single micro-op then there's a good chance
of parallelism.
4. Probably, but possibly not since the extra code might accidentally
improve instruction scheduling :-).
5. Like (3), except no-ops may reduce to 0 micro-ops instead of 1 and
thus take 0 execution resources but some prefetch resources.
6. Like (4).
7. Like (6) repeated 50 times. Xdna could take 20 times fewer instructions
but wouldn't be 20 times faster because the slow fxrstor instruction
would dominate.
8. I think the potential savings from this huge task are about 10% for
the kernel and some fraction of this for the system.
9. "fxsave; testl $FLAG,cpu_fxsr; jz 1f; fnstsw ...; cmp ...; jz; fnclex;
fldl ...; 1:".
Now the cpu_fxsr test and even the status test might be free even if
there is a branch misprediction, since there are no important data
dependencies. If the CPU has enough execution units then it can do
the following in parallel:
    FPU1        ALU1                   FPU2     ALU[2-]      FPU[2-]
    ----        ----                   ----     -------      -------
    fxsave      testl $FLAG,cpu_fxsr   idle     runs ahead   runs ahead
    ...         jz 1f                  idle     ...          ...
    ...         ...                    fnstsw
    ...         cmp                    ...
    ...         jz
    ...         runs ahead
    fnclex      ...
    fldl
    runs ahead
    ...
Some serializing instruction, probably iret:
iret iret iret iret iret
If the CPU soon returns to user mode then it will hit a serializing
instruction soon, so it is important to start the slow fxsave instruction
as early as possible so that everything doesn't have to wait for it.
The npxsave() call in cpu_switch() was written about 13 years ago and
the i386 cpu_switch() is more like 20 years old. It knows nothing
about multiple execution units and happens to schedule the npx switch
(actually the save half of a switch) almost perfectly pessimally by
doing it near the end. However, mi_switch() has a lot of bloat so
this probably doesn't matter -- the fxsave+fnclex sequence will complete
before the bloat gets through the integer ALUs.
I don't know if modern CPUs have this much parallelism. My (old, paper)
AthlonXP optimization manual says that fnstsw runs in the FSTORE pipe and
doesn't say which pipe(s) fxsave runs in, so I guess fnstsw has to wait
for fxsave. You would like this since AthlonXPs would have to wait but
Pentiums would proceed on all except ALU1 and FPU1 :-).
> The last option has the best performance cost but kernel build options
> are unhandy. Implementation of the third option is simple. Why not
> do it? Only one byte of the code will be self-modified.
Because modifying only 1 byte in a 5MB library (the kernel) for a larger
application (userland) would make little difference.
>> 14 cycles is a lot from one point of view, but from a practical point
>> of view it is the same as 0. Suppose that the kernel does 1000 context
>> switches per second per CPU (too many for efficiency since it thrashes
>> caches), and that an FPU switch occurs on all of these (it would
>> normally be much less than that since half of all context switches are
>> often to kernel threads (and half back), and many threads don't use the
>> FPU). We then waste 14000 cycles per second + more for branch misprediction
>> and other cache effects. At 2GHz 14000 cycles is a whole 7 µs.
>
> How many cycles does a context switch normally take? About 1000 cycles?
> Then 14 - 20 additional cycles are 1.4% - 2% of the previous context
> switch time. Why waste it?
More like 2000 (best case). It was more like 1000 as recently as RELENG_4,
but there have been many branches since then. On my AthlonXP @2223 MHz
with a TSC timecounter, according to LMbench:
% L M B E N C H 2 . 0 S U M M A R Y
% ------------------------------------
%
% Context switching - times in microseconds - smaller is better
% -------------------------------------------------------------
% Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
% ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
% --------- ------------- ----- ------ ------ ------ ------ ------- -------
% epsplex.b FreeBSD 4.10- 0.370 0.6800 7.9100 2.2800 14.1 4.62000 55.9
% epsplex.b FreeBSD 5.2-C 0.830 1.3600 8.6200 3.2900 24.7 4.28000 58.5
0.370 uS is 823 cycles and 0.830 uS is 1845 cycles. The variance of
these times is about 5%. LMbench's context switching doesn't exercise
the FPU.
Bruce