kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu

Fri Jun 16 13:00:43 UTC 2006

The following reply was made to PR kern/98460; it has been noted by GNATS.

From: Bruce Evans <bde at zeta.org.au>
To: Rostislav Krasny <rosti.bsd at gmail.com>
Cc: freebsd-gnats-submit at freebsd.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled
 for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Fri, 16 Jun 2006 22:50:01 +1000 (EST)

 On Fri, 16 Jun 2006, Rostislav Krasny wrote:

 > On Sat, 10 Jun 2006 11:26:20 +1000 (EST)
 > Bruce Evans <bde at zeta.org.au> wrote:
 >
 >> On Fri, 9 Jun 2006, Rostislav Krasny wrote:
 >>
 >>> On Wed, 7 Jun 2006 12:09:10 +1000 (EST)
 >>> Bruce Evans <bde at zeta.org.au> wrote:

 >>>> [on avoiding some branches]
 >>>
 >>> Could you please explain in more detail how that can be done?
 >>
 >> Just do it.  The easiest way is to define the new function as inline.
 >> This just works because the function is defined before it is used.
 >>
 >> [snipped]
 >
 > But you still check cpu_fxsr, so a branch misprediction on a good few
 > CPUs is still possible. The only solution is a self-modified code with
 > a direct jump. I made following userland example of such a code:

 Why are we worrying about just this and not all the other branches on
 cpu_fxsr, not to mention all other branches in the kernel :-)?  Note
 that there's another one on cpu_fxsr, in the critical path for npxdna(),
 in fpurstor().  There are also many branches and other unnecessary
 overheads in the trap handling before npxdna() is called.  No one seems
 to be concerned about these.  I sometimes worry about these, and prefer
 my original implementation of i387 DNA handling all in assembler.  It
 takes 12 instructions with 1 branch, whereas in my version of FreeBSD
 Xdna takes 124 instructions with 23 branches (46 instructions with 10
 branches in npxdna()).

 I don't know how common branch misprediction is in npxdna() (or in Xdna
 or trap() or in trap handling generally), but I guess it is quite common,
 and fairly common for syscalls too, since traps are not very common
 and individual syscalls are not very common; thus the CPU is likely
 to have better things to do with memory cache and branch cache resources
 than caching traps or individual syscalls.  But if something is so
 little used that it doesn't stay cached then unnecessarily using it is
 unlikely to make a significant difference to efficiency.

 > [Example of self-modifying code]

 > I think there should be no need in mprotect() in the kernel. That
 > technique could be combined with an assembly version of fpu_clean_state()
 > from following article. See the '"FXRSTOR-centric" method':

 I think Linux is doing this now (perhaps more with nulling out unnecessary
 instructions).  Trap handlers can be patched even more easily and
 efficiently by pointing their IDT entry at a machine-dependent optimal
 handler, but as mentioned above FreeBSD does almost the opposite of
 that (it pushes everything through trap()).
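
 A minimal userland sketch of the "choose the handler once" idea, not
 FreeBSD code.  All names here (dna_handler_t, dna_fxsr, dna_fnsave,
 dna_select, dna_dispatch) are hypothetical, and has_fxsr merely stands
 in for cpu_fxsr.  A function pointer plays the role of the IDT entry;
 a real IDT entry would not even pay for the indirect call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type; stands in for the target of an IDT entry. */
typedef int (*dna_handler_t)(void);

/* Hypothetical handlers.  Each returns a tag so the choice is observable. */
static int dna_fxsr(void)   { return 1; }  /* CPU with FXSAVE/FXRSTOR */
static int dna_fnsave(void) { return 0; }  /* CPU without it */

/* Selected once at startup, like writing the IDT entry once. */
static dna_handler_t dna_handler = NULL;

/* One-time selection; has_fxsr is an assumed stand-in for cpu_fxsr. */
static void dna_select(int has_fxsr)
{
    dna_handler = has_fxsr ? dna_fxsr : dna_fnsave;
}

/* Hot path: no test of has_fxsr here, only an indirect call. */
static int dna_dispatch(void)
{
    assert(dna_handler != NULL);
    return dna_handler();
}
```

 The feature test is paid once in dna_select() instead of on every
 call.  Whether that is measurably faster than a well-predicted branch
 is exactly the open question in this thread.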

 > http://security.freebsd.org/advisories/FreeBSD-SA-06:14-amd.txt
 >
 > That might be tricky, I know. But why one should pay a performance
 > penalty because of a CPU he/she didn't buy?

 Because the penalty is (?) too small to measure.  I would be interested
 in any measurement that shows otherwise, and generally in any method
 for measuring the cost of branches in code that should not be executed
 very often.  I often do micro-benchmarks by putting sequences of
 instructions in a loop, but this doesn't work right for code that is
 not executed very often.  I haven't looked at performance counter info
 for a long time.
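
 For illustration, a loop micro-benchmark of the kind described above
 might look like the following sketch.  The names (flag, work, bench)
 are hypothetical and flag only stands in for cpu_fxsr.  Note the
 limitation already mentioned: after the first few iterations the
 branch is predicted and the code is cached, so this measures the warm
 case and says little about code that runs rarely.

```c
#include <assert.h>
#include <time.h>

/* Assumed stand-in for cpu_fxsr; volatile so it is re-read each call. */
static volatile int flag = 1;

/* Code under test: one data-dependent branch. */
static unsigned long work(unsigned long x)
{
    if (flag)
        return x + 1;
    return x + 2;
}

/*
 * Run work() iters times and return the average CPU seconds per
 * iteration.  The accumulated value is stored through result so that
 * the compiler cannot discard the loop.
 */
static double bench(unsigned long iters, unsigned long *result)
{
    unsigned long acc = 0;
    unsigned long i;
    clock_t t0, t1;

    assert(iters > 0);
    t0 = clock();
    for (i = 0; i < iters; i++)
        acc = work(acc);
    t1 = clock();
    *result = acc;
    return (double)(t1 - t0) / CLOCKS_PER_SEC / (double)iters;
}
```

 clock() is coarse, so iters has to be large for the average to mean
 anything, and even then the number only describes the hot-loop case.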

 Bruce