kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled
for non-AMD processors, which are not vulnerable to FreeBSD-SA-06:14.fpu
Bruce Evans
bde at zeta.org.au
Sun Jun 18 03:30:30 UTC 2006
The following reply was made to PR kern/98460; it has been noted by GNATS.
From: Bruce Evans <bde at zeta.org.au>
To: Rostislav Krasny <rosti.bsd at gmail.com>
Cc: freebsd-gnats-submit at freebsd.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled
for non-AMD processors, which are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Sun, 18 Jun 2006 13:30:09 +1000 (EST)
On Sun, 18 Jun 2006, Rostislav Krasny wrote:
> On Sat, 17 Jun 2006 17:01:27 +1000 (EST)
> Bruce Evans <bde at zeta.org.au> wrote:
>
>> On Fri, 16 Jun 2006, Rostislav Krasny wrote:
>>> ...
>>> I think it is a matter of principle. AMD saved a few microcommands in
>>> their incorrect implementation of two Pentium III instructions. And now
>>> buyers of their processors are paying much more than those few
>>> microcommands.
>>
>> No, the non-AMD users pay much less (unless the cost of branch prediction
>> is very large). When I tried to measure the overhead for the fix, I found
>> that fxsave+fxrstor takes almost twice as long on a P4(Xeon) as on an
>> Athlon(XP,64). That's about 150 cycles longer IIRC. The fix costs only
>> 14 cycles.
>
> Yes, according to
> http://security.freebsd.org/advisories/FreeBSD-SA-06:14-amd.txt
> the "FXRSTOR-centric" method takes 14 cycles on AMD Opteron processor.
> That is the minimum which AMD users need to pay now. Non-AMD users have
> four options:
I confirmed the ~14 cycle value in a micro-benchmark but don't really
believe it. The difficulty of accounting for cache misses of various
types (perhaps mainly branch target cache misses here) is shown partly
by the AMD statement not even mentioning caches.
> 1. run the same instructions down the drain
> 2. test some flag
> 3. jump over these instructions
> 4. disable these instructions in the kernel build configuration
5. Replace these instructions by no-op instructions. (This can be done
at no cost for many bytes of instructions on CPUs with micro-ops, but
costs up to 2 (?) cycles per byte on old i386's.)
6. Change the pointer to Xdna in the IDT to a pointer to a version
without these instructions.
7. Change Xdna (and/or routines that it calls, preferably none) to a
version without these or hundreds of other instructions.
8. Do some of the above for all branches and/or routine in the kernel
to avoid hundreds of thousands of branches and other instructions.
9. Use another method to exploit parallelism better. fldl after fxsave
is probably better for parallelism.
> Now, how much it will cost them:
>
> 1. same 14 cycles (?)
> 2. minimum 20 cycles on NetBurst or about 15 cycles on Pentium III
> http://www.intel.com/cd/ids/developer/asmo-na/eng/44010.htm?prn=Y
> plus 1 or 2 microcommands for a BT or TEST instruction.
> 3. 1 microcommand for one direct JMP
> 4. nothing
1. Possibly 14, probably more, but possibly less due to parallelism.
2. Now at most 2 on modern CPUs under the same bad assumptions that
give 14 for (1).
3. Direct jumps sometimes take just as long as conditional jumps on
some CPUs (I think due to them not being cached), but if something
is sure to take only a single micro-op then there's a good chance
of parallelism.
4. Probably, but possibly not since the extra code might accidentally
improve instruction scheduling :-).
5. Like (3), except no-ops may reduce to 0 micro-ops instead of 1 and
thus take 0 execution resources but some prefetch resources.
6. Like (4).
7. Like (6) repeated 50 times. Xdna could take 20 times fewer instructions
but wouldn't be 20 times faster because the slow fxrstor instruction
would dominate.
8. I think the potential savings from this huge task are about 10% for
the kernel and some fraction of this for the system.
9. "fxsave; testl $FLAG,cpu_fxsr; jz 1f; fnstsw ...; cmp ...; jz; fnclex;
fldl ...; 1:".
Now the cpu_fxsr test and even the status test might be free even if
there is a branch misprediction, since there are no important data
dependencies. If the CPU has enough execution units then it can do
the following in parallel:
    FPU1        ALU1                   FPU2     ALU[2-]      FPU[2-]
    ----        ----                   ----     -------      -------
    fxsave      testl $FLAG,cpu_fxsr   idle     runs ahead   runs ahead
    ...         jz 1f                  idle     ...          ...
    ...         ...                    fnstsw
    ...         cmp                    ...
    ...         jz
    ...         runs ahead
    fnclex      ...
    fldl
    runs ahead
    ...
Some serializing instruction, probably iret:
iret iret iret iret iret
If the CPU soon returns to user mode then it will hit a serializing
instruction soon, so it is important to start the slow fxsave instruction
as early as possible so that everything doesn't have to wait for it.
The npxsave() call in cpu_switch() was written about 13 years ago and
the i386 cpu_switch() is more like 20 years old. It knows nothing
about multiple execution units and happens to schedule the npx switch
(actually the save half of a switch) almost perfectly pessimally by
doing it near the end. However, mi_switch() has a lot of bloat so
this probably doesn't matter -- the fxsave+fnclex sequence will complete
before the bloat gets through the integer ALUs.
I don't know if modern CPUs have this much parallelism. My (old, paper)
AthlonXP optimization manual says that fnstsw runs in the FSTORE pipe and
doesn't say which pipe(s) fxsave runs in, so I guess fnstsw has to wait
for fxsave. You would like this since AthlonXPs would have to wait but
Pentiums would proceed on all except ALU1 and FPU1 :-).
> The last option has the best performance cost but kernel build options
> are unhandy. Implementation of the third option is simple. Why not
> do it? Only one byte of the code will be self-modified.
Because modifying only 1 byte in a 5MB library (the kernel) for a larger
application (userland) would make little difference.
>> 14 cycles is a lot from one point of view, but from a practical point
>> of view it is the same as 0. Suppose that the kernel does 1000 context
>> switches per second per CPU (too many for efficiency since it thrashes
>> caches), and that an FPU switch occurs on all of these (it would
>> normally be much less than that since half of all context switches are
>> often to kernel threads (and half back), and many threads don't use the
>> FPU). We then waste 14000 cycles per second + more for branch misprediction
>> and other cache effects. At 2GHz 14000 cycles is a whole 7 µs.
>
> How many cycles does a context switch normally take? About 1000 cycles?
> Then 14 - 20 additional cycles are 1.4% - 2% of the previous context
> switch time. Why waste it?
More like 2000 (best case). It was more like 1000 as recently as RELENG_4,
but there have been many branches since then. On my AthlonXP @2223 MHz
with a TSC timecounter, according to LMbench:
% L M B E N C H 2 . 0 S U M M A R Y
% ------------------------------------
%
% Context switching - times in microseconds - smaller is better
% -------------------------------------------------------------
% Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
% ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
% --------- ------------- ----- ------ ------ ------ ------ ------- -------
% epsplex.b FreeBSD 4.10- 0.370 0.6800 7.9100 2.2800 14.1 4.62000 55.9
% epsplex.b FreeBSD 5.2-C 0.830 1.3600 8.6200 3.2900 24.7 4.28000 58.5
0.370 uS is 823 cycles and 0.830 uS is 1845 cycles. The variance of
these times is about 5%. LMbench's context switching doesn't exercise
the FPU.
Bruce