kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Bruce Evans
bde at zeta.org.au
Sat Jun 17 07:10:20 UTC 2006
The following reply was made to PR kern/98460; it has been noted by GNATS.
From: Bruce Evans <bde at zeta.org.au>
To: Rostislav Krasny <rosti.bsd at gmail.com>
Cc: freebsd-gnats-submit at FreeBSD.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled
for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Sat, 17 Jun 2006 17:01:27 +1000 (EST)
On Fri, 16 Jun 2006, Rostislav Krasny wrote:
> On Fri, 16 Jun 2006 22:50:01 +1000 (EST)
> Bruce Evans <bde at zeta.org.au> wrote:
>
>> Why are we worrying about just this and not all the other branches on
>> cpu_fxsr, not to mention all other branches in the kernel :-)?
>
> I think it is a matter of principle. AMD saved a few microcommands in
> their incorrect implementation of two Pentium III instructions. And now
> buyers of their processors are paying much more than those few
> microcommands.
No, the non-AMD users pay much less (unless the cost of branch prediction
is very large). When I tried to measure the overhead for the fix, I found
that fxsave+fxrstor takes almost twice as long on a P4(Xeon) as on an
Athlon(XP,64). That's about 150 cycles longer IIRC. The fix costs only
14 cycles.
These measurements were in microbenchmarks that loop (and in manuals
that assume similar best-case setups). The extra 150 cycles is free
if it is done in parallel with integer operations. npxdna() only does
the fxrstor half and has limited parallelism, and I haven't measured
how many of the extra 150/2 cycles are free (probably none). 14 cycles
for the fix assumes no branch misprediction.
14 cycles is a lot from one point of view, but from a practical point
of view it is the same as 0. Suppose that the kernel does 1000 context
switches per second per CPU (too many for efficiency since it thrashes
caches), and that an FPU switch occurs on all of these (it would
normally be much less than that, since half of all context switches are
often to kernel threads (and half back), and many threads don't use the
FPU). We then waste 14000 cycles per second, plus more for branch
misprediction and other cache effects. At 2GHz, 14000 cycles is a whole 7us.
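The arithmetic above can be checked with a minimal sketch in C. The
function names are hypothetical and used only for illustration; the
inputs (1000 FPU switches per second, 14 cycles per fix, a 2GHz clock)
are the worst-case figures assumed in the paragraph above, not new
measurements.

```c
#include <stdint.h>

/*
 * Cycles spent on the fix per second per CPU, assuming every context
 * switch also switches the FPU (the pessimistic case described above).
 */
uint64_t
wasted_cycles_per_sec(uint64_t fpu_switches_per_sec, uint64_t cycles_per_fix)
{
	return (fpu_switches_per_sec * cycles_per_fix);
}

/* Convert a cycle count to nanoseconds at a clock rate given in Hz. */
uint64_t
cycles_to_ns(uint64_t cycles, uint64_t clock_hz)
{
	return (cycles * UINT64_C(1000000000) / clock_hz);
}
```

With those inputs, wasted_cycles_per_sec(1000, 14) gives 14000, and
cycles_to_ns(14000, 2000000000) gives 7000 ns, i.e. 7us out of every
second, or about 7 parts per million of the CPU. Branch misprediction
and cache effects are not modelled here.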
> Why should buyers of processors from other manufacturers,
> which implemented FXSAVE and FXRSTOR correctly, pay even a tiny bit of
> their performance for nothing?
Because they can't measure the difference?
I think that unless you modify millions of branches, there is more to be
gained from things like scheduling instructions so that high-latency
instructions like fxrstor are started early, but the gains here are still
relatively small and are better done by compilers and CPUs because the
best scheduling is machine-dependent.
Bruce