Fwd: 5-STABLE kernel build with icc broken

Bruce Evans bde at zeta.org.au
Wed Mar 30 23:18:13 PST 2005


On Wed, 30 Mar 2005, David Schultz wrote:

> On Wed, Mar 30, 2005, Peter Jeremy wrote:
>> On Tue, 2005-Mar-29 22:57:28 -0500, jason henson wrote:
>>> Later in that thread they discuss skipping the restore state to make
>>> things faster.  The minimum buffer size they say this will be good for
>>> is between 2-4k.  Does this make sense, or am I showing my ignorance?
>>>
>>> http://leaf.dragonflybsd.org/mailarchive/kernel/2004-04/msg00264.html
>>
>> Yes.  There are a variety of options for saving/restoring FPU state:
>> a) save FPU state on kernel entry
>> b) save FPU state on a context switch (or if the kernel wants the FPU)
>> c) only save FPU state if a different process (or the kernel) wants the FPU
>> 1) restore FPU state on kernel exit
>> 2) restore FPU state if a process wants the FPU.
>>
>> a and 1 are the most obvious - that's the way the integer registers are
>> handled.
>>
>> I thought FreeBSD used to be c2 but it seems it has been changed to b2
>> since I looked last.

No, it always used b2.  I never got around to implementing c2.

Linux used to implement c2 on i386's, but I think it switched (to b2?) to
optimize (or at least simplify) the SMP case.
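
To keep the letters straight, here is the taxonomy as a code sketch
(the names are ad hoc labels of mine, not from any kernel):

	enum fpu_save_policy {
		FPU_SAVE_ON_ENTRY,	/* (a) on every kernel entry */
		FPU_SAVE_ON_SWITCH,	/* (b) on context switch / kernel use */
		FPU_SAVE_LAZY		/* (c) only when someone else wants the FPU */
	};

	enum fpu_restore_policy {
		FPU_RESTORE_ON_EXIT,	/* (1) on every kernel exit */
		FPU_RESTORE_LAZY	/* (2) on first use, via a DNA trap */
	};

	/* FreeBSD: b2.  Old Linux/i386: c2 (later changed, for SMP). */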

>> Based on the mail above, it looks like Dfly was changed from 1 to 2
>> (I'm not sure if it's 'a' or 'c' on save).

'a' seems to be too inefficient to ever use.  '1' makes sense if it
rarely happens and/or the kernel can often use the FPU more than once
per entry (which it probably shouldn't), but it creates complications
like the ones for SMP, especially in FreeBSD, where the kernel can be
preempted.

Saving FP state as needed is simplest but can be slow.  My Athlon-with-
SSE-extensions pagecopy and pagezero routines use the FPU (XMM) but
their FP state save isn't slow because only 1 or 2 XMM registers need
to be saved.  E.g., the saving part of sse_pagezero_for_some_athlons() is:

 	pushfl			# Also have to save %eflags.
 	cli			# Switch %eflags as needed to safely use FPU.
 	movl	%cr0,%eax	# Also have to save %cr0.
 	clts			# Switch %cr0 as needed to use FPU.
 	subl	$16,%esp	# Space to save some FP state.
 	movups	%xmm0,(%esp)	# Save some FP state.  Only this much needed.
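
The restoring part at the end of the routine is just the reverse.  A
sketch (from memory, so details may differ from the committed version;
it assumes %eax still holds the saved %cr0):

 	movups	(%esp),%xmm0	# Restore the FP state we saved.
 	addl	$16,%esp	# Release the save area.
 	movl	%eax,%cr0	# Restore %cr0 (in particular CR0_TS).
 	popfl			# Restore %eflags (and interrupts).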

>> On the i386 (and probably most other CPUs), you can place the FPU into
>> an "unavailable" state.  This means that any attempt to use it will
>> trigger a trap.  The kernel will then restore FPU state and return.
>> On a normal system call, if the FPU hasn't been used, the kernel will
>> see that it's still in an "unavailable" state and can avoid saving the
>> state.  (On an i386, "unavailable" state is achieved by setting either
>> CR0_TS or CR0_EM).  This means you avoid having to always restore FPU
>> state at the expense of an additional trap if the process actually
>> uses the FPU.

I remember that you (Peter) did extensive benchmarks of this.  I still
think fully lazy switching (c2) is the best general method.  Maybe FP
state should be loaded in advance based on FPU affinity.  It might be
good for CPU affinity to depend on FPU use (prefer not to switch
threads away from a CPU whose FPU still holds their state).
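
In C terms, making the FPU "unavailable"/"available" is just toggling
CR0_TS.  A sketch along the lines of the macros in npx.c (from memory,
so treat it as approximate):

	#include <machine/cpufunc.h>	/* rcr0(), load_cr0() */
	#include <machine/specialreg.h>	/* CR0_TS */

	/* Set CR0_TS: the next FPU instruction traps to the DNA handler. */
	#define	start_emulating()	load_cr0(rcr0() | CR0_TS)
	/* Clear CR0_TS: FPU instructions execute normally again. */
	#define	stop_emulating()	__asm __volatile("clts")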

> This is basically what FreeBSD does on i386 and amd64.  (As a
> disclaimer, I haven't read the code very carefully, so I might be
> missing some of the details.)  Upon taking a trap for a process
> that has never used the FPU before, we save the FPU state for the
> last process to use the FPU, then load a fresh FPU state.  On

We don't save the FPU state for the last thread then (that would be c2
behaviour), since we have already saved it when we switched away from
it.  npxdna() panics if we haven't done that.  Except that rev.1.131
added bogus code (apparently to debug or hide bugs in the other changes
in rev.1.131) that defeats the panic in the fpcurthread == curthread
case.

> subsequent context switches, the FPU state for processes that have
> already used the FPU gets loaded before entering user mode, I
> think.  I haven't studied the code in enough detail to know what

No, that doesn't happen.  Instead, cpu_switch() has called npxsave()
on the context switch away from the thread.  npxsave() arranges for
a trap on the next use of the FPU, and we don't do anything more with
the FPU context of the thread until the thread tries to use the FPU
(in userland).  Then we take the trap and load the saved context in
npxdna().
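
Schematically, the b2 flow is something like this (heavily simplified
pseudo-C: no locking, no per-CPU fpcurthread, no fninit of fresh state;
not the real npx.c):

	struct thread *fpcurthread;	/* thread whose state is in the FPU */

	void
	npxsave(union savefpu *addr)	/* called from cpu_switch() */
	{
		fpusave(addr);		/* fnsave or fxsave */
		start_emulating();	/* set CR0_TS: next FPU use traps */
		fpcurthread = NULL;
	}

	int
	npxdna(void)			/* "device not available" trap */
	{
		if (fpcurthread != NULL)
			panic("npxdna");	/* state wasn't saved */
		stop_emulating();	/* clts */
		fpurstor(curthread->td_pcb->pcb_save);	/* frstor/fxrstor */
		fpcurthread = curthread;
		return (1);		/* handled; retry the instruction */
	}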

> happens for SMP, where a process could be scheduled on a different
> processor before its FPU state is saved on the first processor.

There is no difference for SMP, but there would be large, complicated
differences if we did fully lazy saving.  npxdna() would have to do
something like sending an IPI to the CPU whose FPU holds the owning
thread's state, if that thread could be different from curthread.  This
would be slow, but might be worth doing if it didn't happen much and if
fully lazy context switching were a significant advantage.  I think it
could be arranged to not happen much, but the advantage is insignificant.
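
Schematically, the extra c2 work in npxdna() would look something like
this (entirely hypothetical; none of these helpers exist):

	int
	npxdna_lazy(void)
	{
		struct thread *owner = fpu_owner();	/* hypothetical */

		if (owner != NULL && owner != curthread) {
			/*
			 * The owner's state may be live in another
			 * CPU's FPU.  IPI that CPU and wait for it to
			 * save the state -- this is the slow part.
			 */
			ipi_save_fpu(owner);		/* hypothetical */
			while (!fpu_state_saved(owner))	/* hypothetical */
				cpu_spinwait();
		}
		stop_emulating();
		fpurstor(curthread->td_pcb->pcb_save);
		return (1);
	}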

BTW, David and I recently found a bug in the context switching in the
fxsr case, at least on Athlon-XPs and AMD64s.  At least the AMD64
is documented not to save/restore the last instruction pointers and
opcode in fxsave/fxrstor unless the processor considers save/restore
to be necessary (which is when there is an unmasked exception).  But
this behaviour is inconsistent with what is needed for actually
saving and restoring the FP state on context switches.  The bug can
be seen in gdb -- the pointers and opcode tend to be always 0, but
sometimes there is an unmasked exception, and then the pointers
sometimes get set correctly for the thread that caused the exception,
and to garbage other than 0 for other threads.
The garbage is more obvious when it is read using fnstenv or fsave
directly in userland.  These instructions are not optimized like
fxsave, so they show the actual pointers and opcode.  Since context
switches rarely actually switch the pointers and opcode, the actual
values tend to be wrong after context switches.
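
A quick userland check along these lines shows the stale pointers (a
sketch, not the actual test we used; i386 layout assumed):

	#include <stdio.h>
	#include <unistd.h>

	/* The 28-byte protected-mode fnstenv image (i386). */
	struct fpenv {
		unsigned cw, sw, tw;
		unsigned fip;	/* last FP instruction pointer */
		unsigned fcs;	/* its selector; opcode in the high bits */
		unsigned fdp;	/* last FP data pointer */
		unsigned fds;
	};

	int
	main(void)
	{
		struct fpenv before, after;
		volatile double x = 1.0;

		x = x + x;	/* an FP instruction sets the pointers */
		__asm __volatile("fnstenv %0" : "=m" (before));
		sleep(1);	/* force at least one context switch */
		__asm __volatile("fnstenv %0" : "=m" (after));
		printf("fip 0x%08x -> 0x%08x\n", before.fip, after.fip);
		/* The two values should agree; with the fxsave
		   optimization, "after" tends to be 0 or another
		   thread's garbage. */
		return (0);
	}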

Bruce

