amd64 cpu_switch in C.

Bruce Evans brde at optusnet.com.au
Thu Mar 13 06:22:52 UTC 2008


On Wed, 12 Mar 2008, Peter Wemm wrote:

> On Tue, Mar 11, 2008 at 9:14 PM, David Xu <davidxu at freebsd.org> wrote:
>> Jeff Roberson wrote:
>> > http://people.freebsd.org/~jeff/amd64.diff
>>
>>  This is a good idea.

I wouldn't have expected it to make much difference.  On i386 UP,
cpu_switch() normally executes only 48 instructions for in-kernel
context switches in my version of 5.2 and only 61 instructions in
-current.  My ~5.2 differs from stock 5.2 here only in not having to
switch %eflags.  That saves 4 instructions but many more cycles,
especially on the P4, where accesses to %eflags are very slow.  Stock
5.2 would take 52 instructions, and -current has bloated by 9
instructions relative to 5.2.

In-kernel switches are not a very typical case since they don't load
%cr3.  The 50-60 instructions might take as few as 20 cycles when
pipelined through 3 ALUs, but they are only moderately parallelizable
so would take more like 50-60 cycles on an Athlon.  The only very slow
instructions in them for the usual in-kernel case are the loads of
%eflags and %gs.  At least the latter (%gs) is easy to optimize away, but
the former is associated with spin locking, which hard-disables interrupts.
For userland context switches, there is also an ltr in the usual path
of execution.  But 100 or so cycles for the simple instructions is
noise compared with the cost of the TLB flush and other cache misses
caused by loading %cr3 for userland context switches.  Userland code
that does useful work will do more than sched_yield() so it will suffer
more from cache misses.
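
To make the %cr3 point concrete, a rough sketch in C might look like the
following.  This is not Jeff's patch: rcr3(), load_cr3() and the pcb_cr3
field are the real amd64 interfaces, but savectx_regs() and
restorectx_regs() are purely illustrative placeholders for the register
save/restore.

void
cpu_switch_sketch(struct thread *old, struct thread *new)
{
        savectx_regs(old->td_pcb);      /* call-saved registers only */

        /*
         * Kernel threads (and threads of a single process) share page
         * tables, so the %cr3 load and the TLB flush it implies are
         * skipped for in-kernel switches.
         */
        if (new->td_pcb->pcb_cr3 != rcr3())
                load_cr3(new->td_pcb->pcb_cr3);

        restorectx_regs(new->td_pcb);   /* resume on the new kernel stack */
}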

The layers above cpu_switch() have become very bloated and make a full
context switch take several hundred cycles for the simple instructions
on machines where the simple instructions in cpu_switch() take only
100.  Their overhead may almost be significant relative to the cache
misses.  Either way, this is another reason why the speed of the simple
instructions in cpu_switch() doesn't matter.
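
To spell out which layers those are, the usual path is roughly the
following (the function names are the real ones; the per-layer notes are
only my rough characterization, not measurements):

/*
 * sleepq/turnstile/sched_yield() and other callers
 *     -> mi_switch()        MI bookkeeping and accounting
 *         -> sched_switch() runqueue manipulation, lock handoff
 *             -> cpu_switch() the ~50-60 simple instructions above
 */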

>>  In fact, according to the calling convention, some
>>  registers do not need to be saved across a function call, e.g. on
>>  i386: eax, edx, and ecx. :-) but gdb may need them to dig out
>>  stack variables' values.

The asm code already saves only call-saved registers for both i386 and
amd64.  It saves call-saved registers even when it apparently doesn't
use them (there are many more of these on amd64, while on i386 it uses
more call-saved registers than it needs to, apparently because this is
free after saving all call-saved registers).  I think saving more than is
needed is the result of confusion about what needs to be saved and/or
what is needed for debugging.
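
For reference, these are the call-saved sets the respective ABIs require
cpu_switch() to preserve.  The sketch below uses field names that mimic
the real struct pcb but are not guaranteed to match it exactly.

/* amd64 SysV ABI: only these must survive a function call. */
struct pcb_sketch_amd64 {
        register_t      pcb_rbx, pcb_rbp, pcb_rsp;
        register_t      pcb_r12, pcb_r13, pcb_r14, pcb_r15;
        register_t      pcb_rip;        /* where the switched-out thread resumes */
};

/* i386: a smaller set, so saving all of them is nearly free. */
struct pcb_sketch_i386 {
        register_t      pcb_ebx, pcb_ebp, pcb_esp;
        register_t      pcb_esi, pcb_edi;
        register_t      pcb_eip;
};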

> Jeff and I have been having a friendly "competition" today.
>
> With a UP kernel and INVARIANTS, my initial counter-patch response had
> nearly double the gain on my machine.  (Jeff 7%, mine: 13.5%).
> I changed to compile kernels the same as he did (no invariants, SMP
> kernel, but kern.smp.disabled=1).  After that, our patch sets were the
> same again - both at about 10% gain over baseline.
>
> I've made a few more changes and am now at 23% improvement over baseline.
>
> I'm not confident of testing methodology.  More tests are in progress.
>
> The good news is that this tuning is finally being done.  It should
> have been done in 2003 though...

How is this possible when (according to my theory) most of the context
switch cost is for %cr3 and the upper layers?  Unchanged amd64 has only
a few more costs than i386, mainly 3 unconditional wrmsr's and 2
unconditional rdmsr's for managing gsbase and fsbase.  I thought that
these were hard to avoid and in any case not nearly as expensive as %cr3 loads.
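
For concreteness, that unconditional MSR traffic looks roughly like this.
The MSR numbers and rdmsr()/wrmsr() are the real interfaces;
switch_bases_sketch() and its kernel_gsbase argument are only
illustrative, not the actual cpu_switch code.

#define MSR_FSBASE      0xc0000100      /* user %fs base */
#define MSR_GSBASE      0xc0000101      /* current %gs base (kernel per-CPU data) */
#define MSR_KGSBASE     0xc0000102      /* swapgs'd %gs base (user) */

static void
switch_bases_sketch(struct pcb *oldpcb, struct pcb *newpcb, uint64_t kernel_gsbase)
{
        /* Save the outgoing thread's bases: the 2 rdmsr's. */
        oldpcb->pcb_fsbase = rdmsr(MSR_FSBASE);
        oldpcb->pcb_gsbase = rdmsr(MSR_KGSBASE);
        /* Load the incoming thread's bases: the 3 wrmsr's. */
        wrmsr(MSR_FSBASE, newpcb->pcb_fsbase);
        wrmsr(MSR_GSBASE, kernel_gsbase);
        wrmsr(MSR_KGSBASE, newpcb->pcb_gsbase);
}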

Bruce

