amd64 cpu_switch in C.

Jeff Roberson jroberson at chesapeake.net
Thu Mar 13 07:28:20 UTC 2008


On Thu, 13 Mar 2008, Bruce Evans wrote:

> On Wed, 12 Mar 2008, Peter Wemm wrote:
>
>> On Tue, Mar 11, 2008 at 9:14 PM, David Xu <davidxu at freebsd.org> wrote:
>>> Jeff Roberson wrote:
>>> > http://people.freebsd.org/~jeff/amd64.diff
>>>
>>>  This is a good idea.
>
> I wouldn't have expected it to make much difference.  On i386 UP,
> cpu_switch() normally executes only 48 instructions for in-kernel
> context switches in my version of 5.2 and only 61 instructions in
> -current.  ~5.2 differs from 5.2 here only in not having to
> switch %eflags.  This saves 4 instructions but considerably more in
> cycles, especially on P4, where accesses to %eflags are very slow.
> 5.2 would take 52 instructions, and -current has bloated by 9
> instructions relative to 5.2.

More expensive than the raw instruction count are:

1)  The mispredicted branches taken to deal with all of the optional
state and features that are not always saved (see the sketch below).
2)  The extra icache pressure from fetching past all of those unused
instructions, unaligned jumps, etc.
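
For a concrete picture of 1), think of the conditional saves in the
switch path.  This is only a sketch; the flag and helper names are
made up for illustration:

	/*
	 * Sketch only: every piece of optional per-thread state costs
	 * a test plus a rarely-taken branch in the switch path, and a
	 * mispredict on any of them costs far more than the handful
	 * of instructions it skips.
	 */
	if (pcb->pcb_flags & PCB_DBREGS)	/* debug regs in use? */
		save_dbregs(pcb);		/* hypothetical helper */
	if (pcb->pcb_flags & PCB_FPU_DIRTY)	/* FPU/SSE state dirty? */
		save_fpu_state(pcb);		/* hypothetical helper */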

I haven't looked at i386 very closely lately, but on amd64 the wrmsrs for 
fs/gsbase are very expensive.  On my 2GHz dual-core Opteron the optimized 
switch seems to take about 100ns.  The total switch from userspace to 
userspace is about 4x that.

>
> In-kernel switches are not a very typical case since they don't load
> %cr3.  The 50-60 instructions might take as few as 20 cycles when
> pipelined through 3 ALUs, but they are only moderately parallelizable
> so would take more like 50-60 cycles on an Athlon.  The only very slow
> instructions in them for the usual in-kernel case are the loads of
> %eflags and %gs.  At least the latter is easy to optimize away, but
> the former is associated with the spin locking that hard-disables
> interrupts.
> For userland context switches, there is also an ltr in the usual path
> of execution.  But 100 or so cycles for the simple instructions is
> noise compared with the cost of the TLB flush and other cache misses
> caused by loading %cr3 for userland context switches.  Userland code
> that does useful work will do more than sched_yield() so it will suffer
> more from cache misses.
>

We've been working on amd64, so I can't comment specifically about i386 
costs.  However, I definitely agree that cpu_switch() is not the greatest 
overhead in the path.  Also, you have to load %cr3 even for kernel threads, 
because the page directory page or page directory pointer table at %cr3 
can go away once you've switched out the old thread.
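
So the switch path does the load unconditionally; roughly (a sketch,
with the pcb_cr3 field name borrowed from the amd64 pcb but otherwise
simplified):

	/*
	 * Sketch: no "same address space, skip the load" shortcut is
	 * safe here, because the old pmap's top-level page can be
	 * freed as soon as the old thread is switched out.
	 */
	load_cr3(newpcb->pcb_cr3);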

> Layers above cpu_switch() have become very bloated and make a full
> context switch take several hundred cycles for the simple instructions
> on machines where the simple instructions in cpu_switch() take only
> 100.  Their overhead may almost be significant relative to the cache
> misses.  However, this is another reason why the speed of the simple
> instructions in cpu_switch() doesn't matter.
>
>>>  In fact, according to the calling convention, some
>>>  registers don't need to be saved across a function call, e.g. on
>>>  i386: eax, edx, and ecx. :-) but gdb may need them to dig out
>>>  stack variables' values.
>
> The asm code already saves only call-saved registers for both i386 and
> amd64.  It saves call-saved registers even when it apparently doesn't
> use them (lots more of these on amd64, while on i386 it uses more
> call-saved registers than it needs to, apparently since this is free
> after saving all call-saved registers).  I think saving more than is
> needed is the result of confusion about what needs to be saved and/or
> what is needed for debugging.

It has to save all of the callee-saved registers in the PCB because they 
will likely differ from thread to thread.  Failing to save and restore 
them could leave the function returning with different register values, 
corrupting the caller.
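
On amd64 that boils down to the following set (a sketch of the layout,
not the real struct pcb):

	/*
	 * Sketch: the SysV amd64 ABI makes %rbx, %rbp, and %r12-%r15
	 * callee-saved, so the switch code must preserve them (plus
	 * %rsp and the return %rip) per thread.
	 */
	struct pcb_sketch {
		register_t	pcb_r15, pcb_r14, pcb_r13, pcb_r12;
		register_t	pcb_rbp, pcb_rsp, pcb_rbx, pcb_rip;
	};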

>
>> Jeff and I have been having a friendly "competition" today.
>> 
>> With a UP kernel and INVARIANTS, my initial counter-patch response had
>> nearly double the gain on my machine.  (Jeff 7%, mine: 13.5%).
>> I changed to compile kernels the same as he did (no invariants, SMP
>> kernel, but kern.smp.disabled=1).  After that, our patch sets were the
>> same again - both at about 10% gain over baseline.
>> 
>> I've made a few more changes and am now at 23% improvement over baseline.
>> 
>> I'm not confident of testing methodology.  More tests are in progress.
>> 
>> The good news is that this tuning is finally being done.  It should
>> have been done in 2003 though...
>
> How is this possible with (according to my theory) most of the context
> switch cost being for %cr3 and upper layers?  Unchanged amd64 has only
> a few more costs than i386.  Mainly 3 unconditional wrmsr's and 2
> unconditional rdmsr's for managing gsbase and fsbase.  I thought that
> these were hard to avoid and anyway not nearly as expensive as %cr3 loads.

%cr3 is actually a lot less expensive these days with page table flush 
filters and the PG_G bit.  We were able to optimize away setting the MSRs 
in the case where the previous values match the new ones.  Apparently the 
hardware doesn't optimize this case, so we have to do the comparison 
ourselves.
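
The comparison itself is cheap next to the wrmsr; the idea, roughly
(pcb field names as on amd64, but treat this as illustration):

	/*
	 * Sketch: only hit the slow wrmsr path when the base actually
	 * changed between the outgoing and incoming threads.
	 */
	if (newpcb->pcb_fsbase != oldpcb->pcb_fsbase)
		wrmsr(MSR_FSBASE, newpcb->pcb_fsbase);
	if (newpcb->pcb_gsbase != oldpcb->pcb_gsbase)
		wrmsr(MSR_KGSBASE, newpcb->pcb_gsbase);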

That was a big chunk of the optimization.  Static branch hints, code 
reordering, possibly reordering for better pipeline scheduling in Peter's 
asm, etc. provide the rest.
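
The branch hints are the usual __predict_true()/__predict_false()
annotations from <sys/cdefs.h> (wrappers around the compiler's
__builtin_expect); for example (hypothetical condition and helper):

	/* Hint that the no-optional-state case is the common one, so
	 * the fall-through path stays straight for the fetch unit. */
	if (__predict_false(pcb->pcb_flags & PCB_DBREGS))
		save_dbregs(pcb);	/* hypothetical helper */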

My primary motivation is to get ithread/kthread/taskqueue switch costs 
down for interrupt heavy applications.  There is a lot of unnecessary fat 
there.

Jeff

>
> Bruce

