amd64 cpu_switch in C.

Bruce Evans brde at optusnet.com.au
Fri Mar 14 03:00:00 UTC 2008


On Thu, 13 Mar 2008, Jeff Roberson wrote:

Please trim quotes more.

> On Fri, 14 Mar 2008, Bruce Evans wrote:
>
>> On Wed, 12 Mar 2008, Jeff Roberson wrote:

>>> More expensive than the raw instruction count is:
>>> 
>>> 1)  The mispredicted branches to deal with all of the optional state and 
>>> features that are not always saved.
>> 
>> This is unlikely to matter, and apparently doesn't, at least in simple
>> benchmarks, since the C version has even more branches.  Features that
>> are rarely used cause branches that are usually perfectly predicted.
>
> The C version has two fewer branches because it tests for two unlikely 
> features together.  It has a few more branches than the asm version in 
> CVS, and the same number of extra branches as Peter's asm version, to 
> support conditional gs/fsbase setting.  The other extra branches have to 
> do with supporting cpu_switch() and cpu_throw() together.

Testing features together is probably best here, but it might not
always be.  Executing more branches might be faster because each
individual branch is easier to predict.
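
Roughly, the trade-off looks like this (a sketch with made-up flag
names, not the actual cpu_switch() code):

/* Hypothetical rarely-set feature flags. */
#define	TDF_FEAT_A	0x01
#define	TDF_FEAT_B	0x02

static void handle_a(void) { /* rare feature A */ }
static void handle_b(void) { /* rare feature B */ }

static void
check_features(int flags)
{
	/*
	 * One combined test: a single usually-not-taken branch in the
	 * common case, plus two more branches only when a feature is
	 * actually set.
	 */
	if (flags & (TDF_FEAT_A | TDF_FEAT_B)) {
		if (flags & TDF_FEAT_A)
			handle_a();
		if (flags & TDF_FEAT_B)
			handle_b();
	}
	/*
	 * Testing the flags separately would execute two branches even
	 * in the common case, but each branch is then easier to
	 * predict on its own.
	 */
}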

>>> 2)  The cost of extra icache for getting over all of those unused 
>>> instructions, unaligned jumps, etc.
>> 
>> Again, if this were the cause of slowness then it would affect the C
>> version more, since the C version is larger.
>
> The C version is not larger than the asm version at high optimization levels 
> when you consider the total instruction count that is brought into the 
> icache.  It's worth noting that my C version is slower in some cases other 
> than the microbenchmark, due to extra instructions for optimizations that 
> don't matter.  Peter's asm version is tight enough that the extra compares 
> don't cost more than the compacted code saves.  The C version touches more 
> distinct icache lines but makes up for it with other optimizations in the 
> common case.

Are calls to rarely-called functions getting auto-inlined for your C
version?  The asm version doesn't need to worry about this.  Even with
auto-inlining of static functions that are only called once (a new
bugfeature in gcc-4.1 which breaks profiling and debugging), at some
optimization levels gcc will place the code for the unusual case far
away so as not to pollute the i-cache in the usual case, although this
may cost an extra branch in the unusual case.  For rarely-called
functions, it must be better not to inline at all.
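
Something like the following should prevent the inlining (a sketch; I
think <sys/cdefs.h> has a __noinline wrapper for the same gcc
attribute, and -fno-inline-functions-called-once turns the gcc-4.1
misfeature off globally):

/* Keep the rarely-executed path out of the hot i-cache lines. */
static void slow_path(void) __attribute__((__noinline__));

static void
slow_path(void)
{
	/* The unusual case; executed rarely, so inlining only hurts. */
}

static void
common_path(int unusual)
{
	if (unusual)
		slow_path();	/* single call site, but still a call */
}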

>> In fact, the benchmark is probably too simple to show the cost of
>> branches.  Just doing sched_yield() in a loop gives the following
>> atypical behaviour which may be atypical enough for the larger branch
>> and cache costs for the C version to not have much effect:
>> - it doesn't go near most of the special cases, so branches are
>>  predictable (always non-special) and are thus predicted provided
>>  (a) the CPU actually does reasonably good branch prediction, and
>>  (b) the branch predictions fit in the branch prediction cache
>>      (reasonably good branch prediction probably requires such a
>>      cache).
>
> This cache is surely virtual, as it operates in the first few stages of 
> the pipeline.  That means it's flushed on every switch.  We're probably 
> coming in cold every time.

Which cache?  My perfmon results show that the branch cache is far from cold.
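
For reference, the benchmark is essentially just the following (a
sketch of what ./yield does; the actual source isn't in this thread):

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main(int argc, char **argv)
{
	struct timespec t0, t1;
	long i, iters, r, reps;
	double ns;

	/* ./yield 100000 10: 10 reps of 100000 sched_yield() calls. */
	iters = argc > 1 ? atol(argv[1]) : 100000;
	reps = argc > 2 ? atol(argv[2]) : 10;
	for (r = 0; r < reps; r++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < iters; i++)
			sched_yield();
		clock_gettime(CLOCK_MONOTONIC, &t1);
		ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
		    (t1.tv_nsec - t0.tv_nsec);
		printf("%.1f ns/yield\n", ns / iters);
	}
	return (0);
}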

>> The C version uses lots of non-inline function calls.  Just the
>> branches for this would have a significant overhead if the branches
>> are mispredicted.  I think you are depending on gcc's auto-inlining
>> of static functions which are only called once to avoid the full
>> cost of the function calls.
>
> I depend on it not inlining them to avoid polluting the icache with unused 
> instructions.  I broke that with my most recent patch by moving the calls 
> back into C.

:-) Maybe I only looked at the most recent patch.  It seemed to have lots
of calls.

To prevent inlining you probably need to use the noinline attribute for
some functions.  I don't see how the C version can be both simpler and
(as|more) optimal than the asm version.  It already has magic somewhat
self-documenting macros for branch prediction and magic undocumented 
layout for the function calls etc. to improve branch prediction and
icache use.  For even-more-micro optimizations in libm, I try to do
everything in C, but the only way I can get near the efficiency that
I want is to look at the asm output and then figure out how to trick
the compiler into not being so stupid. I could optimize it in asm
with less work (starting with the asm output, especially at first to
learn what works for SSE scheduling), but only for a single CPU type.
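
The branch-prediction macros in question are __predict_true() and
__predict_false() from <sys/cdefs.h>.  Modulo compiler-version guards,
they just wrap gcc's __builtin_expect():

#define	__predict_true(exp)	__builtin_expect((exp), 1)
#define	__predict_false(exp)	__builtin_expect((exp), 0)

/*
 * Typical use:
 *	if (__predict_false(td->td_flags & TDF_NEEDRESCHED))
 *		...take the unusual path...
 */

gcc turns the annotation into static branch prediction and block
layout, moving the unlikely block out of the fall-through path.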

>> Some perfmon output for ./yield 100000 10:
>> ...
>> % # s/kx-fr-dispatch-stall-for-segment-load % 134520281
>> 
>> 134 cycles per call.  This may be more for the ones in syscall()
>> generally.  I think each segreg load still costs ~20 cycles.  Since
>> this is on i386, there are 6 per call (%ds, %es and %fs save and
>> restore), plus the %ss save and restore, which might not be counted
>> here.  134 is a lot -- about 60nS of the 180nS for getpid().

I forgot about parallelism.  With 3-way execution on an Athlon, there
is at least a chance that all 3 segment registers are loaded in parallel,
taking only ~20 cycles for all 3, but no chance of proceeding with
other instructions if so.  OTOH, if only 1 or 2 ALUs can do segreg
loads, then the other ALUs may be able to proceed with independent
instructions.  We have some nearby instructions that depend on %ds
(these might benefit from using %ss) but few or no nearby dependencies
on %es and %fs.  Kernel code mostly doesn't worry about dependencies
at all.  Dependencies don't matter as much in integer code as in SSE/FPU
code.

>>> We've been working on amd64 so I can't comment specifically about i386 
>>> costs. However, I definitely agree that cpu_switch() is not the greatest 
>>> overhead in the path.  Also, you have to load cr3 even for kernel threads 
>>> because the page directory page or page directory pointer table at %cr3 
>>> can go away once you've switched out the old thread.
>> 
>> I don't see this.  The switch is avoided if %cr3 wouldn't change, which
>> I think usually or always happens for switches between kernel threads.
>
> I see, you're saying 'between kernel threads'.  There was some discussion of 
> allowing kernel threads to use the page tables of whichever thread was last 
> switched in to avoid cr3 in all cases for them.  This requires other changes 
> to be safe however.

Probably a good idea.
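
For reference, the existing optimization is just a compare of the
page-table root cached in the pcb, roughly (a C sketch of what the
asm does; load_cr3() is the usual cpufunc.h wrapper):

	/*
	 * Skip the %cr3 reload, and thus the flush of all non-global
	 * TLB entries, when the incoming thread uses the same page
	 * tables -- as a kernel thread borrowing the outgoing thread's
	 * pmap would.
	 */
	if (newpcb->pcb_cr3 != oldpcb->pcb_cr3)
		load_cr3(newpcb->pcb_cr3);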

Bruce

