amd64 cpu_switch in C.
Bruce Evans
brde at optusnet.com.au
Fri Mar 14 03:00:00 UTC 2008
On Thu, 13 Mar 2008, Jeff Roberson wrote:
Please trim quotes more.
> On Fri, 14 Mar 2008, Bruce Evans wrote:
>
>> On Wed, 12 Mar 2008, Jeff Roberson wrote:
>>> More expensive than the raw instruction count is:
>>>
>>> 1) The mispredicted branches to deal with all of the optional state and
>>> features that are not always saved.
>>
>> This is unlikely to matter, and apparently doesn't, at least in simple
>> benchmarks, since the C version has even more branches. Features that
>> are rarely used cause branches that are usually perfectly predicted.
>
> The c version has two fewer branches because it tests for two unlikely
> features together. It has a few more branches than the in cvs asm version
> and the same number of extra branches as peter's asm version to support
> conditional gs/fsbase setting. The other extra branches have to do with
> supporting cpu_switch() and cpu_throw() together.
Testing features together is probably best here, but it might not
always be. Executing more branches might be faster because each
individual branch is easier to predict.
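The combined test for two unlikely features can be sketched as below. This is a minimal illustration, not the real cpu_switch(); the flag names and the handler are hypothetical, and the single guarding branch is almost always not-taken, so it predicts well:

```c
#include <assert.h>

/* Hypothetical flag bits for two rarely-used per-thread features
 * (illustrative names, not the real FreeBSD pcb flags). */
#define PCB_FEAT_A 0x01u
#define PCB_FEAT_B 0x02u

static int slow_path_calls;

/* Cold path: sort out whichever rare feature is active. */
static void
handle_rare_features(unsigned flags)
{
	slow_path_calls++;
	(void)flags;
}

/* One branch guards both rare features together, so the common case
 * pays for a single well-predicted not-taken branch. */
static void
switch_fast_path(unsigned flags)
{
	if (__builtin_expect(flags & (PCB_FEAT_A | PCB_FEAT_B), 0))
		handle_rare_features(flags);
}
```

The trade-off discussed above is whether one combined branch (fewer branches, slightly harder to predict if the features are uncorrelated) beats two separate ones (more branches, each trivially predictable).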
>>> 2) The cost of extra icache for getting over all of those unused
>>> instructions, unaligned jumps, etc.
>>
>> Again, if this were the cause of slowness then it would affect the C
>> version more, since the C version is larger.
>
> The C version is not larger than the asm version at high optimization levels
> when you consider the total instruction count that is brought into the
> icache. It's worth noting that my C version is slower in some cases other
> than the microbenchmark due to extra instructions for optimizations that
> don't matter. Peter's asm version is tight enough that the extra compares
> don't cost more than the compacted code wins. The C version touches more
> distinct icache lines but makes up for it in other optimizations in the
> common case.
Are calls to rarely-called functions getting auto-inlined for your C
version? The asm version doesn't worry about this. Even with
auto-inlining of static functions that are only called once (a new
bugfeature in gcc-4.1 which breaks profiling and debugging), at some
optimization levels gcc will place code for the unusual case far away
so as not to pollute the i-cache in the usual case although this may
cost an extra branch in the unusual case. For rarely-called functions,
it must be better to not inline too.
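The "don't inline the rare case" point can be shown with gcc function attributes. A minimal sketch under the assumption of gcc; the function names and counts are illustrative only:

```c
#include <assert.h>

static long unusual_count;

/* noinline keeps gcc from pulling this into the caller even when it is
 * called only once; cold hints that it may be laid out away from the
 * hot code (e.g. in .text.unlikely), so the usual path stays compact in
 * the icache at the cost of a call in the unusual case. */
__attribute__((noinline, cold)) static void
handle_unusual_case(void)
{
	unusual_count++;
}

static long
hot_loop(long n, long rare_at)
{
	long sum = 0;

	for (long i = 0; i < n; i++) {
		sum += i;
		if (__builtin_expect(i == rare_at, 0))
			handle_unusual_case();
	}
	return (sum);
}
```

Without the noinline attribute, gcc-4.1's once-called-static auto-inlining would fold the cold body into the loop and pollute the icache in exactly the way described above.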
>> In fact, the benchmark is probably too simple to show the cost of
>> branches. Just doing sched_yield() in a loop gives the following
>> atypical behaviour which may be atypical enough for the larger branch
>> and cache costs for the C version to not have much effect:
>> - it doesn't go near most of the special cases, so branches are
>> predictable (always non-special) and are thus predicted provided
>> (a) the CPU actually does reasonably good branch prediction, and
>> (b) the branch predictions fit in the branch prediction cache
>> (reasonably good branch prediction probably requires such a
>> cache).
>
> This cache is surely virtual as it happens in the first few stages of the
> pipeline. That means it's flushed on every switch. We're probably coming in
> cold every time.
Which cache? My perfmon results show that the branch cache is far from cold.
>> The C version uses lots of non-inline function calls. Just the
>> branches for this would have a significant overhead if the branches
>> are mispredicted. I think you are depending on gcc's auto-inlining
>> of static functions which are only called once to avoid the full
>> cost of the function calls.
>
> I depend on it not inlining them to avoid polluting the icache with unused
> instructions. I broke that with my most recent patch by moving the calls
> back into C.
:-) Maybe I only looked at the most recent patch. It seemed to have lots
of calls.
To prevent inlining you probably need to use the noinline attribute for
some functions. I don't see how the C version can be both simpler and
(as|more) optimal than the asm version. It already has magic somewhat
self-documenting macros for branch prediction and magic undocumented
layout for the function calls etc. to improve branch prediction and
icache use. For even-more-micro optimizations in libm, I try to do
everything in C, but the only way I can get near the efficiency that
I want is to look at the asm output and then figure out how to trick
the compiler into not being so stupid. I could optimize it in asm
with less work (starting with the asm output, especially at first to
learn what works for SSE scheduling), but only for a single CPU type.
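The "magic somewhat self-documenting macros" are thin wrappers over gcc's __builtin_expect, roughly as FreeBSD's <sys/cdefs.h> defines them; the clamp function below is only a toy use:

```c
#include <assert.h>

/* Branch-prediction hints, approximately as in <sys/cdefs.h>.  The
 * second argument to __builtin_expect is the value the expression is
 * expected to have, which gcc uses to lay out the likely path fall-
 * through and push the unlikely path out of line. */
#define	__predict_true(exp)	__builtin_expect((exp), 1)
#define	__predict_false(exp)	__builtin_expect((exp), 0)

static int
clamp_positive(int v)
{
	/* Self-documenting: negative inputs are declared rare. */
	if (__predict_false(v < 0))
		return (0);
	return (v);
}
```

The hint only pays off when the declared likelihood matches reality; a wrong __predict_false is worse than no annotation, which is part of why getting near hand-written asm requires checking the compiler's output.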
>> Some perfmon output for ./yield 100000 10:
>> ...
>> % # s/kx-fr-dispatch-stall-for-segment-load % 134520281
>>
>> 134 cycles per call. This may be more for ones in syscall() generally.
>> I think each segreg load still costs ~20 cycles. Since this is on
>> i386, there are 6 per call (%ds, %es and %fs save and restore), plus
>> %ss save and restore, which might not be counted here. 134 is a lot -- about
>> 60nS of the 180nS for getpid().
I forgot about parallelism. With 3-way execution on an Athlon, there
is at least a chance that all 3 segment registers are loaded in parallel,
taking only ~20 cycles for all 3, but no chance of proceeding with
other instructions if so. OTOH, if only 1 or 2 ALUs can do segreg
loads, then the other ALUs may be able to proceed with independent
instructions. We have some nearby instructions that depend on %ds
(these might benefit from using %ss) but few or no nearby dependencies
on %es and %fs. Kernel code mostly doesn't worry about dependencies
at all. Dependencies don't matter as much in integer code as in SSE/FPU
code.
>>> We've been working on amd64 so I can't comment specifically about i386
>>> costs. However, I definitely agree that cpu_switch() is not the greatest
>>> overhead in the path. Also, you have to load cr3 even for kernel threads
>>> because the page directory page or page directory pointer table at %cr3
>>> can go away once you've switched out the old thread.
>>
>> I don't see this. The switch is avoided if %cr3 wouldn't change, which
>> I think usually or always happens for switches between kernel threads.
>
> I see, you're saying 'between kernel threads'. There was some discussion of
> allowing kernel threads to use the page tables of whichever thread was last
> switched in to avoid cr3 in all cases for them. This requires other changes
> to be safe however.
Probably a good idea.
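The avoided %cr3 reload discussed above can be sketched as follows. This is a simplified illustration, not the real pmap code; the struct field, the function names, and the counter are all stand-ins:

```c
#include <assert.h>
#include <stdint.h>

static int cr3_loads;
static uint64_t current_cr3;

/* Stand-in for the privileged mov to %cr3, which reloads the page-table
 * base and (without PCID/global pages) flushes the TLB. */
static void
load_cr3(uint64_t cr3)
{
	cr3_loads++;
	current_cr3 = cr3;
}

struct pmap {
	uint64_t pm_cr3;	/* physical addr of top-level page table */
};

/* Switch-time test: only pay for the %cr3 reload and the TLB flush when
 * the incoming thread's page tables actually differ from the current
 * ones, as happens for switches between threads sharing an address
 * space (e.g. kernel threads). */
static void
switch_pmap(struct pmap *newpm)
{
	if (newpm->pm_cr3 != current_cr3)
		load_cr3(newpm->pm_cr3);
}
```

The proposal in the quoted text goes one step further: let kernel threads borrow whatever page tables were last active, so this test fails (and %cr3 is left alone) for every switch into a kernel thread, at the cost of extra care about what those borrowed tables map.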
Bruce
More information about the freebsd-arch
mailing list