cvs commit: src/sys/amd64/amd64 cpu_switch.S

Thu Jun 19 15:42:04 UTC 2008

2008/3/24, Peter Wemm <peter at freebsd.org>:
> peter       2008-03-23 23:09:06 UTC
>
>   FreeBSD src repository
>
>   Modified files:
>     sys/amd64/amd64      cpu_switch.S
>   Log:
>   First pass at (possibly futile) microoptimizing of cpu_switch.  Results
>   are mixed.  Some pure context switch microbenchmarks show up to 29%
>   improvement.  Pipe based context switch microbenchmarks show up to 7%
>   improvement.  Real world tests are far less impressive as they are
>   dominated more by actual work than switch overheads, but depending on
>   the machine in question, workload, kernel options, phase of moon, etc, a
>   few percent gain might be seen.
>
>   Summary of changes:
>   - don't reload MSR_[FG]SBASE registers when context switching between
>     non-threaded userland apps.  These typically cost 120 clock cycles each
>     on an AMD cpu (less on Barcelona/Phenom).  Intel cores are probably no
>     faster on this.
>   - The above change only helps unthreaded userland apps that tend to use
>     the same value for gsbase.  Threaded apps will get no benefit from this.
>   - reorder things like accessing the pcb to be in memory order, to give
>     prefetching a better chance of working.  Operations are now in increasing
>     memory address order, rather than reverse or random.
>   - Push some lesser used code out of the main code paths.  Hopefully
>     allowing better code density in cache lines.  This is probably futile.
>   - (part 2 of previous item) Reorder code so that branches have a more
>     realistic static branch prediction hint.  Both Intel and AMD cpus
>     default to predicting branches to lower memory addresses as being
>     taken, and to higher memory addresses as not being taken.  This is
>     overridden by the limited dynamic branch prediction subsystem.  A trip
>     through userland might overflow this.
>   - Futile attempt at spreading the use of the results of previous operations
>     in new operations.  Hopefully this will allow the cpus to execute in
>     parallel better.
>   - stop wasting 16 bytes at the top of kernel stack, below the PCB.
>   - Never load the userland fs/gsbase registers for kthreads, but preserve
>     curpcb->pcb_[fg]sbase as caches for the cpu. (Thanks Jeff!)
>
>   Microbenchmarking this code seems to be really sensitive to things like
>   scheduling luck, timing, cache behavior, tlb behavior, kernel options,
>   other random code changes, etc.
>
>   While it doesn't help heavy userland workloads much, it does help high
>   context switch loads a little, and should help those that involve
>   switching via kthreads a bit more.
>
>   A special thanks to Kris for the testing and reality checks, and Jeff for
>   tormenting me into doing this. :)
>
>   This is still work-in-progress.

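For readers skimming the log, the first item in the summary (skipping the
MSR_[FG]SBASE reload when the value would not change) can be sketched in C.
This is only an illustration under stated assumptions, not the code in
cpu_switch.S, which is assembly: struct pcb_sketch, struct cpu_sketch,
wrmsr_sim() and switch_to() are hypothetical names, and the "MSR write" is
simulated with a counter so the effect can be observed in userland.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-thread saved state (not the real pcb). */
struct pcb_sketch {
	uint64_t fsbase;
	uint64_t gsbase;
};

/* Hypothetical per-CPU view: what the (simulated) MSRs currently hold. */
struct cpu_sketch {
	uint64_t cur_fsbase;
	uint64_t cur_gsbase;
	unsigned msr_writes;	/* number of expensive writes performed */
};

/* Simulated MSR write; on real hardware this would be a wrmsr. */
static void
wrmsr_sim(struct cpu_sketch *cpu, uint64_t *slot, uint64_t val)
{
	*slot = val;
	cpu->msr_writes++;
}

/* Only pay for the write when the incoming thread needs a different base. */
static void
switch_to(struct cpu_sketch *cpu, const struct pcb_sketch *next)
{
	assert(cpu != NULL && next != NULL);
	if (cpu->cur_fsbase != next->fsbase)
		wrmsr_sim(cpu, &cpu->cur_fsbase, next->fsbase);
	if (cpu->cur_gsbase != next->gsbase)
		wrmsr_sim(cpu, &cpu->cur_gsbase, next->gsbase);
}

/* Two unthreaded apps share base values; one threaded app does not. */
static unsigned
demo_msr_writes(void)
{
	struct cpu_sketch cpu = { 0, 0, 0 };
	struct pcb_sketch app_a = { 0, 0 };
	struct pcb_sketch app_b = { 0, 0 };
	struct pcb_sketch threaded = { 0x1000, 0 };

	switch_to(&cpu, &app_a);	/* same values: no write */
	switch_to(&cpu, &app_b);	/* same values: no write */
	switch_to(&cpu, &threaded);	/* fsbase differs: one write */
	switch_to(&cpu, &app_a);	/* fsbase differs again: one write */
	return (cpu.msr_writes);
}
```

In this sketch, switching between the two unthreaded apps costs no simulated
MSR writes, while each switch to or from the threaded app costs one, which
matches the log's remark that threaded apps get no benefit.
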
It looks like this patch introduces a regression.
In particular, this chunk:

@@ -181,82 +166,138 @@ sw1:
 	cmpq	%rcx, %rdx
 	pause
 	je	1b
-	lfence
 #endif

is not totally right as we want to enforce an acq
-- 
Peace can only be achieved by understanding - A. Einstein