cvs commit: src/sys/amd64/amd64 cpu_switch.S
attilio at freebsd.org
Thu Jun 19 15:45:58 UTC 2008
2008/6/19, Attilio Rao <attilio at freebsd.org>:
> 2008/3/24, Peter Wemm <peter at freebsd.org>:
> > peter 2008-03-23 23:09:06 UTC
> > FreeBSD src repository
> > Modified files:
> > sys/amd64/amd64 cpu_switch.S
> > Log:
> > First pass at (possibly futile) microoptimizing of cpu_switch. Results
> > are mixed. Some pure context switch microbenchmarks show up to 29%
> > improvement. Pipe based context switch microbenchmarks show up to 7%
> > improvement. Real world tests are far less impressive as they are
> > dominated more by actual work than switch overheads, but depending on
> > the machine in question, workload, kernel options, phase of moon, etc, a
> > few percent gain might be seen.
> > Summary of changes:
> > - don't reload MSR_[FG]SBASE registers when context switching between
> > non-threaded userland apps. These typically cost 120 clock cycles each
> > on an AMD cpu (less on Barcelona/Phenom). Intel cores are probably no
> > faster on this.
> > - The above change only helps unthreaded userland apps that tend to use
> > the same value for gsbase. Threaded apps will get no benefit from this.
> > - reorder things like accessing the pcb to be in memory order, to give
> > prefetching a better chance of working. Operations are now in increasing
> > memory address order, rather than reverse or random.
> > - Push some lesser used code out of the main code paths. Hopefully
> > allowing better code density in cache lines. This is probably futile.
> > - (part 2 of previous item) Reorder code so that branches have a more
> > realistic static branch prediction hint. Both Intel and AMD cpus
> > default to predicting branches to lower memory addresses as being
> > taken, and to higher memory addresses as not being taken. This is
> > overridden by the limited dynamic branch prediction subsystem. A trip
> > through userland might overflow this.
> > - Futile attempt at spreading the use of the results of previous operations
> > in new operations. Hopefully this will allow the cpus to execute in
> > parallel better.
> > - stop wasting 16 bytes at the top of kernel stack, below the PCB.
> > - Never load the userland fs/gsbase registers for kthreads, but preserve
> > curpcb->pcb_[fg]sbase as caches for the cpu. (Thanks Jeff!)
> > Microbenchmarking this code seems to be really sensitive to things like
> > scheduling luck, timing, cache behavior, tlb behavior, kernel options,
> > other random code changes, etc.
> > While it doesn't help heavy userland workloads much, it does help high
> > context switch loads a little, and should help those that involve
> > switching via kthreads a bit more.
> > A special thanks to Kris for the testing and reality checks, and Jeff for
> > tormenting me into doing this. :)
> > This is still work-in-progress.
> It looks like this patch introduces a regression.
> In particular, this chunk:
> @@ -181,82 +166,138 @@ sw1:
> cmpq %rcx, %rdx
> je 1b
> - lfence
> is not totally right as we want to enforce an acq
...an acquire memory barrier in order to correctly handle an eventual
We could use this approach, which is what I implemented on ia32 in
order to solve the same problem:
#define BLOCK_SPIN(reg) \
	movl $blocked_lock,%eax ; \
100: ; \
	lock ; \
	cmpxchgl %eax,TD_LOCK(reg) ; \
	jne 101f ; \
	pause ; \
	jmp 100b ; \
101:
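
For illustration only, the idea behind the macro can be sketched in portable C11 atomics. This is not the kernel code: `struct mtx_sketch`, `struct thread_sketch`, `blocked_lock_sketch`, `cpu_relax()` and both function names are hypothetical stand-ins introduced here. The first variant spins with an acquire load; the second mirrors BLOCK_SPIN by compare-and-swapping blocked_lock with itself, so a successful exchange means the thread is still blocked and a failed one means the lock has been handed over.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct mtx and struct thread. */
struct mtx_sketch { int dummy; };
struct thread_sketch { _Atomic(struct mtx_sketch *) td_lock; };

/* Hypothetical stand-in for the kernel's blocked_lock sentinel. */
static struct mtx_sketch blocked_lock_sketch;

/* Spin-wait hint; expands to "pause" on x86 and to nothing elsewhere. */
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ __volatile__("pause");
#endif
}

/*
 * Variant 1: spin until td_lock stops pointing at the blocked sentinel.
 * The acquire load keeps later reads of the thread's state from being
 * ordered before the load that observed the hand-over.
 */
static struct mtx_sketch *
spin_while_blocked_acq(struct thread_sketch *td)
{
	struct mtx_sketch *l;

	while ((l = atomic_load_explicit(&td->td_lock,
	    memory_order_acquire)) == &blocked_lock_sketch)
		cpu_relax();
	assert(l != &blocked_lock_sketch);
	return (l);
}

/*
 * Variant 2: same shape as BLOCK_SPIN. Try to swap blocked -> blocked.
 * Success means td_lock is still the sentinel, so keep spinning.
 * Failure means another value was observed; C11 stores it in 'expected'.
 */
static struct mtx_sketch *
spin_while_blocked_cas(struct thread_sketch *td)
{
	struct mtx_sketch *expected = &blocked_lock_sketch;

	while (atomic_compare_exchange_strong_explicit(&td->td_lock,
	    &expected, &blocked_lock_sketch,
	    memory_order_acq_rel, memory_order_acquire))
		cpu_relax();
	return (expected);
}
```

On x86 a lock-prefixed cmpxchg acts as a full barrier, which is why the second form needs no separate fence instruction; whether that cost is acceptable in cpu_switch is a separate question.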
[Sorry if I pushed "send" wrongly]
Peace can only be achieved by understanding - A. Einstein