cvs commit: src/sys/amd64/amd64 cpu_switch.S

Thu Jun 19 15:45:58 UTC 2008

2008/6/19, Attilio Rao <attilio at freebsd.org>:
> 2008/3/24, Peter Wemm <peter at freebsd.org>:
>
> > peter       2008-03-23 23:09:06 UTC
>  >
>  >   FreeBSD src repository
>  >
>  >   Modified files:
>  >     sys/amd64/amd64      cpu_switch.S
>  >   Log:
>  >   First pass at (possibly futile) microoptimizing of cpu_switch.  Results
>  >   are mixed.  Some pure context switch microbenchmarks show up to 29%
>  >   improvement.  Pipe based context switch microbenchmarks show up to 7%
>  >   improvement.  Real world tests are far less impressive as they are
>  >   dominated more by actual work than switch overheads, but depending on
>  >   the machine in question, workload, kernel options, phase of moon, etc, a
>  >   few percent gain might be seen.
>  >
>  >   Summary of changes:
>  >   - don't reload MSR_[FG]SBASE registers when context switching between
>  >     non-threaded userland apps.  These typically cost 120 clock cycles each
>  >     on an AMD cpu (less on Barcelona/Phenom).  Intel cores are probably no
>  >     faster on this.
>  >   - The above change only helps unthreaded userland apps that tend to use
>  >     the same value for gsbase.  Threaded apps will get no benefit from this.
>  >   - reorder things like accessing the pcb to be in memory order, to give
>  >     prefetching a better chance of working.  Operations are now in increasing
>  >     memory address order, rather than reverse or random.
>  >   - Push some lesser used code out of the main code paths.  Hopefully
>  >     allowing better code density in cache lines.  This is probably futile.
>  >   - (part 2 of previous item) Reorder code so that branches have a more
>  >     realistic static branch prediction hint.  Both Intel and AMD cpus
>  >     default to predicting branches to lower memory addresses as being
>  >     taken, and to higher memory addresses as not being taken.  This is
>  >     overridden by the limited dynamic branch prediction subsystem.  A trip
>  >     through userland might overflow this.
>  >   - Futile attempt at spreading the use of the results of previous operations
>  >     in new operations.  Hopefully this will allow the cpus to execute in
>  >     parallel better.
>  >   - stop wasting 16 bytes at the top of kernel stack, below the PCB.
>  >   - Never load the userland fs/gsbase registers for kthreads, but preserve
>  >     curpcb->pcb_[fg]sbase as caches for the cpu. (Thanks Jeff!)
>  >
>  >   Microbenchmarking this code seems to be really sensitive to things like
>  >   scheduling luck, timing, cache behavior, tlb behavior, kernel options,
>  >   other random code changes, etc.
>  >
>  >   While it doesn't help heavy userland workloads much, it does help high
>  >   context switch loads a little, and should help those that involve
>  >   switching via kthreads a bit more.
>  >
>  >   A special thanks to Kris for the testing and reality checks, and Jeff for
>  >   tormenting me into doing this. :)
>  >
>  >   This is still work-in-progress.
>
>
> It looks like this patch introduces a regression.
>  In particular, this chunk:
>
>  @@ -181,82 +166,138 @@ sw1:
>         cmpq    %rcx, %rdx
>         pause
>         je      1b
>  -       lfence
>   #endif
>
>  is not totally right as we want to enforce an acq

...an acquire (acq) memory barrier so that a possible thread
migration is handled correctly.
We could use this approach, which is what I implemented on ia32 to
solve the same problem:

#define	BLOCK_SPIN(reg)							\
		movl		$blocked_lock,%eax ;			\
	100: ;								\
		lock ;							\
		cmpxchgl	%eax,TD_LOCK(reg) ;			\
		jne		101f ;					\
		pause ;							\
		jmp		100b ;					\
	101:
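
For anyone who prefers to read this in C, here is a rough sketch of what
the macro does, written with C11 atomics. It is only an illustration and
not the kernel code: struct lock_obj, struct thread_sketch, cpu_pause()
and block_spin() are made-up names, and the acquire ordering is spelled
out explicitly, whereas on x86 the locked cmpxchg already acts as a full
barrier. The loop compares td_lock against blocked_lock; as long as they
match it pauses and retries, and it exits as soon as another CPU has
stored a different lock pointer.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's lock type. */
struct lock_obj {
	int	lo_dummy;
};

/* Sentinel meaning "this thread is still being switched out". */
static struct lock_obj blocked_lock;

/* Hypothetical stand-in for struct thread; only td_lock matters here. */
struct thread_sketch {
	_Atomic(struct lock_obj *) td_lock;
};

/* Spin-wait hint, the equivalent of the "pause" instruction on x86. */
static inline void
cpu_pause(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__builtin_ia32_pause();
#endif
}

/*
 * Wait until td->td_lock is no longer blocked_lock and return the lock
 * pointer that was observed.  The compare-and-swap replaces blocked_lock
 * with blocked_lock, so it never changes the stored value; it is used
 * only for its ordering.  When the comparison fails, "expected" holds
 * the new lock pointer, read with acquire semantics.
 */
static struct lock_obj *
block_spin(struct thread_sketch *td)
{
	for (;;) {
		struct lock_obj *expected = &blocked_lock;

		if (!atomic_compare_exchange_strong_explicit(&td->td_lock,
		    &expected, &blocked_lock,
		    memory_order_acquire, memory_order_acquire))
			return (expected);	/* lock was released to us */
		cpu_pause();
	}
}
```

The point of using a locked compare-and-swap rather than a plain load and
compare is that loads issued after the loop cannot be satisfied before
the new td_lock value has been seen, which is the property the removed
lfence was providing in the amd64 version.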

Thanks,
Attilio

[Sorry, I pushed "send" by mistake on the previous mail]

-- 
Peace can only be achieved by understanding - A. Einstein