cvs commit: src/sys/amd64/amd64 cpu_switch.S

Peter Wemm peter at FreeBSD.org
Sun Mar 23 16:09:07 PDT 2008


peter       2008-03-23 23:09:06 UTC

  FreeBSD src repository

  Modified files:
    sys/amd64/amd64      cpu_switch.S 
  Log:
  First pass at (possibly futile) micro-optimization of cpu_switch.  Results
  are mixed.  Some pure context switch microbenchmarks show up to 29%
  improvement.  Pipe-based context switch microbenchmarks show up to 7%
  improvement.  Real-world tests are far less impressive, as they are
  dominated more by actual work than by switch overhead, but depending on
  the machine in question, workload, kernel options, phase of the moon,
  etc., a few percent gain might be seen.
  
  Summary of changes:
  - Don't reload the MSR_[FG]SBASE registers when context switching between
    non-threaded userland apps.  These writes typically cost 120 clock
    cycles each on an AMD CPU (less on Barcelona/Phenom), and Intel cores
    are probably no faster.  (A rough C sketch of the idea follows this
    list.)
  - The above change only helps unthreaded userland apps that tend to use
    the same value for gsbase.  Threaded apps will get no benefit from it.
  - Reorder operations such as pcb accesses so that they occur in memory
    order, to give prefetching a better chance of working.  Accesses are now
    in increasing memory address order, rather than reverse or random order.
  - Push some lesser-used code out of the main code paths, hopefully
    allowing better code density in the cache lines.  This is probably
    futile.
  - (Part 2 of the previous item.)  Reorder code so that branches get a more
    realistic static branch prediction hint.  Both Intel and AMD CPUs
    default to predicting branches to lower memory addresses as taken, and
    branches to higher memory addresses as not taken.  The static prediction
    is overridden by the dynamic branch prediction hardware, but that has
    limited capacity and a trip through userland may evict its entries.
    (See the second sketch after this list.)
  - Futile attempt at spreading out the use of the results of previous
    operations across later operations.  Hopefully this will let the CPU
    execute more instructions in parallel.
  - Stop wasting 16 bytes at the top of the kernel stack, below the PCB.
  - Never load the userland fs/gsbase registers for kthreads, but preserve
    curpcb->pcb_[fg]sbase as caches of what the CPU holds. (Thanks Jeff!)
    The first sketch below covers this case as well.
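  
  To make the fs/gsbase caching idea concrete, here is a rough C sketch of
  the logic.  The committed code is assembler in cpu_switch.S; the struct
  and the wrmsr_fsbase()/wrmsr_gsbase() helpers below are illustrative
  stand-ins for the real PCB layout and WRMSR sequences, not actual kernel
  interfaces.
  
      #include <stdint.h>
  
      struct pcb_sketch {
              uint64_t pcb_fsbase;    /* userland %fs base */
              uint64_t pcb_gsbase;    /* userland %gs base */
      };
  
      /* Stand-ins for the ~120 cycle WRMSR writes. */
      static void wrmsr_fsbase(uint64_t base) { (void)base; }
      static void wrmsr_gsbase(uint64_t base) { (void)base; }
  
      static void
      switch_user_bases(struct pcb_sketch *oldpcb, struct pcb_sketch *newpcb,
          int incoming_is_kthread)
      {
              if (incoming_is_kthread) {
                      /*
                       * Kernel threads never run with userland bases, so
                       * skip the loads and keep the pcb fields describing
                       * what the CPU still holds.
                       */
                      newpcb->pcb_fsbase = oldpcb->pcb_fsbase;
                      newpcb->pcb_gsbase = oldpcb->pcb_gsbase;
                      return;
              }
              /* Only pay for a WRMSR when the incoming value differs. */
              if (newpcb->pcb_fsbase != oldpcb->pcb_fsbase)
                      wrmsr_fsbase(newpcb->pcb_fsbase);
              if (newpcb->pcb_gsbase != oldpcb->pcb_gsbase)
                      wrmsr_gsbase(newpcb->pcb_gsbase);
      }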
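  
  The branch-layout items aim for the same effect one gets in C from a
  branch hint: mark the rare side cold so the compiler emits it at a higher
  address and the forward branch to it is statically predicted not-taken.
  An illustrative (non-kernel) example, not the committed code:
  
      /* GCC's __builtin_expect marks the error path as unlikely. */
      #define unlikely(x)     __builtin_expect((x), 0)
  
      int
      consume(int value)
      {
              if (unlikely(value < 0)) {
                      /* Rare path: pushed out of the hot cache lines. */
                      return (-1);
              }
              /* Common case falls straight through. */
              return (value * 2);
      }
  
  In hand-written assembler the same layout is achieved directly, by placing
  the slow path after the hot path so it is reached by a forward branch.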
  
  Microbenchmarking this code seems to be really sensitive to things like
  scheduling luck, timing, cache behavior, TLB behavior, kernel options,
  other random code changes, etc.
  
  While it doesn't help heavy userland workloads much, it does help high
  context switch loads a little, and should help those that involve
  switching via kthreads a bit more.
  
  A special thanks to Kris for the testing and reality checks, and Jeff for
  tormenting me into doing this. :)
  
  This is still work-in-progress.
  
  Revision  Changes    Path
  1.161     +116 -75   src/sys/amd64/amd64/cpu_switch.S

