amd64 cpu_switch in C.

Tue Mar 11 02:25:27 UTC 2008

http://people.freebsd.org/~jeff/amd64.diff

At the above address there is an implementation of cpu_switch() and
cpu_throw() for amd64 written almost entirely in C.  I'm posting this for
discussion and eventual commit.  There are numerous reasons to do this; I
will outline some of them.

Implementing the bulk of the code in C allows us to add or modify higher
level features more easily.  For example, we can change the pmap active
bits to use a cpuset_t so we can support more than 64 CPUs.  It also makes
the code faster in two ways: we can afford more complicated checks that
save time, such as skipping the writes to the fs/gsbase MSRs when they
have not changed, and infrequently used options can be moved out of the
normal code paths.
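
To make the MSR check concrete, here is a minimal sketch of the idea.  It
is not the code from the diff: struct thread_ctx, wrmsr_stub() and
switch_segment_bases() are hypothetical names, and the MSR write is
simulated with a counter so the example can run in userland.  The two MSR
numbers are the architectural FS.base and GS.base registers; which MSR a
real kernel writes for the user gsbase is a detail this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural MSR numbers for the FS and GS base registers on amd64. */
#define MSR_FSBASE 0xc0000100u
#define MSR_GSBASE 0xc0000101u

/* Hypothetical per-thread state: only the fields this sketch needs. */
struct thread_ctx {
	uint64_t fsbase;
	uint64_t gsbase;
};

/* Counts simulated MSR writes so the saving can be observed. */
static unsigned long msr_writes;

/* Stand-in for the privileged wrmsr instruction (assumption: simulated). */
static void
wrmsr_stub(uint32_t msr, uint64_t value)
{
	(void)msr;
	(void)value;
	msr_writes++;
}

/* Load the new thread's bases, skipping any MSR whose value is unchanged. */
static void
switch_segment_bases(const struct thread_ctx *oldtd,
    const struct thread_ctx *newtd)
{
	assert(oldtd != NULL && newtd != NULL);
	if (newtd->fsbase != oldtd->fsbase)
		wrmsr_stub(MSR_FSBASE, newtd->fsbase);
	if (newtd->gsbase != oldtd->gsbase)
		wrmsr_stub(MSR_GSBASE, newtd->gsbase);
}
```

Switching between two threads that share the same fsbase and gsbase then
costs two compares and no MSR writes, which is the kind of check that is
awkward to maintain in assembly.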

In fact, the C version is ~10% faster than the assembly version in a
two-thread sched_yield() test on a single-CPU Opteron:

x asm.yield
+ csw.yield
+------------------------------------------------------------------------------+
|     ++                                              x  x                     |
|+ ++ ++ +  + +          +  +   ++ +x    x     x      x  xxx                  x|
| |______M_____A___________|               |__________AM__________|            |
+------------------------------------------------------------------------------+
     N           Min           Max        Median           Avg        Stddev
x  10          5.17          5.88           5.5         5.479    0.19272606
+  15          4.58          5.16          4.71     4.8126667    0.20738049
Difference at 95.0% confidence
         -0.666333 +/- 0.170431
         -12.1616% +/- 3.11062%
         (Student's t, pooled s = 0.201773)

This test measures the total time to call sched_yield() 10,000,000 times
between two threads.  Two threads are needed to be sure that the scheduler
doesn't pick the same thread twice and skip cpu_switch().  The roughly 10%
overall speedup is notable because the cpu_switch() routine was consuming
less than 40% of the CPU prior to the change.  So the routine itself is
almost a third faster.

Peter also suggested that we can delay portions of the switch until the
user boundary.  For workloads that involve heavy kernel activity on the
user's behalf, with multiple switches per syscall, this would be a big
saving.  We could also use this as a framework to implement custom switch
routines if we want to switch directly to ithreads or taskqueue threads in
the future.

The C routine is supplemented by two assembly routines which are 
responsible for saving the core architecture state and manipulating the 
stack.  These total approximately 50 assembly instructions and are similar 
to savecontext/swapcontext.

The C code saves the old thread's context but keeps running on its stack
as it continues the switch.  This is safe because the old thread stays
locked until we call cpu_switchin(), which is similar to swapcontext.
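
As a rough userland analogy for this two-step shape (save first, keep
working on the old stack, then switch in the new context), here is a
sketch using the POSIX ucontext calls.  It is only an illustration, not
the kernel code: run_switch_demo(), new_thread_body() and the trace array
are hypothetical, getcontext() stands in for the save routine and
setcontext() for cpu_switchin().

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

static ucontext_t old_ctx, new_ctx;
static char new_stack[64 * 1024];	/* stack for the "new thread" */
static volatile int resumed;		/* set once old_ctx has been saved */
static int trace[3];			/* order in which the steps ran */
static int ntrace;

static void
record(int event)
{
	assert(ntrace < 3);
	trace[ntrace++] = event;
}

/* Runs on new_stack, then switches back to the saved old context. */
static void
new_thread_body(void)
{
	record(2);
	setcontext(&old_ctx);
}

static void
run_switch_demo(void)
{
	ntrace = 0;
	resumed = 0;

	/* Prepare the context we are going to switch to. */
	getcontext(&new_ctx);
	new_ctx.uc_stack.ss_sp = new_stack;
	new_ctx.uc_stack.ss_size = sizeof(new_stack);
	new_ctx.uc_link = &old_ctx;
	makecontext(&new_ctx, new_thread_body, 0);

	/* Step 1: save the old context.  This call returns a second
	 * time when new_thread_body() switches back to old_ctx. */
	getcontext(&old_ctx);
	if (!resumed) {
		resumed = 1;
		/* Still on the old stack: the rest of the switch
		 * bookkeeping would happen here. */
		record(1);
		/* Step 2: leave the old stack for the new context. */
		setcontext(&new_ctx);
	}
	/* Reached only after the old context has been resumed. */
	record(3);
}
```

Calling run_switch_demo() records the steps in the order 1, 2, 3: work on
the old stack after the save, then the new context, then the resumed old
context.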

The only appreciable downside is that it lowers the barrier to entry for
modifying a very sensitive piece of code.  Still, I think the flexibility
it gives us outweighs that concern.

Comments?

Thanks,
Jeff