cvs commit: src/sys/kern sched_ule.c

Mon Oct 1 23:54:40 PDT 2007

On Mon, 1 Oct 2007, Bruce Evans wrote:

> On Sun, 30 Sep 2007, Jeff Roberson wrote:
>
>> On Sat, 29 Sep 2007, Kevin Oberman wrote:
>
>>> YMMV, but ULE seems to generally work better than 4BSD for interactive
>>> uniprocessor systems. The preferred scheduler for uniprocessor servers
>>> is less clear, but many tests have shown ULE does better for those
>>> systems in the majority of cases.
>> 
>> I feel it's safe to say desktop behavior on UP is definitely superior.
>
> This is unsafe to say.
>
>> I think there is no significant difference on UP between 4BSD and ULE
>
> This may be safe to say, but is inconsistent with the above.
>
>> except perhaps in context switching microbenchmarks where ULE falls behind.
>
> It is safe to say that interactive users cannot notice insignificant
> differences.  It takes a micro-benchmark to notice possibly-significant
> differences of hundreds or even thousands of nanoseconds for context
> switching.

Well, speaking of context switch microbenchmarks...

I recently looked at lmbench but was dissatisfied with the way it measures. 
Specifically, I want to see how context switch times scale as you add lots 
of threads that are running concurrently.  The #procs argument to lat_ctx 
does not run these processes concurrently.  They each are woken in turn as 
a token passes through a chain of pipes.
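
To make the contrast concrete, here is a minimal sketch of the token-passing 
pattern described above: a ring of processes joined by pipes, where only the 
process holding the one-byte token is runnable at any moment.  This is not 
lmbench's lat_ctx source; the names relay() and token_ring() are hypothetical 
and chosen only for illustration.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forward a one-byte token from `in` to `out`, `laps` times. */
static int
relay(int in, int out, long laps)
{
	char tok;

	for (long i = 0; i < laps; i++)
		if (read(in, &tok, 1) != 1 || write(out, &tok, 1) != 1)
			return (-1);
	return (0);
}

/*
 * Hypothetical helper: pass a token `laps` times around a ring of
 * `nprocs` processes (the caller is member 0).  Member i reads from
 * pipe i and writes to pipe (i + 1) % nprocs, so every member except
 * the token holder is asleep in read().  Returns 0 on success.
 */
int
token_ring(int nprocs, long laps)
{
	int (*p)[2];
	int started = 0, failed = 0;
	char tok = 't';

	if (nprocs < 1 || laps < 0)
		return (-1);
	if ((p = malloc(sizeof(*p) * nprocs)) == NULL)
		return (-1);
	for (int i = 0; i < nprocs; i++) {
		if (pipe(p[i]) < 0) {
			while (i-- > 0) {
				close(p[i][0]);
				close(p[i][1]);
			}
			free(p);
			return (-1);
		}
	}
	for (int i = 1; i < nprocs && !failed; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			failed = 1;
		} else if (pid == 0) {
			int in = p[i][0], out = p[(i + 1) % nprocs][1];

			/* Keep only our two ends so EOF propagates on error. */
			for (int j = 0; j < nprocs; j++) {
				if (p[j][0] != in)
					close(p[j][0]);
				if (p[j][1] != out)
					close(p[j][1]);
			}
			_exit(relay(in, out, laps) == 0 ? 0 : 1);
		} else {
			started++;
		}
	}
	/* Member 0: inject the token, then wait for it to come back. */
	for (long l = 0; l < laps && !failed; l++)
		if (write(p[1 % nprocs][1], &tok, 1) != 1 ||
		    read(p[0][0], &tok, 1) != 1)
			failed = 1;
	for (int i = 0; i < nprocs; i++) {
		close(p[i][0]);
		close(p[i][1]);
	}
	free(p);
	for (int i = 0; i < started; i++) {
		int status;

		if (wait(&status) < 0 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) != 0)
			failed = 1;
	}
	return (failed ? -1 : 0);
}
```

However many processes are in the ring, at most one is on a run queue at a 
time, which is why this shape says little about how switching scales with 
many concurrently runnable threads.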

I wrote a simple tool that does a given number of switches with a given 
number of processes.  I then simply time the total execution with 
'time'.  This avoids the overhead of pipes, sleep/wakeup, and other 
complexities.  Instead, it uses sched_yield().  The tool is available at:

http://people.freebsd.org/~jeff/yield.c and yield.sh is what I have been 
using to measure.
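
For readers who cannot fetch the file, the approach can be sketched roughly 
as follows.  This is an assumption-laden illustration, not the contents of 
yield.c: the function names and the even split of yields across processes 
are hypothetical.  It forks a given number of processes that are all 
runnable at once and has each call sched_yield() in a loop.

```c
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child body: give up the CPU `yields` times. */
static void
yield_loop(long yields)
{
	for (long i = 0; i < yields; i++)
		sched_yield();
}

/*
 * Hypothetical helper: fork `nprocs` children that run concurrently
 * and split `total_yields` calls to sched_yield() evenly among them
 * (any remainder is dropped).  Returns 0 if every child exited
 * cleanly, -1 otherwise.
 */
int
run_yielders(int nprocs, long total_yields)
{
	long per_proc;
	int started = 0, failed = 0;

	if (nprocs <= 0 || total_yields < 0)
		return (-1);
	per_proc = total_yields / nprocs;

	for (int i = 0; i < nprocs; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			failed = 1;
			break;
		}
		if (pid == 0) {
			yield_loop(per_proc);
			_exit(0);
		}
		started++;
	}
	/* Reap every child that was actually started. */
	for (int i = 0; i < started; i++) {
		int status;

		if (wait(&status) < 0 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) != 0)
			failed = 1;
	}
	return (failed ? -1 : 0);
}
```

Wrapping a call such as run_yielders(100, 1000000) in a main() and running 
the binary under time(1) gives one data point.  The fork() and wait() cost 
is included in the total, which matters once the process count gets large.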

I found that ULE on UP was 10% slower than 4BSD at 1 and 10 
concurrent threads and 5% slower at 100.  It broke even at 1000 and was 
about 22% faster at 5,000.  Then I wrote:

http://people.freebsd.org/~jeff/ulefaster.diff

With this patch, ULE is indistinguishable from 4BSD at 1, 10, 100, and 
1000 threads while being 24% faster at 5,000.  The 5,000 case is 
anomalous.  I think after 100 we must no longer fit in cache.  At 5,000 
the time to fork() and wait() actually shows up significantly.  Here's 
output for 4BSD on UP:

         5.69 real         1.17 user         4.48 sys
         7.66 real         1.60 user         6.02 sys
         8.37 real         1.90 user         6.43 sys
        37.96 real        14.28 user        23.26 sys
        68.50 real        14.16 user        45.20 sys

And ULE with the above patch:

         5.62 real         1.23 user         4.36 sys
         7.73 real         1.97 user         5.74 sys
         8.34 real         2.01 user         6.30 sys
        38.00 real        13.60 user        24.20 sys
        52.42 real        13.84 user        38.32 sys

I did multiple runs but didn't average them.  They always ended up in the 
same ballpark and the patch made such a significant change that I didn't 
bother to record and analyze multiple runs.
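
The percentages quoted here can be rechecked from the "real" column of the 
two UP listings.  A small sketch, with hypothetical function names, using 
only the numbers printed above:

```c
#include <stdio.h>

/* Percentage by which `other` is faster than `base` (negative if slower). */
double
percent_faster(double base, double other)
{
	return ((base - other) / base * 100.0);
}

/* "real" seconds copied from the two UP listings, in the order printed. */
static const double bsd4_real[] = { 5.69, 7.66, 8.37, 37.96, 68.50 };
static const double ule_real[] = { 5.62, 7.73, 8.34, 38.00, 52.42 };

/* Print one line per run: 4BSD time, patched ULE time, ULE advantage. */
void
print_up_comparison(FILE *out)
{
	size_t n = sizeof(bsd4_real) / sizeof(bsd4_real[0]);

	for (size_t i = 0; i < n; i++)
		fprintf(out, "%8.2f %8.2f %+7.1f%%\n", bsd4_real[i],
		    ule_real[i], percent_faster(bsd4_real[i], ule_real[i]));
}
```

The last pair (68.50 vs. 52.42) works out to roughly 23.5%, consistent with 
the 24% figure; the first four pairs all differ by less than 2%.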

On SMP, ULE pays a price for the per-CPU run queue locks.  How well does 
that pay off?  Here's ULE on an 8-core Opteron:

         3.91 real         0.35 user         3.55 sys
         1.70 real         0.44 user         6.63 sys
         1.25 real         1.77 user         8.10 sys
         4.49 real        14.46 user        21.43 sys
        14.32 real        25.58 user        88.07 sys

And 4BSD on the same machine:

        39.38 real         0.59 user        38.77 sys
        62.47 real         0.84 user       493.07 sys
        66.42 real        12.23 user       517.77 sys
        69.38 real        25.13 user       523.52 sys
       131.33 real        33.33 user       930.52 sys

The combination of reduced scheduler locking and improved cache affinity 
pays off at about 10x the switch throughput of 4BSD.  The actual 
cost of the extra synchronization in ULE is about a 5% penalty as measured 
with kern.smp.disabled=1; however, I lost that data and am not interested 
in rebooting 3 more times to reclaim it.

Cheers,
Jeff

>
> ULE may give higher priority to interactive processes, but most loss of
> interactivity is caused by blocking on I/O, and there is nothing
> a scheduler can do to speed up slow or overloaded devices.
>
> Bruce
>