cvs commit: src/sys/kern sched_ule.c
jroberson at chesapeake.net
Mon Oct 1 23:54:40 PDT 2007
On Mon, 1 Oct 2007, Bruce Evans wrote:
> On Sun, 30 Sep 2007, Jeff Roberson wrote:
>> On Sat, 29 Sep 2007, Kevin Oberman wrote:
>>> YMMV, but ULE seems to generally work better than 4BSD for interactive
>>> uniprocessor systems. The preferred scheduler for uniprocessor servers
>>> is less clear, but many tests have shown ULE does better for those
>>> systems in the majority of cases.
>> I feel it's safe to say desktop behavior on UP is definitely superior.
> This is unsafe to say.
>> I think there is no significant difference on UP between 4BSD and ULE
> This may be safe to say, but is inconsistent with the above.
>> except perhaps in context switching microbenchmarks where ULE falls behind.
> It is safe to say that interactive users cannot notice insignificant
> differences. It takes a micro-benchmark to notice possibly-significant
> differences of hundreds or even thousands of nanoseconds for context
Well speaking of context switch microbenchmarks...
I recently looked at lmbench but was dissatisfied with the way it measures.
Specifically, I want to see how context switch times scale as you add lots
of threads that are running concurrently. The #procs argument to lat_ctx
does not run these processes concurrently. They each are woken in turn as
a token passes through a chain of pipes.
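To make the token-passing pattern concrete, here is a minimal sketch of a ring of processes joined by pipes. It is not lmbench's lat_ctx code; the function name token_ring() and its parameters are made up for illustration. It only shows why at most one process in the ring is runnable at any moment: everyone else is blocked in read() waiting for the token.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical illustration, not lmbench code.  Pass a one-byte token
 * `laps` times around a ring of `nprocs` processes.  Pipe i carries the
 * token from process i to process (i + 1) % nprocs; the caller acts as
 * process 0.  Returns 0 on success, -1 on error.
 */
int
token_ring(int nprocs, long laps)
{
	int (*fds)[2];
	char tok = 't';
	long lap;
	int i, j, status, nchild = 0, rc = 0;

	if (nprocs < 1 || (fds = malloc(sizeof(*fds) * nprocs)) == NULL)
		return (-1);
	for (i = 0; i < nprocs; i++) {
		if (pipe(fds[i]) != 0) {
			while (i-- > 0) {
				close(fds[i][0]);
				close(fds[i][1]);
			}
			free(fds);
			return (-1);
		}
	}
	for (i = 1; i < nprocs && rc == 0; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			rc = -1;
		} else if (pid == 0) {
			/* Child i keeps only its input and output ends. */
			for (j = 0; j < nprocs; j++) {
				if (j != i - 1)
					close(fds[j][0]);
				if (j != i)
					close(fds[j][1]);
			}
			for (lap = 0; lap < laps; lap++) {
				/* Sleep until the token arrives, then pass it on. */
				if (read(fds[i - 1][0], &tok, 1) != 1 ||
				    write(fds[i][1], &tok, 1) != 1)
					_exit(1);
			}
			_exit(0);
		} else {
			nchild++;
		}
	}
	/* Process 0 injects the token and waits for it to come back. */
	for (lap = 0; rc == 0 && lap < laps; lap++) {
		if (write(fds[0][1], &tok, 1) != 1 ||
		    read(fds[nprocs - 1][0], &tok, 1) != 1)
			rc = -1;
	}
	for (j = 0; j < nprocs; j++) {
		close(fds[j][0]);
		close(fds[j][1]);
	}
	free(fds);
	while (nchild-- > 0) {
		if (wait(&status) < 0 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) != 0)
			rc = -1;
	}
	return (rc);
}
```

Every switch here goes through a pipe write plus a sleep/wakeup, and the run queue never holds more than one of these processes at a time, no matter how large nprocs is.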
I wrote a simple tool that does a given number of switches with a given
number of processes. I then simply time the total execution with
'time'. This avoids the overhead of pipes, sleep/wakeup, and other
complexities. Instead, it uses sched_yield(). The tool is available at:
http://people.freebsd.org/~jeff/yield.c and yield.sh is what I have been
using to measure.
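For readers who cannot fetch the file, the idea can be sketched roughly as below. This is not the actual yield.c; the function name run_yielders() and the choice to have every child perform the full yield count are assumptions made for illustration.

```c
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical sketch, not the real yield.c.  Fork `nprocs` children that
 * each call sched_yield() `nyields` times, so all of them stay runnable
 * and compete for the CPU at once.  Returns the number of children that
 * exited cleanly, or -1 if not every fork() succeeded.
 */
int
run_yielders(int nprocs, long nyields)
{
	int i, status, started = 0, ok = 0;
	long n;

	for (i = 0; i < nprocs; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0) {
			/* Child: never sleeps, only gives up the CPU. */
			for (n = 0; n < nyields; n++)
				sched_yield();
			_exit(0);
		}
		started++;
	}
	/* Reap every child that was actually started. */
	for (i = 0; i < started; i++) {
		if (wait(&status) > 0 && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 0)
			ok++;
	}
	return (started == nprocs ? ok : -1);
}
```

Wrapped in a small main() that reads the two counts from argv, a binary like this would be run under 'time' once per process count. Note that fork() and wait() are inside the timed region, which matters once the process count gets large.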
I found that ULE on UP was 10% slower than 4BSD at 1 and 10
concurrent threads and 5% slower at 100. It broke even at 1000 and was
about 22% faster at 5,000. Then I wrote:
This is indistinguishable from 4BSD at 1, 10, 100, and 1000 threads while
being 24% faster at 5,000. The 5,000 case is anomalous. I think after
100 we must no longer fit in cache. At 5,000 the time to fork() and
wait() actually shows up significantly. Here's output for 4BSD on UP:
5.69 real 1.17 user 4.48 sys
7.66 real 1.60 user 6.02 sys
8.37 real 1.90 user 6.43 sys
37.96 real 14.28 user 23.26 sys
68.50 real 14.16 user 45.20 sys
And ULE with the above patch:
5.62 real 1.23 user 4.36 sys
7.73 real 1.97 user 5.74 sys
8.34 real 2.01 user 6.30 sys
38.00 real 13.60 user 24.20 sys
52.42 real 13.84 user 38.32 sys
I did multiple runs but didn't average them. They always ended up in the
same ballpark and the patch made such a significant change that I didn't
bother to record and analyze multiple runs.
On SMP ULE pays a price for the per-cpu run queue locks. How well does
that pay off? Here's ULE on an 8 core Opteron:
3.91 real 0.35 user 3.55 sys
1.70 real 0.44 user 6.63 sys
1.25 real 1.77 user 8.10 sys
4.49 real 14.46 user 21.43 sys
14.32 real 25.58 user 88.07 sys
And 4BSD on the same:
39.38 real 0.59 user 38.77 sys
62.47 real 0.84 user 493.07 sys
66.42 real 12.23 user 517.77 sys
69.38 real 25.13 user 523.52 sys
131.33 real 33.33 user 930.52 sys
The combination of reduced scheduler locking and improved cache affinity
pays off at about 10x the switch throughput of 4BSD. The actual
cost of the extra synchronization in ULE is about a 5% penalty as measured
with smp.disabled = 1. However, I lost that data and am not interested in
rebooting 3 more times to reclaim it.
> ULE may give higher priority to interactive processes, but most loss of
> interactivity is caused by blocking on I/O, and there is nothing
> a scheduler can do to speed up slow or overloaded devices.