SCHED_ULE should not be the default
avg at FreeBSD.org
Thu Dec 22 20:25:42 UTC 2011
on 22/12/2011 21:47 Steve Kargl said the following:
> On Thu, Dec 22, 2011 at 09:01:15PM +0200, Andriy Gapon wrote:
>> on 22/12/2011 20:45 Steve Kargl said the following:
>>> I've used schedgraph to look at the ktrdump output. A jpg is
>>> available at http://troutmask.apl.washington.edu/~kargl/freebsd/ktr.jpg
>>> This shows the ping-pong effect where here 3 processes appear to be
>>> using 2 cpus while the remaining 2 processes are pinned to their
>> I'd recommend enabling CPU-specific background colors via the menu in
>> schedgraph for a better illustration of your findings.
>> NB: I still don't understand the point of purposefully running N+1 CPU-bound
> The point is that this is a node in a HPC cluster with
> multiple users. Sure, I can start my job on this node
> with only N cpu-bound jobs. Now, when user John Doe
> wants to run his OpenMPI program should he login into
> the 12 nodes in the cluster to see if someone is already
> running N cpu-bound jobs on a given node? 4BSD
> gives my jobs and John Doe's jobs a fair share of the
> available cpus. ULE does not give a fair share and
> if you read the summary file I put up on the web,
> you see that it is fairly non-deterministic on when a
> OpenMPI run will finish (see the mean absolute deviations
> in the table of 'real' times that I posted).
I think I know why the uneven load occurs. I remember even trying to explain my
There are two things:
1. ULE has neither a runqueue shared across all CPUs nor any other kind of
mechanism for enforcing true global fairness of CPU resource sharing.
2. ULE's rebalancing code is biased, and that leads to a situation where
sub-groups of threads can share subsets of CPUs rather fairly, but there is no
global fairness.
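To put rough numbers on the difference, here is a small sketch. It is not
scheduler code; group_share() is a made-up helper, and the 5-thread, 4-CPU
layout is an assumption chosen to resemble the reported case.

```c
/*
 * Illustration only, not ULE or 4BSD code.
 * Returns the fraction of one CPU that each thread receives when
 * nthreads CPU-bound threads evenly share ncpus CPUs, capped at one
 * full CPU per thread.
 */
double
group_share(int nthreads, int ncpus)
{
	double share = (double)ncpus / (double)nthreads;

	return (share > 1.0 ? 1.0 : share);
}
```

With one global queue, 5 threads on 4 CPUs would each get group_share(5, 4),
i.e. 80% of a CPU. If instead 3 threads end up sharing 2 CPUs while 2 threads
each keep a CPU to themselves, the first group gets group_share(3, 2), about
67% each, and the second group gets group_share(1, 1), i.e. 100% each. Each
sub-group is fair internally, but the machine as a whole is not.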
I haven't really given any thought as to how to fix or work around these issues.
One dumb idea is to add an element of randomness to a choice between equally
loaded CPUs (and their subsets) instead of having a permanent bias.
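A minimal sketch of what such a random tie-break could look like, purely as an
illustration. pick_least_loaded() and the plain load[] array are hypothetical
and do not correspond to anything in sched_ule.c, and rand() merely stands in
for whatever cheap random source a kernel would actually use.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not ULE code.
 * Returns the index of a CPU with the lowest load, choosing uniformly
 * at random among equally loaded candidates instead of always taking
 * the first one.  load[i] is the number of runnable threads on CPU i;
 * ncpu must be greater than zero.
 */
size_t
pick_least_loaded(const int *load, size_t ncpu)
{
	size_t best = 0;
	size_t ties = 1;	/* CPUs seen so far with the minimum load */

	for (size_t i = 1; i < ncpu; i++) {
		if (load[i] < load[best]) {
			/* Strictly better candidate: restart the tie count. */
			best = i;
			ties = 1;
		} else if (load[i] == load[best]) {
			/*
			 * Reservoir sampling: the k-th tied CPU replaces
			 * the current choice with probability 1/k.
			 * (rand() % ties has a slight modulo bias, which
			 * is ignored in this sketch.)
			 */
			ties++;
			if ((size_t)rand() % ties == 0)
				best = i;
		}
	}
	return (best);
}
```

With loads {2, 1, 1, 1}, repeated calls spread the choice over CPUs 1 to 3
rather than always returning CPU 1. Whether that would be enough to remove the
bias in the real rebalancing code is an open question.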
> There is the additional observation in one of my 2008
> emails (URLs have been posted) that if you have N+1
> cpu-bound jobs with, say, job0 and job1 ping-ponging
> on cpu0 (due to ULE's cpu-affinity feature) and if I
> kill job2 running on cpu1, then neither job0 nor job1
> will migrate to cpu1. So, one now has N cpu-bound
> jobs running on N-1 cpus.
Have you checked recently that that is still the case?
I would consider this a rather serious bug as opposed to merely sub-optimal
scheduling.
> Finally, my initial post in this email thread was to
> tell O. Hartman to quit beating his head against
> a wall with ULE (in an HPC environment). Switch to
> 4BSD. This was based on my 2008 observations and
> I've now wasted 2 days gathering additional information
> which only re-affirms my recommendation.
I think that any objective information has its value. So maybe the time is not
really wasted. I think there is no argument that for your usage pattern 4BSD is
better than ULE at the moment, because of the inherent design choices of both
schedulers and their current implementations. But I think that ULE could be
improved to produce more global fairness.
But, but, this thread has seen so many different problem reports about ULE
heaped together that it's very easy to get confused about what is caused by what
and what is real and what is not. E.g. I don't think that there is a direct
relation between this issue (N+1 CPU-bound tasks) and "my X is sluggish with ULE
when I untar a large file".
About the subject line. Let's recall why ULE has become a default. It has
happened because of many observations from users and developers that "things"
were faster/"snappier" with ULE than with 4BSD and a significant stream of
requests to make it the default.
So it's business as usual. The schedulers are different, so there are those for
whom one scheduler works better and those for whom the other works better and
those for whom both work reasonably well and those for whom neither is
satisfactory and those who don't really care/compare. There is a silent
majority and the vocal minorities. There are specific bugs and quirks,
advantages and disadvantages, usage patterns, hardware configurations and what
not. When everybody starts to talk at the same time, it's a huge mess. But
silently triaging and debugging one problem at a time also doesn't always work.
There, I've said it. Let me now try to recall why I felt a need to say all of