Re: Periodic rant about SCHED_ULE

From: Bob Bishop <rb_at_gid.co.uk>
Date: Thu, 23 Mar 2023 11:01:44 UTC

Hi,

On 23 Mar 2023, at 10:07, David Chisnall <theraven@freebsd.org> wrote:
> 
> On 22/03/2023 20:15, Stefan Esser wrote:
>> Better balancing of the load would probably make ULE take less real
>> time. The example of 9 identical tasks on 8 cores with 7 tasks getting
>> 100% of a core and the other 2 sharing a core and getting 50% each
>> could be resolved by moving a CPU bound process from the CPU with the
>> highest load to a random CPU (probably not the one with the lowest load
>> or limited to the same cluster or NUMA domain, since then it would stay
>> in a subset of the available cores).
> 
> Two things have changed in CPUs since ULE was written that make the affinity less of a win and may make some low-frequency random rebalancing better:
> 
> Snooping from another core's L1 is a lot cheaper (less true on multi-socket systems, but fortunately ULE is NUMA-aware and so can factor this in), which makes the cost of migrating a thread to another core much cheaper (there are still kernel synchronisation costs, but the cost of running on a core that doesn't have a warm cache is lower: the caches warm very quickly).
> 
> CPUs now have a lot more power domains.  If one core is doing a lot more work than others then there's a good chance that it will be thermally throttled but others may not if they're in a separate power / thermal domain.  This means that keeping a compute-bound process on the same core is the worst thing that you can do if other cores are idle: that core may be throttled back to <2 GHz whereas a core on the other side of the chip may be able to run at >3 GHz.  Evenly heating the entire CPU can give much better performance if the number of active threads is less than the number of running cores and better fairness in other cases.
> 
> Both ULE and 4BSD are unaware of the heterogeneity of modern CPUs, which often have 2-3 different kinds of core that run at different speeds, and neither understands a concept of a power budget, so there's a lot of potential improvement here.  Writing a bad (but working) scheduler is a fairly difficult task and writing a good one is much harder, so I'm not volunteering to do it, but if someone is interested then it would probably be a good candidate for Foundation funding.  I've heard good things about the XNU scheduler recently; that might be a good source of inspiration.
> 
> David
> 

This is spot on as a summary of the landscape. The macOS scheduler (based on XNU) [1] seems to do a pretty good job with heterogeneous cores vs power management, and macOS has APIs allowing applications to take account of the thermal state of the total system [2]. But I haven’t seen any references to fine-grained thermal management as outlined above.

[1] https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html
[2] https://developer.apple.com/library/archive/documentation/Performance/Conceptual/power_efficiency_guidelines_osx/RespondToThermalStateChanges.html
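
For what it's worth, Stefan's 9-tasks-on-8-cores example is easy to play with in a toy model. The sketch below is only an illustration, not ULE code: the function name simulate(), its parameters and the "one quantum per tick, split equally among the tasks on a core" rule are all assumptions made up for this example, and it ignores migration cost, cache effects and thermal throttling entirely. It compares fixed placement with the low-frequency random rebalancing suggested above (move one task off the busiest core to a random other core).

```python
import random


def simulate(ticks, rebalance_every=None, ncores=8, ntasks=9, seed=0):
    """Toy model: return the CPU time each task received after `ticks` quanta.

    Hypothetical helper for illustration only; nothing here mirrors real
    scheduler internals.
    """
    rng = random.Random(seed)
    # Initial placement: one task per core, surplus tasks pile onto the last core.
    core_of = [min(t, ncores - 1) for t in range(ntasks)]
    cpu_time = [0.0] * ntasks

    for tick in range(ticks):
        if rebalance_every and tick > 0 and tick % rebalance_every == 0:
            # Pick one task on the busiest core and move it to a random other core.
            load = [core_of.count(c) for c in range(ncores)]
            busiest = max(range(ncores), key=load.__getitem__)
            victim = rng.choice([t for t in range(ntasks) if core_of[t] == busiest])
            core_of[victim] = rng.choice([c for c in range(ncores) if c != busiest])

        # Each core hands out one quantum, split equally among its tasks.
        load = [core_of.count(c) for c in range(ncores)]
        for t in range(ntasks):
            cpu_time[t] += 1.0 / load[core_of[t]]

    return cpu_time


if __name__ == "__main__":
    ticks = 9000
    for label, period in (("fixed placement", None), ("rebalance every 10 ticks", 10)):
        shares = [x / ticks for x in simulate(ticks, rebalance_every=period)]
        print(f"{label}: min share {min(shares):.2f}, max share {max(shares):.2f}")
```

With fixed placement the model reproduces the 7 x 100% plus 2 x 50% split from the example. With rebalancing the total CPU time handed out is unchanged (there are never idle cores in this model), but the shares should drift towards the fair value of 8/9 per task, so the unlucky tasks finish sooner. Whether that survives contact with real migration costs is exactly the open question.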

--
Bob Bishop
rb@gid.co.uk