Re: Periodic rant about SCHED_ULE

From: Steve Kargl <sgk_at_troutmask.apl.washington.edu>
Date: Mon, 27 Mar 2023 16:31:18 UTC
On Mon, Mar 27, 2023 at 04:47:04PM +0200, Mateusz Guzik wrote:
> 

(A massive amount trimmed to keep this short.)

> Aight, now that I've had a sober look at the code, I think I cracked the case.
> 
> The runq mechanism used by both 4BSD and ULE provides 64(!) queues;
> a thread's priority is divided by the number of priority levels per
> queue, and the result is the index of the queue the thread lands in.
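
For context, here is a minimal userspace sketch of that index
computation.  The names follow sys/runq.h; the constants (256
priority levels spread over 64 queues, 4 levels per queue) are
restated here from memory and should be treated as assumptions:

  /* Map a thread priority to one of the RQ_NQS run queues. */
  #define RQ_NQS  64      /* number of run queues */
  #define RQ_PPQ  4       /* priority levels per queue (256 / 64) */

  static inline int
  runq_index(int priority)            /* priority in [0, 255] */
  {
          return (priority / RQ_PPQ); /* index in [0, RQ_NQS) */
  }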
> 
> When deciding what to run, 4BSD uses runq_choose, which iterates all
> queues from the beginning. This means threads with numerically lower
> (i.e., better) priority keep executing before the rest. In particular,
> a cpu hog lands with a high priority value, looking worse than
> make -j 8 buildkernel, and only runs when there is nothing else ready
> to get the cpu. While this may sound decent, it is bad -- in principle
> a steady stream of better-priority threads can starve the hogs
> indefinitely.
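
To make the 4BSD policy concrete, a self-contained sketch of the
selection loop (a simplification of runq_choose(); the real code
finds the first non-empty queue via a status bitmap and ffs()
rather than walking the array, but the policy is the same):

  #include <sys/queue.h>
  #include <stddef.h>

  #define RQ_NQS  64

  struct thread {
          TAILQ_ENTRY(thread) td_runq;
  };
  TAILQ_HEAD(rqhead, thread);

  struct runq {
          struct rqhead rq_queues[RQ_NQS];
  };

  /*
   * Always take the head of the first (best-priority) non-empty
   * queue.  A steady supply of threads in low-numbered queues
   * therefore keeps the high-numbered (hog) queues waiting
   * indefinitely.
   */
  static struct thread *
  runq_choose_sketch(struct runq *rq)
  {
          int pri;

          for (pri = 0; pri < RQ_NQS; pri++)
                  if (!TAILQ_EMPTY(&rq->rq_queues[pri]))
                          return (TAILQ_FIRST(&rq->rq_queues[pri]));
          return (NULL);
  }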
> 
> The problem was recognized when writing ULE, but improperly fixed --
> it ends up distributing all threads within a given priority range
> across the queues and then performing a lookup in a given queue. Here
> the problem is that while technically everyone does get a chance to
> run, threads which do not use their full slices are hosed for the
> time period, as they spend *a lot* of it waiting behind the hog.
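
The "distributing across the queues" part is ULE's circular
(calendar-style) timeshare queue.  A sketch of the insert-index
logic, loosely after tdq_runq_add() in sched_ule.c -- tdq_idx is
the slot currently being drained, and the batch-range constants
below are placeholders, not the real values from sys/priority.h:

  #define RQ_NQS          64
  #define PRI_MIN_BATCH   0   /* placeholder lower bound */
  #define PRI_BATCH_RANGE 64  /* placeholder range width */

  /*
   * Rotate the thread's priority offset by the slot currently
   * being drained, so insertion is circular: every slot is
   * eventually reached (no starvation), but a thread that does
   * not use its full slice can land many slots behind a hog and
   * wait out the whole rotation.
   */
  static int
  ule_timeshare_index(int priority, int tdq_idx)
  {
          int off;

          off = RQ_NQS * (priority - PRI_MIN_BATCH) / PRI_BATCH_RANGE;
          return ((off + tdq_idx) % RQ_NQS);
  }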
> 
> A hack patch to induce the bogus-but-better 4BSD behavior of draining
> all runqs before running higher-priority-value threads drops the
> build time down to ~9 minutes, which is shorter than with 4BSD.
> 
> However, the right fix would achieve that *without* introducing
> starvation potential.
> 
> I also note the runqs are a massive waste of memory and computing
> power. I'm going to have to sleep on what to do here.
> 
> For the interested, here is the hackery:
> https://people.freebsd.org/~mjg/.junk/ule-poc-hacks-dont-use.diff
> 
> sysctl kern.sched.slice_nice=0
> sysctl kern.sched.preempt_thresh=400 # arbitrary number higher than any prio
 
Mateusz,

Thanks for taking a deeper look at the schedulers
and providing your analysis.  If you come up with
any patches that you would like to have tested
further, feel free to ping me on or off the
mailing list.


-- 
Steve