Re: Periodic rant about SCHED_ULE

Reply: Fernando_Apesteguía : "Re: Periodic rant about SCHED_ULE"
In reply to: Mark Johnston : "Re: Periodic rant about SCHED_ULE"
Go to: [ bottom of page ] [ top of archives ] [ this month ]
From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Mon, 17 Apr 2023 17:35:10 UTC
Ops, this fell through the cracks, apologies for such a late reply.

On 3/31/23, Mark Johnston <markj@freebsd.org> wrote:
> On Fri, Mar 31, 2023 at 08:41:41PM +0200, Mateusz Guzik wrote:
>> On 3/30/23, Mark Johnston <markj@freebsd.org> wrote:
>> > On Thu, Mar 30, 2023 at 05:36:54PM +0200, Mateusz Guzik wrote:
>> >> I looked into it a little more, below you can find summary and steps
>> >> forward.
>> >>
>> >> First a general statement: while ULE does have performance bugs, it
>> >> has better basis than 4BSD to make scheduling decisions. Most notably
>> >> it understands CPU topology, at least for cases which don't involve
>> >> big.LITTLE. For any non-freak case where 4BSD performs better, it is a
>> >> bug in ULE if this is for any reason other than a tradeoff which can
>> >> be tweaked to line them up. Or more to the point, there should not be
>> >> any legitimate reason to use 4BSD these days and modulo the bugs
>> >> below, you are probably losing on performance for doing so.
>> >>
>> >> Bugs reported in this thread by others and confirmed by me:
>> >> 1. failure to load-balance when having n CPUs and n + 1 workers -- the
>> >> excess one stays on one the same CPU thread continuously penalizing
>> >> the same victim. as a result total real time to execute a finite
>> >> computation is longer than in the case of 4BSD
>> >> 2. unfairness of nice -n 20 threads vs threads going frequently off
>> >> CPU (e.g., due to I/O) -- after using only a fraction of the slice the
>> >> victim has to wait for the cpu hog to use up its entire slice, rinse
>> >> and repeat. This extends a 7+ minute buildkernel to over 67 minutes,
>> >> not an issue on 4BSD
>> >>
>> >> I did not put almost any effort into investigating no 1. There is code
>> >> which is supposed to rebalance load across CPUs, someone(tm) will have
>> >> to sit through it -- for all I know the fix is trivial.
>> >>
>> >> Fixing number 2 makes *another* bug more acute and it complicates the
>> >> whole ordeal.
>> >>
>> >> Thus, bug reported by me:
>> >> 3. interactivity scoring is bogus -- originally introduced to detect
>> >> "interactive" behavior by equating being off CPU with waiting for user
>> >> input. One part of the problem is that it puts *all* non-preempted off
>> >> CPU time into one bag: a voluntary sleep. This includes suffering from
>> >> lock contention in the kernel, lock contention in the program itself,
>> >
>> > Note that time spent off-CPU on a turnstile is not counted as sleeping
>> > for the purpose of interactivity scoring, so this observation applies
>> > only to sx, lockmgr and sleepable rm locks.  That's not to say that
>> > this
>> > behaviour is correct, but it doesn't apply to some of the most
>> > contended
>> > locks unless I'm missing something.
>> >
>>
>> page busy (massively contested for fork/exec), pipe_lock and even
>> not-locks like waitpid(!)
>
> A program that spends most of its time blocked in waitpid, like a shell,
> interactive or not, should indeed have a higher scheduling priority...
>

Maybe it should, but perhaps not at the expense of a more
latency-sensitive program like a video player.

The very notion that off cpu == interactive dates back to the 80s
where it probably made sense, as the unix systems at the time were
mostly just terminal-only and the shell would indeed fit here very
nicely.

>> >> file I/O and so on, none of which has bearing on how interactive or
>> >> not the program might happen to be. A bigger part of the problem is
>> >> that at least today, the graphical programs don't even act this way to
>> >> begin with -- they stay on CPU *a lot*.
>> >
>> > I think this statement deserves more nuance.  I was on a video call
>> > just
>> > now and firefox was consuming about the equivalent of 20-30% of a CPU
>> > across all threads.  What kind of graphical programs are you talking
>> > about specifically?
>> >
>>
>> you don't consider 20-30% a lot?
>
> I would expect a program consuming 20-30% of a CPU to be prioritized
> higher than a CPU hog.  And in my experience, running builds while on a
> call doesn't hurt anything (usually).  Again, there is room for
> improvement, I don't claim the scheduler is perfect.
>

As noted one of the performance bugs is that the scheduler
*unintentionally* penalizes threads which go off cpu a lot for short
periods. If scheduler keeps them in the batch range and there is a hog
in the area, they are using getting disproportionately less cpu.
kernel build is one example I noted -- several times in increase in
total real time vs cpu hogs, while struggling to get any time. For all
I know this bug is why it works fine for you.

>> >> I asked people to provide me with the output of: dtrace -n
>> >> 'sched:::on-cpu { @[execname] = lquantize(curthread->td_priority, 0,
>> >> 224, 1); }' from their laptops/desktops.
>> >>
>> >> One finding is that most people (at least those who reported) use
>> >> firefox.
>> >>
>> >> Another finding is that the browser is above the threshold which would
>> >> be considered "interactive" for vast majority of the time in all
>> >> reported cases.
>> >
>> > That is not true of the output that I sent.  There, most of the firefox
>> > thread samples are in the interactive range [88-135].  Some show an
>> > even
>> > higher priority, maybe due to priority propagation.
>> >
>>
>> That's not the interactive range. 88 is PRI_MIN_BATCH
>
> 88 is PRI_MIN_TIMESHARE (on main, stable/13 ranges are different I
> think).  PRI_MIN_BATCH is PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE = 88 +
> 48 = 136.  Everything in [88-135] goes into the realtime queue.
>

You are right, I misread the code. static_boost seting prio to 72
solidified my misread.

Interestingly this does not change the crux of the matter -- that not
interactive processes cluster in terms of priorities with one which
are interactive. You can see it in your own report.

>> >> I booted a 2 thread vm with xfce and decided to click around. Spawned
>> >> firefox, opened a file manager (Thunar) and from there I opened a
>> >> movie to play with mpv. As root I spawned make -j 2 buildkernel. it
>> >> was not particularly good :)
>> >>
>> >> I found that mpv spawns a bunch of threads, most notably 2 distinct
>> >> threads for audio and video output. The one for video got a priority
>> >> of 175, while the rest had either 88 or 89 -- the lowest for
>> >> timesharing not considered interactive [note lower is considered
>> >> better].
>> >
>> > Presumably all of the video decoding was done in software, since you're
>> > running in a VM?  On my desktop, mpv does not consume much CPU and is
>> > entirely interactive.  Your test suggests that you expect ULE to
>> > prioritize a CPU hog, which doesn't seem realistic absent some
>> > scheduling hints from the user or the program itself.  Problem 2 is the
>> > opposite problem: timesharing CPU hogs are allowed to starve other
>> > timesharing threads.
>> >
>>
>> Now that I pointed out anything >= 88 is *NOT* interactive, are you
>> sure your mpv was considered interactive anyway?
>
> Yes.
>

See above :)

>> I don't expect ULE to prioritize CPU hogs. I'm pointing out how a hog
>> which was a part of an interactive program got shafted, further
>> showing how the method based on counting off cpu time does not work.
>
> You're saying that interactivity scoring should take into account
> overall process behaviour instead of just thread behviour?  Sure, that
> could be reasonable.
>

That's part of it, yes.

>> >> At the same time the file manager who was left in the background kept
>> >> doing evil syscall usage, which as a result bouncing between a regular
>> >> timesharing priority and one which made it "interactive", even though
>> >> the program was not touched for minutes.
>> >>
>> >> Or to put it differently, the scheduler failed to recognize that mpv
>> >> is the program to prioritize, all while thinking the background time
>> >> waster is the thing to look after (so to speak).
>> >>
>> >> This brings us to fixing problem 2: currently, due to the existence of
>> >> said problem, the interactivity scoring woes are less acute -- the
>> >> venerable make -j example is struggling to get CPU time, as a result
>> >> messing with real interactive programs to a lesser extent. If that
>> >> gets fixed, we are in a different boat altogether.
>> >>
>> >> I don't see a clean solution.
>> >>
>> >> Right now I'm toying with the idea of either:
>> >> 1. having programs explicitly tell the kernel they are interactive
>> >
>> > I don't see how this can work.  It's not just traditional "interactive"
>> > programs that benefit from this scoring, it applies also to network
>> > servers and other programs which spend most of their time sleeping but
>> > want to handle requests with low latency.
>> >
>> > Such an interface would also let any program request soft realtime
>> > scheduling without giving up the ability to monopolize CPU time, which
>> > goes against ULE's fairness goals.
>> >
>>
>> Clearly it would be gated with some permission, so only available on a
>> desktop for example.
>>
>> Then again see my response else in the thread: x server could be
>> patched to mark threads.
>
> To do what?
>

To tell the kernel they are interactive clients so that it does not
have to speculate.

Same with pulseaudio and whatever direct /dev/dsp consumer.

>> And it does not go against any fairness goals -- it very much can be
>> achieved, but one has information who can be put off cpu for a longer
>> time without introducing issues.
>>
>> >> 2. adding a scheduler hook to /dev/dsp -- the observation is that if a
>> >> program is producing sound it probably should get some cpu time in a
>> >> timely manner. this would cover audio/video players and web browsers,
>> >
>> > On my system at least firefox doesn't open /dev/dsp, it sends audio
>> > streams to pulseaudio.
>> >
>>
>> I think I noted elsewhere in the thread that pulseaudio may need the
>> same treatment as the x server.
>>
>> >> but would not cover other programs (say a pdf reader). it may be it is
>> >> good enough though
>> >
>> > I think some more thorough analysis, using tools like schedgraph or
>> > KUtrace[1], is needed to characterize the problems you are reporting
>> > with interactivity scoring.  It's also not clear how any of this would
>> > address the problem that started this thread, wherein two competing
>> > timesharing (i.e., non-interactive) workloads get uneven amounts of CPU
>> > time.
>> >
>>
>> I explicitly stated I have not looked into this bit.
>>
>> > There is absolutely room for improvement in ULE's scheduling decisions.
>> > It seems to be common practice to tune various ULE parameters to get
>> > better interactive performance, but in general I see no analysis
>> > explaining /why/ exactly they help and what goes wrong with the default
>> > parameter values in specific workloads.  schedgraph is a very useful
>> > tool for this sort of thing.
>> >
>>
>> I tried schedgraph in the past to look at buildkernel and found it
>> does not cope with the amount of threads, at least on my laptop.
>>
>> > Such tools also required to rule out bugs in ULE itself, when looking
>> > at
>> > abnormal scheduling behaviour.  Last year some scheduler races[2] were
>> > fixed that apparently hurt system performance on EPYC quite a bit.  I
>> > was told privately that applying those patches to 13.1 improved IPSec
>> > throughput by ~25% on EPYC, and I wouldn't be surprised if there are
>> > more improvements to be had which don't involve modifying core
>> > heuristics of the scheduler.  Either way this requires deeper analysis
>> > of ULE's micro-level behaviour; I don't think "interactivity scoring is
>> > bogus" is a useful starting point.
>> >
>>
>> I provided explicit examples how it marked a background thread as
>> interactive, while the real hard worker (if you will) as not
>> interactive, because said worker was not acting the way ULE expects.
>>
>> A bandaid for the time being will stop shafting processes giving up
>> their time slice early in the batch queue, along with some fairness
>> for the rest who does not (like firefox). I'll hack it up for testing.
>>
>> --
>> Mateusz Guzik <mjguzik gmail.com>
>


-- 
Mateusz Guzik <mjguzik gmail.com>