Re: Periodic rant about SCHED_ULE

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Wed, 10 May 2023 14:14:03 UTC
On 5/3/23, Tomoaki AOKI <junchoon@dec.sakura.ne.jp> wrote:
> On Mon, 1 May 2023 03:33:18 +0200
> Mateusz Guzik <mjguzik@gmail.com> wrote:
>
>> On 5/1/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> > On 3/31/23, Tomoaki AOKI <junchoon@dec.sakura.ne.jp> wrote:
>> >> On Mon, 27 Mar 2023 16:47:04 +0200
>> >> Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >>
>> >>> On 3/27/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >>> > On 3/25/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >>> >> On 3/23/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >>> >>> On 3/22/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >>> >>>> On 3/22/23, Steve Kargl <sgk@troutmask.apl.washington.edu>
>> >>> >>>> wrote:
>> >>> >>>>> On Wed, Mar 22, 2023 at 07:31:57PM +0100, Matthias Andree
>> >>> >>>>> wrote:
>> >>> >>>>>>
>> >>
>> >>    (snip)
>> >>
>> >>> >
>> >>> > I repeat the setup: 8 cores, 8 processes doing cpu-bound stuff
>> >>> > while
>> >>> > niced to 20 vs make -j buildkernel
>> >>> >
>> >>> > I had a little more look here, slapped in some hacks as a POC and
>> >>> > got
>> >>> > an improvement from 67 minutes above to 21.
>> >>> >
>> >>> > Hacks are:
>> >>> > 1. limit hog timeslice to 1 tick so that is more eager to bail
>> >>> > 2. always preempt if pri < cpri
>> >>> >
>> >>> > So far I can confidently state the general problem: ULE penalizes
>> >>> > non-cpu hogs for blocking (even if it is not their fault, so to
>> >>> > speak)
>> >>> > and that bumps their prio past preemption threshold, at which point
>> >>> > they can't preempt said hogs (despite hogs having a higher
>> >>> > priority).
>> >>> > At the same time hogs use their full time slices, while non-hogs
>> >>> > get
>> >>> > off cpu very early and have to wait a long time to get back on, at
>> >>> > least in part due to inability to preempt said hogs.
>> >>> >
>> >>> > As I mentioned elsewhere in the thread, interactivity scoring takes
>> >>> > "voluntary off cpu time" into account. As literally anything but
>> >>> > getting preempted counts as "voluntary sleep", workers get shafted
>> >>> > for
>> >>> > going off cpu while waiting on locks in the kernel.
>> >>> >
>> >>> > If I/O needs to happen and the thread waits for the result, it most
>> >>> > likely does it early in its timeslice and once it's all ready it
>> >>> > waits
>> >>> > for background hogs to get off cpu -- it can't preempt them.
>> >>> >
>> >>> > All that said:
>> >>> > 1. "interactivity scoring" (see sched_interact_score)
>> >>> >
>> >>> > I don't know if it makes any sense to begin with. Even if it does,
>> >>> > it
>> >>> > counts stuff it should not by not differentiating between
>> >>> > deliberately
>> >>> > going off cpu (e.g., actual sleep) vs just waiting for a file being
>> >>> > read. Imagine firefox reading a file from disk and being considered
>> >>> > less interactive for it.
>> >>> >
>> >>> > I don't have a solution for this problem. I *suspect* the way to go
>> >>> > would be to explicitly mark xorg/wayland/whatever as "interactive"
>> >>> > and
>> >>> > have it inherited by its offspring. At the same time it should not
>> >>> > follow to stuff spawned in terminals. Not claiming this is perfect,
>> >>> > but it does eliminate the guessing game.
>> >>> >
>> >>> > Even so, 4BSD does not have any mechanism of the sort and
>> >>> > reportedly
>> >>> > remains usable on a desktop just by providing some degree of
>> >>> > fairness.
>> >>> >
>> >>> > Given that, I suspect the short term solution would whack said
>> >>> > scoring
>> >>> > altogether and focus on fairness (see below).
>> >>> >
>> >>> > 2. fairness
>> >>> >
>> >>> > As explained above doing any offcpu-time inducing work instantly
>> >>> > shafts threads versus cpu hogs, even if said hogs are niced way
>> >>> > above
>> >>> > them.
>> >>> >
>> >>> > Here I *suspect* position to add in the runqueue should be related
>> >>> > to
>> >>> > how much slice was left when the thread went off cpu, while making
>> >>> > sure that hogs get to run eventually. Not that I have a nice way of
>> >>> > implementing this -- maybe a separate queue for known hogs and
>> >>> > picking
>> >>> > them every n turns or similar.
>> >>> >
>> >>>
>> >>> Aight, now that I had a sober look at the code I think I cracked the
>> >>> case.
>> >>>
>> >>> The runq mechanism used by both 4BSD and ULE provides 64(!) queues,
>> >>> where the priority is divided by said number and that's how you know
>> >>> in which queue to land the thread.
>> >>>
>> >>> When deciding what to run, 4BSD uses runq_choose which iterates all
>> >>> queues from the beginning. This means threads of lower priority keep
>> >>> executing before the rest. In particular cpu hog lands with a high
>> >>> priority, looking worse than make -j 8 buildkernel and only running
>> >>> when there is nothing else ready to get the cpu. While this may sound
>> >>> decent, it is bad -- in principle a steady stream of lower priority
>> >>> threads can starve the hogs indefinitely.
>> >>>
>> >>> The problem was recognized when writing ULE, but improperly fixed --
>> >>> it ends up distributing all threads within given priority range
>> >>> across
>> >>> the queues and then performing a lookup in a given queue. Here the
>> >>> problem is that while technically everyone does get a chance to run,
>> >>> the threads not using full slices are hosed for the time period as
>> >>> they wait for the hog *a lot*.
>> >>>
>> >>> A hack patch to induce the bogus-but-better 4BSD behavior of draining
>> >>> all runqs before running higher prio threads drops down build time to
>> >>> ~9 minutes, which is shorter than 4BSD.
>> >>>
>> >>> However, the right fix would achieve that *without* introducing
>> >>> starvation potential.
>> >>>
>> >>> I also note the runqs are a massive waste of memory and computing
>> >>> power. I'm going to have to sleep on what to do here.
>> >>>
>> >>> For interested here is the hackery:
>> >>> https://people.freebsd.org/~mjg/.junk/ule-poc-hacks-dont-use.diff
>> >>>
>> >>> sysctl kern.sched.slice_nice=0
>> >>> sysctl kern.sched.preempt_thresh=400 # arbitrary number higher than
>> >>> any
>> >>> prio
>> >>>
>> >>> --
>> >>> Mateusz Guzik <mjguzik gmail.com>
>> >>
>> >> Thanks for the patch.
>> >> Applied on top of main, amd64 at commit
>> >> 9d33a9d96f5a2cd88d0955b5b56ef5058b1706c1, setup 2 sysctls as you
>> >> mentioned and tested as below
>> >>
>> >>   *Play flac files by multimedia/audacious via audio/virtual_oss
>> >>   *Running www/firefox (not touched while testing)
>> >>   *Forcibly build lang/rust
>> >>   *Play games/aisleriot
>> >>
>> >> at the same time.
>> >> games/aisleriot runs slower than the situation lang/rust is not in
>> >> build, but didn't "freeze" and audacious normally played next music on
>> >> playlist, even on lang/rust is building codes written in rust.
>> >>
>> >> This is GREAT advance!
>> >> Without the patch, compiling rust codes eats up almost 100% of ALL
>> >> cores, and games/aisleriot often FREEZES SEVERAL MINUTES, and
>> >> multimedia/audacious needs to wait for, at worst, next music for FEW
>> >> MINUTES. (Once playback starts, the music is played normally until it
>> >> ends.)
>> >>
>> >> But unfortunately, the patch cannot be applied to stable/13, as some
>> >> prerequisite commits are not MFC'ed.
>> >> Missing commits are at least as below. There should be more, as I
>> >> gave up further tracking and haven't actually merged them to test.
>> >>
>> >>  commit	954cffe95de1b9d70ed804daa45b7921f0f5c9da [1]
>> >>    ule: Simplistic time-sharing for interrupt threads.
>> >>
>> >>  commit	fea89a2804ad89f5342268a8546a3f9b515b5e6c [2]
>> >>    Add sched_ithread_prio to set the base priority of an interrupt
>> >>    thread.
>> >>
>> >>  commit	85b46073242d4666e1c9037d52220422449f9584 [3]
>> >>    Deduplicate bus_dma bounce code.
>> >>
>> >>
>> >> [1]
>> >> https://cgit.freebsd.org/src/commit/?id=954cffe95de1b9d70ed804daa45b7921f0f5c9da
>> >>
>> >> [2]
>> >> https://cgit.freebsd.org/src/commit/?id=fea89a2804ad89f5342268a8546a3f9b515b5e6c
>> >>
>> >> [3]
>> >> https://cgit.freebsd.org/src/commit/?id=85b46073242d4666e1c9037d52220422449f9584
>> >>
>> >> --
>> >> Tomoaki AOKI    <junchoon@dec.sakura.ne.jp>
>> >>
>> >>
>> >
>> > Hello everyone.
>> >
>> > I sorted out a patch I consider comittable for the time being. IT IS
>> > NOT A PANACEA by any means, but it does sort out the most acute
>> > problem and should be a win for most people. It also comes with a knob
>> > to turn it off.
>> >
>> > That said, can you test this please:
>> > https://people.freebsd.org/~mjg/ule_pickshort.diff
>> >
>> > works against fresh main. if you are worried about recent zfs woes,
>> > just make sure you don't zpool upgrade and will be fine.
>> >
>>
>> Here is an updated patch:
>> https://people.freebsd.org/~mjg/ule_pickshortv2.diff
>>
>> if you are getting bad results, do:
>> sysctl kern.sched.preempt_bottom=0
>>
>> and try again.
>>
>> thank you.
>>
>> --
>> Mateusz Guzik <mjguzik gmail.com>
>
> Tried just a bit, and turned out my workload requires
> kern.sched.preempt_bottom=0.
>
> Tested with previous patch (ule-poc-hacks-dont-use.diff) backed out.
> At commit d713e0891ff9ab8246245c3206851d486ecfdd37, amd64.
>
> What I did for tests:
>   While building lang/rust, in massive rustc phase,
>     *Play FLAC music files using multimedia/audacious
>
>     *Play youtube movies in www/firefox
>
>     *edit some junk text on editors/leafpad
>
>   multimedia/audacious plays via audio/virtual_oss.
>   www/firefox plays sounds via audio/pulseaudio
>    (Backed with audio/virtual_oss.)
>
> What happened:
>   Without kern.sched.preempt_bottom=0 (was 135), all sounds
>   are chopping unless any keytypes or mouse actions are done.
>   Text editing was mostly smooth, but always slow.
>   These are regardless kern.sched.preempt_thresh settings.
>
>   With kern.sched.preempt_bottom=0,
>     *Playing FLAC music files using multimedia/audacious is fine.
>
>     *Playing youtube movies in www/firefox depends on
>      kern.sched.preempt_thresh setting. The larger the value, the
>      smoother the playback is. Tested with values 21, 80, 121, 224, 400.
>      Note that 224 and 400 looked almost the same with my eyeballs.
>
>     *Editing texts is faster, but just sometimes cursor movements
>      and editing was not smooth (randomly).
>
> Sorry, as I'm basically working on stable/13, I cannot take enough time
> for testing on main.
>
>


thanks for testing

I posted the patch for review: https://reviews.freebsd.org/D40045

it will be suitable for MFC to stable/13

-- 
Mateusz Guzik <mjguzik gmail.com>