Re: Periodic rant about SCHED_ULE

From: Tomoaki AOKI <junchoon_at_dec.sakura.ne.jp>
Date: Sat, 13 May 2023 13:19:00 UTC
On Thu, 11 May 2023 20:57:12 +0900
Tomoaki AOKI <junchoon@dec.sakura.ne.jp> wrote:

> On Wed, 10 May 2023 16:14:03 +0200
> Mateusz Guzik <mjguzik@gmail.com> wrote:
> 
> > On 5/3/23, Tomoaki AOKI <junchoon@dec.sakura.ne.jp> wrote:
> > > On Mon, 1 May 2023 03:33:18 +0200
> > > Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > >> On 5/1/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >> > On 3/31/23, Tomoaki AOKI <junchoon@dec.sakura.ne.jp> wrote:
> > >> >> On Mon, 27 Mar 2023 16:47:04 +0200
> > >> >> Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >> >>
> > >> >>> On 3/27/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >> >>> > On 3/25/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >> >>> >> On 3/23/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >> >>> >>> On 3/22/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >> >>> >>>> On 3/22/23, Steve Kargl <sgk@troutmask.apl.washington.edu>
> > >> >>> >>>> wrote:
> > >> >>> >>>>> On Wed, Mar 22, 2023 at 07:31:57PM +0100, Matthias Andree
> > >> >>> >>>>> wrote:
> > >> >>> >>>>>>
> > >> >>
> > >> >>    (snip)
> > >> >>
> > >> >>> >
> > >> >>> > I repeat the setup: 8 cores, 8 processes doing cpu-bound stuff
> > >> >>> > while
> > >> >>> > niced to 20 vs make -j buildkernel
> > >> >>> >
> > >> >>> > I had a little more look here, slapped in some hacks as a POC and
> > >> >>> > got
> > >> >>> > an improvement from 67 minutes above to 21.
> > >> >>> >
> > >> >>> > Hacks are:
> > >> >>> > 1. limit hog timeslice to 1 tick so that is more eager to bail
> > >> >>> > 2. always preempt if pri < cpri
> > >> >>> >
> > >> >>> > So far I can confidently state the general problem: ULE penalizes
> > >> >>> > non-cpu hogs for blocking (even if it is not their fault, so to
> > >> >>> > speak)
> > >> >>> > and that bumps their prio past preemption threshold, at which point
> > >> >>> > they can't preempt said hogs (despite hogs having a higher
> > >> >>> > priority).
> > >> >>> > At the same time hogs use their full time slices, while non-hogs
> > >> >>> > get
> > >> >>> > off cpu very early and have to wait a long time to get back on, at
> > >> >>> > least in part due to inability to preempt said hogs.
> > >> >>> >
> > >> >>> > As I mentioned elsewhere in the thread, interactivity scoring takes
> > >> >>> > "voluntary off cpu time" into account. As literally anything but
> > >> >>> > getting preempted counts as "voluntary sleep", workers get shafted
> > >> >>> > for
> > >> >>> > going off cpu while waiting on locks in the kernel.
> > >> >>> >
> > >> >>> > If I/O needs to happen and the thread waits for the result, it most
> > >> >>> > likely does it early in its timeslice and once it's all ready it
> > >> >>> > waits
> > >> >>> > for background hogs to get off cpu -- it can't preempt them.
> > >> >>> >
> > >> >>> > All that said:
> > >> >>> > 1. "interactivity scoring" (see sched_interact_score)
> > >> >>> >
> > >> >>> > I don't know if it makes any sense to begin with. Even if it does,
> > >> >>> > it
> > >> >>> > counts stuff it should not by not differentiating between
> > >> >>> > deliberately
> > >> >>> > going off cpu (e.g., actual sleep) vs just waiting for a file being
> > >> >>> > read. Imagine firefox reading a file from disk and being considered
> > >> >>> > less interactive for it.
> > >> >>> >
> > >> >>> > I don't have a solution for this problem. I *suspect* the way to go
> > >> >>> > would be to explicitly mark xorg/wayland/whatever as "interactive"
> > >> >>> > and
> > >> >>> > have it inherited by its offspring. At the same time it should not
> > >> >>> > follow to stuff spawned in terminals. Not claiming this is perfect,
> > >> >>> > but it does eliminate the guessing game.
> > >> >>> >
> > >> >>> > Even so, 4BSD does not have any mechanism of the sort and
> > >> >>> > reportedly
> > >> >>> > remains usable on a desktop just by providing some degree of
> > >> >>> > fairness.
> > >> >>> >
> > >> >>> > Given that, I suspect the short term solution would whack said
> > >> >>> > scoring
> > >> >>> > altogether and focus on fairness (see below).
> > >> >>> >
> > >> >>> > 2. fairness
> > >> >>> >
> > >> >>> > As explained above doing any offcpu-time inducing work instantly
> > >> >>> > shafts threads versus cpu hogs, even if said hogs are niced way
> > >> >>> > above
> > >> >>> > them.
> > >> >>> >
> > >> >>> > Here I *suspect* position to add in the runqueue should be related
> > >> >>> > to
> > >> >>> > how much slice was left when the thread went off cpu, while making
> > >> >>> > sure that hogs get to run eventually. Not that I have a nice way of
> > >> >>> > implementing this -- maybe a separate queue for known hogs and
> > >> >>> > picking
> > >> >>> > them every n turns or similar.
> > >> >>> >
> > >> >>>
> > >> >>> Aight, now that I had a sober look at the code I think I cracked the
> > >> >>> case.
> > >> >>>
> > >> >>> The runq mechanism used by both 4BSD and ULE provides 64(!) queues,
> > >> >>> where the priority is divided by said number and that's how you know
> > >> >>> in which queue to land the thread.
> > >> >>>
> > >> >>> When deciding what to run, 4BSD uses runq_choose which iterates all
> > >> >>> queues from the beginning. This means threads of lower priority keep
> > >> >>> executing before the rest. In particular cpu hog lands with a high
> > >> >>> priority, looking worse than make -j 8 buildkernel and only running
> > >> >>> when there is nothing else ready to get the cpu. While this may sound
> > >> >>> decent, it is bad -- in principle a steady stream of lower priority
> > >> >>> threads can starve the hogs indefinitely.
> > >> >>>
> > >> >>> The problem was recognized when writing ULE, but improperly fixed --
> > >> >>> it ends up distributing all threads within given priority range
> > >> >>> across
> > >> >>> the queues and then performing a lookup in a given queue. Here the
> > >> >>> problem is that while technically everyone does get a chance to run,
> > >> >>> the threads not using full slices are hosed for the time period as
> > >> >>> they wait for the hog *a lot*.
> > >> >>>
> > >> >>> A hack patch to induce the bogus-but-better 4BSD behavior of draining
> > >> >>> all runqs before running higher prio threads drops down build time to
> > >> >>> ~9 minutes, which is shorter than 4BSD.
> > >> >>>
> > >> >>> However, the right fix would achieve that *without* introducing
> > >> >>> starvation potential.
> > >> >>>
> > >> >>> I also note the runqs are a massive waste of memory and computing
> > >> >>> power. I'm going to have to sleep on what to do here.
> > >> >>>
> > >> >>> For interested here is the hackery:
> > >> >>> https://people.freebsd.org/~mjg/.junk/ule-poc-hacks-dont-use.diff
> > >> >>>
> > >> >>> sysctl kern.sched.slice_nice=0
> > >> >>> sysctl kern.sched.preempt_thresh=400 # arbitrary number higher than
> > >> >>> any
> > >> >>> prio
> > >> >>>
> > >> >>> --
> > >> >>> Mateusz Guzik <mjguzik gmail.com>
> > >> >>
> > >> >> Thanks for the patch.
> > >> >> Applied on top of main, amd64 at commit
> > >> >> 9d33a9d96f5a2cd88d0955b5b56ef5058b1706c1, setup 2 sysctls as you
> > >> >> mentioned and tested as below
> > >> >>
> > >> >>   *Play flac files by multimedia/audacious via audio/virtual_oss
> > >> >>   *Running www/firefox (not touched while testing)
> > >> >>   *Forcibly build lang/rust
> > >> >>   *Play games/aisleriot
> > >> >>
> > >> >> at the same time.
> > >> >> games/aisleriot runs slower than the situation lang/rust is not in
> > >> >> build, but didn't "freeze" and audacious normally played next music on
> > >> >> playlist, even on lang/rust is building codes written in rust.
> > >> >>
> > >> >> This is GREAT advance!
> > >> >> Without the patch, compiling rust codes eats up almost 100% of ALL
> > >> >> cores, and games/aisleriot often FREEZES SEVERAL MINUTES, and
> > >> >> multimedia/audacious needs to wait for, at worst, next music for FEW
> > >> >> MINUTES. (Once playback starts, the music is played normally until it
> > >> >> ends.)
> > >> >>
> > >> >> But unfortunately, the patch cannot be applied to stable/13, as some
> > >> >> prerequisite commits are not MFC'ed.
> > >> >> Missing commits are at least as below. There should be more, as I
> > >> >> gave up further tracking and haven't actually merged them to test.
> > >> >>
> > >> >>  commit	954cffe95de1b9d70ed804daa45b7921f0f5c9da [1]
> > >> >>    ule: Simplistic time-sharing for interrupt threads.
> > >> >>
> > >> >>  commit	fea89a2804ad89f5342268a8546a3f9b515b5e6c [2]
> > >> >>    Add sched_ithread_prio to set the base priority of an interrupt
> > >> >>    thread.
> > >> >>
> > >> >>  commit	85b46073242d4666e1c9037d52220422449f9584 [3]
> > >> >>    Deduplicate bus_dma bounce code.
> > >> >>
> > >> >>
> > >> >> [1]
> > >> >> https://cgit.freebsd.org/src/commit/?id=954cffe95de1b9d70ed804daa45b7921f0f5c9da
> > >> >>
> > >> >> [2]
> > >> >> https://cgit.freebsd.org/src/commit/?id=fea89a2804ad89f5342268a8546a3f9b515b5e6c
> > >> >>
> > >> >> [3]
> > >> >> https://cgit.freebsd.org/src/commit/?id=85b46073242d4666e1c9037d52220422449f9584
> > >> >>
> > >> >> --
> > >> >> Tomoaki AOKI    <junchoon@dec.sakura.ne.jp>
> > >> >>
> > >> >>
> > >> >
> > >> > Hello everyone.
> > >> >
> > >> > I sorted out a patch I consider comittable for the time being. IT IS
> > >> > NOT A PANACEA by any means, but it does sort out the most acute
> > >> > problem and should be a win for most people. It also comes with a knob
> > >> > to turn it off.
> > >> >
> > >> > That said, can you test this please:
> > >> > https://people.freebsd.org/~mjg/ule_pickshort.diff
> > >> >
> > >> > works against fresh main. if you are worried about recent zfs woes,
> > >> > just make sure you don't zpool upgrade and will be fine.
> > >> >
> > >>
> > >> Here is an updated patch:
> > >> https://people.freebsd.org/~mjg/ule_pickshortv2.diff
> > >>
> > >> if you are getting bad results, do:
> > >> sysctl kern.sched.preempt_bottom=0
> > >>
> > >> and try again.
> > >>
> > >> thank you.
> > >>
> > >> --
> > >> Mateusz Guzik <mjguzik gmail.com>
> > >
> > > Tried just a bit, and turned out my workload requires
> > > kern.sched.preempt_bottom=0.
> > >
> > > Tested with previous patch (ule-poc-hacks-dont-use.diff) backed out.
> > > At commit d713e0891ff9ab8246245c3206851d486ecfdd37, amd64.
> > >
> > > What I did for tests:
> > >   While building lang/rust, in massive rustc phase,
> > >     *Play FLAC music files using multimedia/audacious
> > >
> > >     *Play youtube movies in www/firefox
> > >
> > >     *edit some junk text on editors/leafpad
> > >
> > >   multimedia/audacious plays via audio/virtual_oss.
> > >   www/firefox plays sounds via audio/pulseaudio
> > >    (Backed with audio/virtual_oss.)
> > >
> > > What happened:
> > >   Without kern.sched.preempt_bottom=0 (was 135), all sounds
> > >   are chopping unless any keytypes or mouse actions are done.
> > >   Text editing was mostly smooth, but always slow.
> > >   These are regardless kern.sched.preempt_thresh settings.
> > >
> > >   With kern.sched.preempt_bottom=0,
> > >     *Playing FLAC music files using multimedia/audacious is fine.
> > >
> > >     *Playing youtube movies in www/firefox depends on
> > >      kern.sched.preempt_thresh setting. The larger the value, the
> > >      smoother the playback is. Tested with values 21, 80, 121, 224, 400.
> > >      Note that 224 and 400 looked almost the same with my eyeballs.
> > >
> > >     *Editing texts is faster, but just sometimes cursor movements
> > >      and editing was not smooth (randomly).
> > >
> > > Sorry, as I'm basically working on stable/13, I cannot take enough time
> > > for testing on main.
> > >
> > >
> > 
> > 
> > thanks for testing
> > 
> > I posted the patch for review: https://reviews.freebsd.org/D40045
> > 
> > it will be suitable for MFC to stable/13
> > 
> > -- 
> > Mateusz Guzik <mjguzik gmail.com>
> 
> Thanks for the update! Tried to apply to stable/13 at commit
> b4e9bfd51c2d6f66291c89c3e8f4c5809f1be447, but unfortunately, fails on
> hunk 7.
> 
> td_slice(), used at line 2635 (2647 after patch) is implemented on main
> at commit 954cffe95de1b9d70ed804daa45b7921f0f5c9da, which is not MFC'ed.
> This function is used on added lines (excluding prototype), at line 501.
> 
> I'll test on main later, hopefully this weekend.
> 
> Regards.
> 
> 
> -- 
> Tomoaki AOKI    <junchoon@dec.sakura.ne.jp>

Tested under the conditions below.

 *Backed out the previous patch and applied the latest patch from
  D40045. (I've subscribed there now.)

 *Updated to commit bee3d4bf8ed55260d8cfc6d168ffa1afb49ef6a8

 *Updated installed ports to the latest state.

So it's not a like-for-like comparison with my previous report.


Running the same tests as in my previous report, this time:

 *kern.sched.pick_short=1 seems to be slightly better than =0.

 *The larger kern.sched.preempt_thresh is, the smoother the user
  experience is. The default with this patch (224) would be about the
  lowest acceptable value. Tested values are the same as in my previous
  tests: 21, 80, 121, 224 and 400.

 *sysctl kern.sched.preempt_bottom no longer exists in the patch,
  thus untested.
  But I feel the previous patch with kern.sched.preempt_bottom=0
  was smoother.
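
For anyone repeating this, the sweep over kern.sched.preempt_thresh
could be scripted roughly as below. This is only a sketch, not what I
actually ran: sweep_thresh and DRY_RUN are names made up for
illustration, and by default it just prints the sysctl commands
instead of running them.

```shell
#!/bin/sh
# Sketch: step through the preempt_thresh values tested above.
# DRY_RUN=1 (the default here) only prints the commands. Set DRY_RUN=0
# and run as root on a patched kernel to actually apply them.
: "${DRY_RUN:=1}"

# sweep_thresh is a hypothetical helper, not part of any patch.
sweep_thresh() {
    for v in 21 80 121 224 400; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "sysctl kern.sched.preempt_thresh=$v"
        else
            sysctl kern.sched.preempt_thresh="$v"
            # Pause so playback can be observed at this setting.
            printf 'thresh=%s, press Enter for next value: ' "$v"
            read -r _
        fi
    done
}

sweep_thresh
```

With DRY_RUN=0 the loop waits for Enter after each value, so the
video/audio can be checked by eye and ear before moving on.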

BTW, I've tried several videos and noticed that videos of drum
performances, shot from above or from behind, are good for spotting
delays/hiccups in video. So I've used the youtube video [1] from 3:37
onward. Any other would be OK, but hiccups are easier to detect at a
faster BPM, by watching the movement of the sticks.

[1] https://www.youtube.com/watch?v=HmOlZ4zjOhQ


-- 
Tomoaki AOKI    <junchoon@dec.sakura.ne.jp>