Re: Periodic rant about SCHED_ULE

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Mon, 01 May 2023 01:33:18 UTC
On 5/1/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 3/31/23, Tomoaki AOKI <junchoon@dec.sakura.ne.jp> wrote:
>> On Mon, 27 Mar 2023 16:47:04 +0200
>> Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>>> On 3/27/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>> > On 3/25/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>> >> On 3/23/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>> >>> On 3/22/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>> >>>> On 3/22/23, Steve Kargl <sgk@troutmask.apl.washington.edu> wrote:
>>> >>>>> On Wed, Mar 22, 2023 at 07:31:57PM +0100, Matthias Andree wrote:
>>> >>>>>>
>>
>>    (snip)
>>
>>> >
>>> > I repeat the setup: 8 cores, 8 processes doing cpu-bound stuff while
>>> > niced to 20 vs make -j buildkernel
>>> >
>>> > I had a bit more of a look here, slapped in some hacks as a POC, and
>>> > got an improvement from the 67 minutes above down to 21.
>>> >
>>> > Hacks are:
>>> > 1. limit hog timeslice to 1 tick so that it is more eager to bail
>>> > 2. always preempt if pri < cpri
>>> >
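>>> > In rough pseudo-C, hack 2 amounts to short-circuiting the preemption
>>> > decision (a sketch of the idea, not the actual diff; in the real code
>>> > the decision lives in sched_shouldpreempt()):
>>> >
>>> > static int
>>> > shouldpreempt_sketch(int pri, int cpri)
>>> > {
>>> >         if (pri >= cpri)        /* not more important, never preempt */
>>> >                 return (0);
>>> >         /* hack: skip the preempt_thresh check, always kick the hog */
>>> >         return (1);
>>> > }
>>> >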
>>> > So far I can confidently state the general problem: ULE penalizes
>>> > non-cpu hogs for blocking (even if it is not their fault, so to speak),
>>> > and that bumps their prio past the preemption threshold, at which point
>>> > they can't preempt said hogs (despite the hogs having a numerically
>>> > higher, i.e. less important, priority). At the same time hogs use their
>>> > full time slices, while non-hogs get off cpu very early and have to
>>> > wait a long time to get back on, at least in part due to the inability
>>> > to preempt said hogs.
>>> >
>>> > As I mentioned elsewhere in the thread, interactivity scoring takes
>>> > "voluntary off cpu time" into account. As literally anything but
>>> > getting preempted counts as "voluntary sleep", workers get shafted for
>>> > going off cpu while waiting on locks in the kernel.
>>> >
>>> > If I/O needs to happen and the thread waits for the result, it most
>>> > likely does so early in its timeslice, and once the result is ready it
>>> > waits for the background hogs to get off cpu -- it can't preempt them.
>>> >
>>> > All that said:
>>> > 1. "interactivity scoring" (see sched_interact_score)
>>> >
>>> > I don't know if it makes any sense to begin with. Even if it does, it
>>> > counts stuff it should not, by failing to differentiate between
>>> > deliberately going off cpu (e.g., actual sleep) and merely waiting for
>>> > a file to be read. Imagine firefox reading a file from disk and being
>>> > considered less interactive for it.
>>> >
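>>> > To illustrate: the score boils down to the ratio of run time to
>>> > voluntary sleep time, with anything off cpu that is not a preemption
>>> > counted as sleep. A simplified sketch of sched_interact_score (from
>>> > memory, ignoring the rounding/overflow guards):
>>> >
>>> > /* score 0..100, low means "interactive" */
>>> > static int
>>> > interact_score_sketch(u_long runtime, u_long slptime)
>>> > {
>>> >         if (slptime > runtime)          /* mostly sleeping */
>>> >                 return (50 * runtime / slptime);
>>> >         if (runtime > slptime)          /* mostly running */
>>> >                 return (100 - 50 * slptime / runtime);
>>> >         return (50);
>>> > }
>>> >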
>>> > I don't have a solution for this problem. I *suspect* the way to go
>>> > would be to explicitly mark xorg/wayland/whatever as "interactive" and
>>> > have it inherited by its offspring. At the same time it should not
>>> > propagate to stuff spawned in terminals. Not claiming this is perfect,
>>> > but it does eliminate the guessing game.
>>> >
>>> > Even so, 4BSD does not have any mechanism of the sort and reportedly
>>> > remains usable on a desktop just by providing some degree of fairness.
>>> >
>>> > Given that, I suspect the short term solution would be to whack said
>>> > scoring altogether and focus on fairness (see below).
>>> >
>>> > 2. fairness
>>> >
>>> > As explained above, doing any off-cpu-time inducing work instantly
>>> > shafts threads versus cpu hogs, even if said hogs are niced way above
>>> > them.
>>> >
>>> > Here I *suspect* the position to insert at in the runqueue should be
>>> > related to how much of the slice was left when the thread went off cpu,
>>> > while making sure that hogs get to run eventually. Not that I have a
>>> > nice way of implementing this -- maybe a separate queue for known hogs,
>>> > picking them every n turns or similar (sketched below).
>>> >
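>>> > For illustration only (names made up, HOG_PICK_INTERVAL is a
>>> > hypothetical knob), something along these lines:
>>> >
>>> > struct thread *
>>> > pick_next_sketch(struct runq *normal, struct runq *hogs)
>>> > {
>>> >         static u_int turn;
>>> >         struct thread *td;
>>> >
>>> >         /* every n-th pick, give known hogs a guaranteed turn */
>>> >         if (++turn % HOG_PICK_INTERVAL == 0 &&
>>> >             (td = runq_choose(hogs)) != NULL)
>>> >                 return (td);
>>> >         if ((td = runq_choose(normal)) != NULL)
>>> >                 return (td);
>>> >         return (runq_choose(hogs));     /* nothing else runnable */
>>> > }
>>> >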
>>>
>>> Aight, now that I had a sober look at the code I think I cracked the
>>> case.
>>>
>>> The runq mechanism used by both 4BSD and ULE provides 64(!) queues,
>>> with the priority range divided across said number of queues -- that's
>>> how you know which queue to land the thread in.
>>>
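>>> Concretely, the mapping is roughly (constants from sys/runq.h, from
>>> memory: 64 queues, 4 adjacent priorities per queue out of 256):
>>>
>>> #define RQ_NQS  64      /* number of run queues */
>>> #define RQ_PPQ  4       /* priorities per queue */
>>>
>>> static inline int
>>> runq_index_sketch(int pri)
>>> {
>>>         return (pri / RQ_PPQ);  /* 0 .. RQ_NQS - 1 */
>>> }
>>>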
>>> When deciding what to run, 4BSD uses runq_choose, which iterates all
>>> queues from the beginning. This means threads with a lower (better)
>>> priority keep executing before the rest. In particular a cpu hog lands
>>> with a high priority value, looking worse than make -j 8 buildkernel,
>>> and only runs when there is nothing else ready to get the cpu. While
>>> this may sound decent, it is bad -- in principle a steady stream of
>>> lower priority threads can starve the hogs indefinitely.
>>>
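>>> A sketch of that selection (simplified; the real runq_choose consults a
>>> status bitmap instead of scanning, but the effect is the same):
>>>
>>> struct thread *
>>> runq_choose_sketch(struct runq *rq)
>>> {
>>>         int i;
>>>
>>>         /* lowest-numbered (best priority) non-empty queue always wins */
>>>         for (i = 0; i < RQ_NQS; i++)
>>>                 if (!TAILQ_EMPTY(&rq->rq_queues[i]))
>>>                         return (TAILQ_FIRST(&rq->rq_queues[i]));
>>>         return (NULL);
>>> }
>>>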
>>> The problem was recognized when writing ULE, but improperly fixed --
>>> it ends up distributing all threads within a given priority range
>>> across the queues and then performing a lookup starting at a rotating
>>> queue index. Here the problem is that while technically everyone does
>>> get a chance to run, threads not using their full slices are hosed in
>>> the meantime, as they wait for the hogs *a lot*.
>>>
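>>> Roughly, the timeshare insertion works like this (a simplified sketch
>>> of what tdq_runq_add() does for timeshare threads, from memory):
>>>
>>> static int
>>> timeshare_slot_sketch(struct tdq *tdq, int pri)
>>> {
>>>         int idx;
>>>
>>>         /* map the batch priority onto an offset from the rotating
>>>          * insert index: the 64 queues form a circular "calendar", so
>>>          * every bucket gets drained eventually, but a thread that went
>>>          * off cpu early still queues behind the bucket the hogs sit in */
>>>         idx = RQ_NQS * (pri - PRI_MIN_BATCH) / PRI_BATCH_RANGE;
>>>         return ((idx + tdq->tdq_idx) % RQ_NQS);
>>> }
>>>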
>>> A hack patch to induce the bogus-but-better 4BSD behavior of draining
>>> all runqs before running higher prio threads drops the build time to
>>> ~9 minutes, which is shorter than with 4BSD.
>>>
>>> However, the right fix would achieve that *without* introducing
>>> starvation potential.
>>>
>>> I also note the runqs are a massive waste of memory and computing
>>> power. I'm going to have to sleep on what to do here.
>>>
>>> For the interested, here is the hackery:
>>> https://people.freebsd.org/~mjg/.junk/ule-poc-hacks-dont-use.diff
>>>
>>> sysctl kern.sched.slice_nice=0
>>> sysctl kern.sched.preempt_thresh=400 # arbitrary number higher than any prio
>>>
>>> --
>>> Mateusz Guzik <mjguzik gmail.com>
>>
>> Thanks for the patch.
>> Applied it on top of main, amd64, at commit
>> 9d33a9d96f5a2cd88d0955b5b56ef5058b1706c1, set the 2 sysctls as you
>> mentioned, and tested as below:
>>
>>   *Play flac files by multimedia/audacious via audio/virtual_oss
>>   *Running www/firefox (not touched while testing)
>>   *Forcibly build lang/rust
>>   *Play games/aisleriot
>>
>> at the same time.
>> games/aisleriot runs slower than when lang/rust is not building, but it
>> didn't "freeze", and audacious normally moved on to the next track on
>> the playlist, even while lang/rust was compiling code written in Rust.
>>
>> This is a GREAT advance!
>> Without the patch, compiling Rust code eats up almost 100% of ALL
>> cores, games/aisleriot often FREEZES for SEVERAL MINUTES, and
>> multimedia/audacious has to wait, at worst, a FEW MINUTES for the next
>> track. (Once playback starts, the music plays normally until it ends.)
>>
>> But unfortunately, the patch cannot be applied to stable/13, as some
>> prerequisite commits have not been MFC'ed.
>> The missing commits are at least the ones below. There are probably
>> more, as I gave up further tracking and haven't actually merged them to
>> test.
>>
>>  commit	954cffe95de1b9d70ed804daa45b7921f0f5c9da [1]
>>    ule: Simplistic time-sharing for interrupt threads.
>>
>>  commit	fea89a2804ad89f5342268a8546a3f9b515b5e6c [2]
>>    Add sched_ithread_prio to set the base priority of an interrupt
>>    thread.
>>
>>  commit	85b46073242d4666e1c9037d52220422449f9584 [3]
>>    Deduplicate bus_dma bounce code.
>>
>>
>> [1]
>> https://cgit.freebsd.org/src/commit/?id=954cffe95de1b9d70ed804daa45b7921f0f5c9da
>>
>> [2]
>> https://cgit.freebsd.org/src/commit/?id=fea89a2804ad89f5342268a8546a3f9b515b5e6c
>>
>> [3]
>> https://cgit.freebsd.org/src/commit/?id=85b46073242d4666e1c9037d52220422449f9584
>>
>> --
>> Tomoaki AOKI    <junchoon@dec.sakura.ne.jp>
>>
>>
>
> Hello everyone.
>
> I sorted out a patch I consider committable for the time being. IT IS
> NOT A PANACEA by any means, but it does sort out the most acute
> problem and should be a win for most people. It also comes with a knob
> to turn it off.
>
> That said, can you test this please:
> https://people.freebsd.org/~mjg/ule_pickshort.diff
>
> It works against fresh main. If you are worried about the recent ZFS
> woes, just make sure you don't zpool upgrade and you will be fine.
>

Here is an updated patch:
https://people.freebsd.org/~mjg/ule_pickshortv2.diff

If you are getting bad results, do:
sysctl kern.sched.preempt_bottom=0

and try again.

Thank you.

-- 
Mateusz Guzik <mjguzik gmail.com>