From nobody Mon Apr 17 17:35:10 2023
From: Mateusz Guzik <mjguzik@gmail.com>
Date: Mon, 17 Apr 2023 19:35:10 +0200
Subject: Re: Periodic rant about SCHED_ULE
To: Mark Johnston
Cc: freebsd-hackers@freebsd.org
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
Oops, this fell through the cracks, apologies for such a late reply.

On 3/31/23, Mark Johnston wrote:
> On Fri, Mar 31, 2023 at 08:41:41PM +0200, Mateusz Guzik wrote:
>> On 3/30/23, Mark Johnston wrote:
>> > On Thu, Mar 30, 2023 at 05:36:54PM +0200, Mateusz Guzik wrote:
>> >> I looked into it a little more; below you can find a summary and
>> >> steps forward.
>> >>
>> >> First, a general statement: while ULE does have performance bugs,
>> >> it has a better basis than 4BSD for making scheduling decisions.
>> >> Most notably, it understands CPU topology, at least for cases
>> >> which don't involve big.LITTLE. Any non-freak case where 4BSD
>> >> performs better is a bug in ULE, unless it comes down to a
>> >> tradeoff which can be tweaked to line them up. Or, more to the
>> >> point: there should not be any legitimate reason to use 4BSD
>> >> these days, and modulo the bugs below you are probably losing
>> >> performance by doing so.
>> >>
>> >> Bugs reported in this thread by others and confirmed by me:
>> >> 1. failure to load-balance when having n CPUs and n + 1 workers
>> >> -- the excess one stays on the same CPU thread continuously,
>> >> penalizing the same victim. As a result, the total real time to
>> >> execute a finite computation is longer than in the 4BSD case.
>> >> 2. unfairness of nice -n 20 threads vs threads going frequently
>> >> off CPU (e.g., due to I/O) -- after using only a fraction of its
>> >> slice, the victim has to wait for the CPU hog to use up its
>> >> entire slice, rinse and repeat. This extends a 7+ minute
>> >> buildkernel to over 67 minutes; not an issue on 4BSD.
>> >>
>> >> I put almost no effort into investigating no. 1. There is code
>> >> which is supposed to rebalance load across CPUs; someone(tm) will
>> >> have to sit through it -- for all I know the fix is trivial.
>> >>
>> >> Fixing number 2 makes *another* bug more acute and complicates
>> >> the whole ordeal.
>> >>
>> >> Thus, a bug reported by me:
>> >> 3. interactivity scoring is bogus -- originally introduced to
>> >> detect "interactive" behavior by equating being off CPU with
>> >> waiting for user input. One part of the problem is that it puts
>> >> *all* non-preempted off-CPU time into one bag: a voluntary sleep.
>> >> This includes suffering from lock contention in the kernel, lock
>> >> contention in the program itself,
>> >
>> > Note that time spent off-CPU on a turnstile is not counted as
>> > sleeping for the purpose of interactivity scoring, so this
>> > observation applies only to sx, lockmgr and sleepable rm locks.
>> > That's not to say that this behaviour is correct, but it doesn't
>> > apply to some of the most contended locks unless I'm missing
>> > something.
>> >
>> page busy (massively contested for fork/exec), pipe_lock and even
>> not-locks like waitpid(!)

> A program that spends most of its time blocked in waitpid, like a
> shell, interactive or not, should indeed have a higher scheduling
> priority...
>

Maybe it should, but perhaps not at the expense of a more
latency-sensitive program like a video player. The very notion that
off CPU == interactive dates back to the 80s, where it probably made
sense: the unix systems of the time were mostly terminal-only and the
shell would indeed fit here very nicely.
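For reference, the scoring in question boils down to comparing
accumulated run time against accumulated sleep time, roughly as below.
This is a simplified sketch of sched_interact_score() from
sys/kern/sched_ule.c with the bookkeeping and decay elided, not the
verbatim kernel code; the point is that "sleep" is one undifferentiated
bucket regardless of why the thread went off CPU.

/*
 * Kernel-style sketch (u_long and MAX as in sys/param.h). Scores run
 * 0..100; lower means "more interactive". Anything at or below the
 * threshold (kern.sched.interact, 30 by default) is placed on the
 * realtime (interactive) queues.
 */
#define SCHED_INTERACT_MAX   100
#define SCHED_INTERACT_HALF  (SCHED_INTERACT_MAX / 2)

static int
interact_score(u_long runtime, u_long sleeptime)
{
        u_long div;

        if (sleeptime > runtime) {
                /* Mostly off CPU -- for whatever reason: [0, 50). */
                div = MAX(1, sleeptime / SCHED_INTERACT_HALF);
                return (runtime / div);
        }
        if (runtime > sleeptime) {
                /* Mostly on CPU: (50, 100]. */
                div = MAX(1, runtime / SCHED_INTERACT_HALF);
                return (SCHED_INTERACT_HALF +
                    (SCHED_INTERACT_HALF - sleeptime / div));
        }
        /* Equal run and sleep time. */
        return (runtime != 0 ? SCHED_INTERACT_HALF : 0);
}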
>> >> file I/O and so on, none of which has any bearing on how
>> >> interactive or not the program might happen to be. A bigger part
>> >> of the problem is that, at least today, graphical programs don't
>> >> even act this way to begin with -- they stay on CPU *a lot*.
>> >
>> > I think this statement deserves more nuance. I was on a video call
>> > just now and firefox was consuming about the equivalent of 20-30%
>> > of a CPU across all threads. What kind of graphical programs are
>> > you talking about specifically?
>> >
>> you don't consider 20-30% a lot?

> I would expect a program consuming 20-30% of a CPU to be prioritized
> higher than a CPU hog. And in my experience, running builds while on
> a call doesn't hurt anything (usually). Again, there is room for
> improvement; I don't claim the scheduler is perfect.
>

As noted, one of the performance bugs is that the scheduler
*unintentionally* penalizes threads which go off CPU a lot for short
periods. If the scheduler keeps them in the batch range and there is a
hog in the area, they end up getting disproportionately less CPU. A
kernel build is one example I noted -- a several-fold increase in
total real time when competing with CPU hogs, while struggling to get
any time at all. For all I know this bug is why it works fine for you.

>> >> I asked people to provide me with the output of: dtrace -n
>> >> 'sched:::on-cpu { @[execname] = lquantize(curthread->td_priority,
>> >> 0, 224, 1); }' from their laptops/desktops.
>> >>
>> >> One finding is that most people (at least those who reported) use
>> >> firefox.
>> >>
>> >> Another finding is that the browser is above the threshold which
>> >> would be considered "interactive" for the vast majority of the
>> >> time in all reported cases.
>> >
>> > That is not true of the output that I sent. There, most of the
>> > firefox thread samples are in the interactive range [88-135]. Some
>> > show an even higher priority, maybe due to priority propagation.
>> >
>> That's not the interactive range. 88 is PRI_MIN_BATCH

> 88 is PRI_MIN_TIMESHARE (on main; stable/13 ranges are different I
> think). PRI_MIN_BATCH is PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE = 88
> + 48 = 136. Everything in [88-135] goes into the realtime queue.
>

You are right, I misread the code; static_boost setting the priority
to 72 solidified my misread. Interestingly, this does not change the
crux of the matter -- that non-interactive processes cluster, in terms
of priorities, with ones which are interactive. You can see it in your
own report.
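Spelling out the arithmetic for anyone not staring at the headers (a
sketch of how the constants fall out on main; the real definitions
live in sys/sys/priority.h and sys/kern/sched_ule.c, and stable/13
differs):

#define PRI_MIN_TIMESHARE   88
#define PRI_MAX_TIMESHARE   223
#define PRI_TIMESHARE_RANGE \
        (PRI_MAX_TIMESHARE - PRI_MIN_TIMESHARE + 1)         /* 136 */
#define SCHED_PRI_NRESV     40  /* the nice range: PRIO_MAX - PRIO_MIN */
#define PRI_INTERACT_RANGE \
        ((PRI_TIMESHARE_RANGE - SCHED_PRI_NRESV) / 2)       /* 48 */
#define PRI_MIN_BATCH       (PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE)

/*
 * PRI_MIN_BATCH = 88 + 48 = 136, so [88, 135] is the interactive band
 * (fed to the realtime queue) and [136, 223] is the batch band.
 */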
>> >> I booted a 2-thread VM with xfce and decided to click around.
>> >> Spawned firefox, opened a file manager (Thunar) and from there
>> >> opened a movie to play with mpv. As root I spawned make -j 2
>> >> buildkernel. It was not particularly good :)
>> >>
>> >> I found that mpv spawns a bunch of threads, most notably 2
>> >> distinct threads for audio and video output. The one for video
>> >> got a priority of 175, while the rest had either 88 or 89 -- the
>> >> lowest for timesharing not considered interactive [note: lower is
>> >> considered better].
>> >
>> > Presumably all of the video decoding was done in software, since
>> > you're running in a VM? On my desktop, mpv does not consume much
>> > CPU and is entirely interactive. Your test suggests that you
>> > expect ULE to prioritize a CPU hog, which doesn't seem realistic
>> > absent some scheduling hints from the user or the program itself.
>> > Problem 2 is the opposite problem: timesharing CPU hogs are
>> > allowed to starve other timesharing threads.
>> >
>> Now that I have pointed out that anything >= 88 is *NOT*
>> interactive, are you sure your mpv was considered interactive
>> anyway?

> Yes.
>

See above :)

>> I don't expect ULE to prioritize CPU hogs. I'm pointing out how a
>> hog which was part of an interactive program got shafted, further
>> showing how the method based on counting off-CPU time does not work.

> You're saying that interactivity scoring should take into account
> overall process behaviour instead of just thread behaviour? Sure,
> that could be reasonable.
>

That's part of it, yes.

>> >> At the same time, the file manager which was left in the
>> >> background kept up its evil syscall usage, as a result bouncing
>> >> between a regular timesharing priority and one which made it
>> >> "interactive", even though the program had not been touched for
>> >> minutes.
>> >>
>> >> Or to put it differently, the scheduler failed to recognize that
>> >> mpv was the program to prioritize, all while thinking the
>> >> background time waster was the thing to look after (so to speak).
>> >>
>> >> This brings us to fixing problem 2: currently, due to the
>> >> existence of said problem, the interactivity scoring woes are
>> >> less acute -- the venerable make -j example is struggling to get
>> >> CPU time, and as a result messes with real interactive programs
>> >> to a lesser extent. If that gets fixed, we are in a different
>> >> boat altogether.
>> >>
>> >> I don't see a clean solution.
>> >>
>> >> Right now I'm toying with the idea of either:
>> >> 1. having programs explicitly tell the kernel they are
>> >> interactive
>> >
>> > I don't see how this can work. It's not just traditional
>> > "interactive" programs that benefit from this scoring; it applies
>> > also to network servers and other programs which spend most of
>> > their time sleeping but want to handle requests with low latency.
>> >
>> > Such an interface would also let any program request soft realtime
>> > scheduling without giving up the ability to monopolize CPU time,
>> > which goes against ULE's fairness goals.
>> >
>> Clearly it would be gated by some permission, so only available on
>> a desktop, for example.
>>
>> Then again, see my response elsewhere in the thread: the X server
>> could be patched to mark threads.

> To do what?
>

To tell the kernel they are interactive clients, so that it does not
have to speculate. Same with pulseaudio and whatever consumes /dev/dsp
directly. A sketch of what such an interface could look like follows
below.

>> And it does not go against any fairness goals -- it very much can
>> be achieved, but then one has the information on who can be put off
>> CPU for a longer time without introducing issues.
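To make that concrete: from userspace, the marking could look
something like the following. This is purely hypothetical -- no such
procctl(2) command exists today; PROC_INTERACT_CTL and INTERACT_ENABLE
are invented for illustration, and the kernel side would gate the
request behind a priv(9)/MAC check as discussed above.

#include <sys/procctl.h>
#include <unistd.h>

/* Hypothetical command -- not present in sys/procctl.h today. */
#define PROC_INTERACT_CTL   64
#define INTERACT_ENABLE     1

/*
 * A patched X server or pulseaudio would call this once at startup to
 * tell the scheduler that its threads serve interactive clients.
 */
static int
mark_self_interactive(void)
{
        int arg = INTERACT_ENABLE;

        return (procctl(P_PID, getpid(), PROC_INTERACT_CTL, &arg));
}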
>> >> 2. adding a scheduler hook to /dev/dsp -- the observation is
>> >> that if a program is producing sound, it probably should get
>> >> some CPU time in a timely manner. This would cover audio/video
>> >> players and web browsers,
>> >
>> > On my system at least, firefox doesn't open /dev/dsp, it sends
>> > audio streams to pulseaudio.
>> >
>> I think I noted elsewhere in the thread that pulseaudio may need
>> the same treatment as the X server.
>>
>> >> but it would not cover other programs (say, a pdf reader). It
>> >> may be that it is good enough, though.
>> >
>> > I think some more thorough analysis, using tools like schedgraph
>> > or KUtrace[1], is needed to characterize the problems you are
>> > reporting with interactivity scoring. It's also not clear how any
>> > of this would address the problem that started this thread,
>> > wherein two competing timesharing (i.e., non-interactive)
>> > workloads get uneven amounts of CPU time.
>> >
>> I explicitly stated that I have not looked into this bit.
>>
>> > There is absolutely room for improvement in ULE's scheduling
>> > decisions. It seems to be common practice to tune various ULE
>> > parameters to get better interactive performance, but in general
>> > I see no analysis explaining /why/ exactly they help and what
>> > goes wrong with the default parameter values in specific
>> > workloads. schedgraph is a very useful tool for this sort of
>> > thing.
>> >
>> I tried schedgraph in the past to look at buildkernel and found it
>> does not cope with the number of threads, at least on my laptop.
>>
>> > Such tools are also required to rule out bugs in ULE itself when
>> > looking at abnormal scheduling behaviour. Last year some scheduler
>> > races[2] were fixed that apparently hurt system performance on
>> > EPYC quite a bit. I was told privately that applying those patches
>> > to 13.1 improved IPSec throughput by ~25% on EPYC, and I wouldn't
>> > be surprised if there are more improvements to be had which don't
>> > involve modifying core heuristics of the scheduler. Either way,
>> > this requires deeper analysis of ULE's micro-level behaviour; I
>> > don't think "interactivity scoring is bogus" is a useful starting
>> > point.
>> >
>> I provided explicit examples of how it marked a background thread
>> as interactive, while marking the real hard worker (if you will) as
>> not interactive, because said worker was not acting the way ULE
>> expects.
>>
>> A bandaid for the time being will stop shafting processes which
>> give up their time slice early in the batch queue, along with
>> providing some fairness for the rest which do not (like firefox).
>> I'll hack it up for testing.
>>
>> --
>> Mateusz Guzik

-- 
Mateusz Guzik