From nobody Mon Apr 17 17:35:10 2023
From: Mateusz Guzik <mjguzik@gmail.com>
Date: Mon, 17 Apr 2023 19:35:10 +0200
Subject: Re: Periodic rant about SCHED_ULE
To: Mark Johnston
Cc: freebsd-hackers@freebsd.org
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
Oops, this fell through the cracks, apologies for such a late reply.

On 3/31/23, Mark Johnston wrote:
> On Fri, Mar 31, 2023 at 08:41:41PM +0200, Mateusz Guzik wrote:
>> On 3/30/23, Mark Johnston wrote:
>> > On Thu, Mar 30, 2023 at 05:36:54PM +0200, Mateusz Guzik wrote:
>> >> I looked into it a little more; below you can find a summary and
>> >> steps forward.
>> >>
>> >> First, a general statement: while ULE does have performance bugs,
>> >> it has a better basis than 4BSD for making scheduling decisions.
>> >> Most notably, it understands CPU topology, at least for cases
>> >> which don't involve big.LITTLE. Any non-freak case where 4BSD
>> >> performs better is a bug in ULE, unless it comes down to a
>> >> tradeoff which can be tweaked to line them up. Or, more to the
>> >> point: there should not be any legitimate reason to use 4BSD
>> >> these days, and modulo the bugs below you are probably losing
>> >> performance by doing so.
>> >>
>> >> Bugs reported in this thread by others and confirmed by me:
>> >> 1. failure to load-balance when having n CPUs and n + 1 workers
>> >> -- the excess one stays on the same CPU thread continuously,
>> >> penalizing the same victim. As a result, the total real time to
>> >> execute a finite computation is longer than in the 4BSD case.
>> >> 2. unfairness of nice -n 20 threads vs threads going frequently
>> >> off CPU (e.g., due to I/O) -- after using only a fraction of its
>> >> slice, the victim has to wait for the CPU hog to use up its
>> >> entire slice, rinse and repeat. This extends a 7+ minute
>> >> buildkernel to over 67 minutes; not an issue on 4BSD.
>> >>
>> >> I put almost no effort into investigating no. 1. There is code
>> >> which is supposed to rebalance load across CPUs; someone(tm) will
>> >> have to sit through it -- for all I know the fix is trivial.
>> >>
>> >> Fixing number 2 makes *another* bug more acute and complicates
>> >> the whole ordeal.
>> >>
>> >> Thus, a bug reported by me:
>> >> 3. interactivity scoring is bogus -- originally introduced to
>> >> detect "interactive" behavior by equating being off CPU with
>> >> waiting for user input. One part of the problem is that it puts
>> >> *all* non-preempted off-CPU time into one bag: a voluntary sleep.
>> >> This includes suffering from lock contention in the kernel, lock
>> >> contention in the program itself,
>> >
>> > Note that time spent off-CPU on a turnstile is not counted as
>> > sleeping for the purpose of interactivity scoring, so this
>> > observation applies only to sx, lockmgr and sleepable rm locks.
>> > That's not to say that this behaviour is correct, but it doesn't
>> > apply to some of the most contended locks unless I'm missing
>> > something.
>> >
>> page busy (massively contested for fork/exec), pipe_lock and even
>> not-locks like waitpid(!)

> A program that spends most of its time blocked in waitpid, like a
> shell, interactive or not, should indeed have a higher scheduling
> priority...
>

Maybe it should, but perhaps not at the expense of a more
latency-sensitive program like a video player. The very notion that
off CPU == interactive dates back to the 80s, where it probably made
sense: the unix systems of the time were mostly terminal-only and the
shell would indeed fit here very nicely.
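For reference, the scoring in question boils down to comparing
accumulated run time against accumulated sleep time, roughly as below.
This is a simplified sketch of sched_interact_score() from
sys/kern/sched_ule.c with the bookkeeping and decay elided, not the
verbatim kernel code; the point is that "sleep" is one undifferentiated
bucket regardless of why the thread went off CPU.

/*
 * Kernel-style sketch (u_long and MAX as in sys/param.h). Scores run
 * 0..100; lower means "more interactive". Anything at or below the
 * threshold (kern.sched.interact, 30 by default) is placed on the
 * realtime (interactive) queues.
 */
#define SCHED_INTERACT_MAX   100
#define SCHED_INTERACT_HALF  (SCHED_INTERACT_MAX / 2)

static int
interact_score(u_long runtime, u_long sleeptime)
{
        u_long div;

        if (sleeptime > runtime) {
                /* Mostly off CPU -- for whatever reason: [0, 50). */
                div = MAX(1, sleeptime / SCHED_INTERACT_HALF);
                return (runtime / div);
        }
        if (runtime > sleeptime) {
                /* Mostly on CPU: (50, 100]. */
                div = MAX(1, runtime / SCHED_INTERACT_HALF);
                return (SCHED_INTERACT_HALF +
                    (SCHED_INTERACT_HALF - sleeptime / div));
        }
        /* Equal run and sleep time. */
        return (runtime != 0 ? SCHED_INTERACT_HALF : 0);
}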
>> >> file I/O and so on, none of which has any bearing on how
>> >> interactive or not the program might happen to be. A bigger part
>> >> of the problem is that, at least today, graphical programs don't
>> >> even act this way to begin with -- they stay on CPU *a lot*.
>> >
>> > I think this statement deserves more nuance. I was on a video call
>> > just now and firefox was consuming about the equivalent of 20-30%
>> > of a CPU across all threads. What kind of graphical programs are
>> > you talking about specifically?
>> >
>> you don't consider 20-30% a lot?

> I would expect a program consuming 20-30% of a CPU to be prioritized
> higher than a CPU hog. And in my experience, running builds while on
> a call doesn't hurt anything (usually). Again, there is room for
> improvement; I don't claim the scheduler is perfect.
>

As noted, one of the performance bugs is that the scheduler
*unintentionally* penalizes threads which go off CPU a lot for short
periods. If the scheduler keeps them in the batch range and there is a
hog in the area, they end up getting disproportionately less CPU. A
kernel build is one example I noted -- a several-fold increase in
total real time when competing with CPU hogs, while struggling to get
any time at all. For all I know this bug is why it works fine for you.

>> >> I asked people to provide me with the output of: dtrace -n
>> >> 'sched:::on-cpu { @[execname] = lquantize(curthread->td_priority,
>> >> 0, 224, 1); }' from their laptops/desktops.
>> >>
>> >> One finding is that most people (at least those who reported) use
>> >> firefox.
>> >>
>> >> Another finding is that the browser is above the threshold which
>> >> would be considered "interactive" for the vast majority of the
>> >> time in all reported cases.
>> >
>> > That is not true of the output that I sent. There, most of the
>> > firefox thread samples are in the interactive range [88-135]. Some
>> > show an even higher priority, maybe due to priority propagation.
>> >
>> That's not the interactive range. 88 is PRI_MIN_BATCH

> 88 is PRI_MIN_TIMESHARE (on main; stable/13 ranges are different I
> think). PRI_MIN_BATCH is PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE = 88
> + 48 = 136. Everything in [88-135] goes into the realtime queue.
>

You are right, I misread the code; static_boost setting the priority
to 72 solidified my misread. Interestingly, this does not change the
crux of the matter -- that non-interactive processes cluster, in terms
of priorities, with ones which are interactive. You can see it in your
own report.
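Spelling out the arithmetic for anyone not staring at the headers (a
sketch of how the constants fall out on main; the real definitions
live in sys/sys/priority.h and sys/kern/sched_ule.c, and stable/13
differs):

#define PRI_MIN_TIMESHARE   88
#define PRI_MAX_TIMESHARE   223
#define PRI_TIMESHARE_RANGE \
        (PRI_MAX_TIMESHARE - PRI_MIN_TIMESHARE + 1)         /* 136 */
#define SCHED_PRI_NRESV     40  /* the nice range: PRIO_MAX - PRIO_MIN */
#define PRI_INTERACT_RANGE \
        ((PRI_TIMESHARE_RANGE - SCHED_PRI_NRESV) / 2)       /* 48 */
#define PRI_MIN_BATCH       (PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE)

/*
 * PRI_MIN_BATCH = 88 + 48 = 136, so [88, 135] is the interactive band
 * (fed to the realtime queue) and [136, 223] is the batch band.
 */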
>> >> I booted a 2-thread VM with xfce and decided to click around.
>> >> Spawned firefox, opened a file manager (Thunar) and from there
>> >> opened a movie to play with mpv. As root I spawned make -j 2
>> >> buildkernel. It was not particularly good :)
>> >>
>> >> I found that mpv spawns a bunch of threads, most notably 2
>> >> distinct threads for audio and video output. The one for video
>> >> got a priority of 175, while the rest had either 88 or 89 -- the
>> >> lowest for timesharing not considered interactive [note: lower is
>> >> considered better].
>> >
>> > Presumably all of the video decoding was done in software, since
>> > you're running in a VM? On my desktop, mpv does not consume much
>> > CPU and is entirely interactive. Your test suggests that you
>> > expect ULE to prioritize a CPU hog, which doesn't seem realistic
>> > absent some scheduling hints from the user or the program itself.
>> > Problem 2 is the opposite problem: timesharing CPU hogs are
>> > allowed to starve other timesharing threads.
>> >
>> Now that I have pointed out that anything >= 88 is *NOT*
>> interactive, are you sure your mpv was considered interactive
>> anyway?

> Yes.
>

See above :)

>> I don't expect ULE to prioritize CPU hogs. I'm pointing out how a
>> hog which was part of an interactive program got shafted, further
>> showing how the method based on counting off-CPU time does not work.

> You're saying that interactivity scoring should take into account
> overall process behaviour instead of just thread behaviour? Sure,
> that could be reasonable.
>

That's part of it, yes.

>> >> At the same time, the file manager which was left in the
>> >> background kept up its evil syscall usage, as a result bouncing
>> >> between a regular timesharing priority and one which made it
>> >> "interactive", even though the program had not been touched for
>> >> minutes.
>> >>
>> >> Or to put it differently, the scheduler failed to recognize that
>> >> mpv was the program to prioritize, all while thinking the
>> >> background time waster was the thing to look after (so to speak).
>> >>
>> >> This brings us to fixing problem 2: currently, due to the
>> >> existence of said problem, the interactivity scoring woes are
>> >> less acute -- the venerable make -j example is struggling to get
>> >> CPU time, and as a result messes with real interactive programs
>> >> to a lesser extent. If that gets fixed, we are in a different
>> >> boat altogether.
>> >>
>> >> I don't see a clean solution.
>> >>
>> >> Right now I'm toying with the idea of either:
>> >> 1. having programs explicitly tell the kernel they are
>> >> interactive
>> >
>> > I don't see how this can work. It's not just traditional
>> > "interactive" programs that benefit from this scoring; it applies
>> > also to network servers and other programs which spend most of
>> > their time sleeping but want to handle requests with low latency.
>> >
>> > Such an interface would also let any program request soft realtime
>> > scheduling without giving up the ability to monopolize CPU time,
>> > which goes against ULE's fairness goals.
>> >
>> Clearly it would be gated by some permission, so only available on
>> a desktop, for example.
>>
>> Then again, see my response elsewhere in the thread: the X server
>> could be patched to mark threads.

> To do what?
>

To tell the kernel they are interactive clients, so that it does not
have to speculate. Same with pulseaudio and whatever consumes /dev/dsp
directly. A sketch of what such an interface could look like follows
below.

>> And it does not go against any fairness goals -- it very much can
>> be achieved, but then one has the information on who can be put off
>> CPU for a longer time without introducing issues.
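To make that concrete: from userspace, the marking could look
something like the following. This is purely hypothetical -- no such
procctl(2) command exists today; PROC_INTERACT_CTL and INTERACT_ENABLE
are invented for illustration, and the kernel side would gate the
request behind a priv(9)/MAC check as discussed above.

#include <sys/procctl.h>
#include <unistd.h>

/* Hypothetical command -- not present in sys/procctl.h today. */
#define PROC_INTERACT_CTL   64
#define INTERACT_ENABLE     1

/*
 * A patched X server or pulseaudio would call this once at startup to
 * tell the scheduler that its threads serve interactive clients.
 */
static int
mark_self_interactive(void)
{
        int arg = INTERACT_ENABLE;

        return (procctl(P_PID, getpid(), PROC_INTERACT_CTL, &arg));
}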
>> >> 2. adding a scheduler hook to /dev/dsp -- the observation is
>> >> that if a program is producing sound, it probably should get
>> >> some CPU time in a timely manner. This would cover audio/video
>> >> players and web browsers,
>> >
>> > On my system at least, firefox doesn't open /dev/dsp, it sends
>> > audio streams to pulseaudio.
>> >
>> I think I noted elsewhere in the thread that pulseaudio may need
>> the same treatment as the X server.
>>
>> >> but it would not cover other programs (say, a pdf reader). It
>> >> may be that it is good enough, though.
>> >
>> > I think some more thorough analysis, using tools like schedgraph
>> > or KUtrace[1], is needed to characterize the problems you are
>> > reporting with interactivity scoring. It's also not clear how any
>> > of this would address the problem that started this thread,
>> > wherein two competing timesharing (i.e., non-interactive)
>> > workloads get uneven amounts of CPU time.
>> >
>> I explicitly stated that I have not looked into this bit.
>>
>> > There is absolutely room for improvement in ULE's scheduling
>> > decisions. It seems to be common practice to tune various ULE
>> > parameters to get better interactive performance, but in general
>> > I see no analysis explaining /why/ exactly they help and what
>> > goes wrong with the default parameter values in specific
>> > workloads. schedgraph is a very useful tool for this sort of
>> > thing.
>> >
>> I tried schedgraph in the past to look at buildkernel and found it
>> does not cope with the number of threads, at least on my laptop.
>>
>> > Such tools are also required to rule out bugs in ULE itself when
>> > looking at abnormal scheduling behaviour. Last year some scheduler
>> > races[2] were fixed that apparently hurt system performance on
>> > EPYC quite a bit. I was told privately that applying those patches
>> > to 13.1 improved IPSec throughput by ~25% on EPYC, and I wouldn't
>> > be surprised if there are more improvements to be had which don't
>> > involve modifying core heuristics of the scheduler. Either way,
>> > this requires deeper analysis of ULE's micro-level behaviour; I
>> > don't think "interactivity scoring is bogus" is a useful starting
>> > point.
>> >
>> I provided explicit examples of how it marked a background thread
>> as interactive, while marking the real hard worker (if you will) as
>> not interactive, because said worker was not acting the way ULE
>> expects.
>>
>> A bandaid for the time being will stop shafting processes which
>> give up their time slice early in the batch queue, along with
>> providing some fairness for the rest which do not (like firefox).
>> I'll hack it up for testing.
>>
>> --
>> Mateusz Guzik

-- 
Mateusz Guzik