From: Mateusz Guzik
Date: Wed, 10 May 2023 16:14:03 +0200
Subject: Re: Periodic rant about SCHED_ULE
To: Tomoaki AOKI
Cc: freebsd-hackers@freebsd.org
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
In-Reply-To: <20230503084150.061d508f44fc1d79b18f0110@dec.sakura.ne.jp>
References: <8173cc7e-e934-dd5c-312a-1dfa886941aa@FreeBSD.org>
 <8cfdb951-9b1f-ecd3-2291-7a528e1b042c@m5p.com>
 <20230331215751.166a294f7382c85b545f53a2@dec.sakura.ne.jp>
 <20230503084150.061d508f44fc1d79b18f0110@dec.sakura.ne.jp>

On 5/3/23, Tomoaki AOKI wrote:
> On Mon, 1 May 2023 03:33:18 +0200
> Mateusz Guzik wrote:
>
>> On 5/1/23, Mateusz Guzik wrote:
>> > On 3/31/23, Tomoaki AOKI wrote:
>> >> On Mon, 27 Mar 2023 16:47:04 +0200
>> >> Mateusz Guzik wrote:
>> >>
>> >>> On 3/27/23, Mateusz Guzik wrote:
>> >>> > On 3/25/23, Mateusz Guzik wrote:
>> >>> >> On 3/23/23, Mateusz Guzik wrote:
>> >>> >>> On 3/22/23, Mateusz Guzik wrote:
>> >>> >>>> On 3/22/23, Steve Kargl
>> >>> >>>> wrote:
>> >>> >>>>> On Wed, Mar 22, 2023 at 07:31:57PM +0100, Matthias Andree
>> >>> >>>>> wrote:
>> >>> >>>>>>
>> >>
>> >> (snip)
>> >>
>> >>> >
>> >>> > I repeat the setup: 8 cores, 8 processes doing cpu-bound stuff
>> >>> > while
>> >>> > niced to 20 vs make -j buildkernel
>> >>> >
>> >>> > I had a little more look here, slapped in some hacks as a POC and
>> >>> > got
>> >>> > an improvement from 67 minutes above to 21.
>> >>> >
>> >>> > Hacks are:
>> >>> > 1. limit hog timeslice to 1 tick so that is more eager to bail
>> >>> > 2. always preempt if pri < cpri
>> >>> >
>> >>> > So far I can confidently state the general problem: ULE penalizes
>> >>> > non-cpu hogs for blocking (even if it is not their fault, so to
>> >>> > speak)
>> >>> > and that bumps their prio past preemption threshold, at which point
>> >>> > they can't preempt said hogs (despite hogs having a higher
>> >>> > priority).
>> >>> > At the same time hogs use their full time slices, while non-hogs
>> >>> > get
>> >>> > off cpu very early and have to wait a long time to get back on, at
>> >>> > least in part due to inability to preempt said hogs.
>> >>> >
>> >>> > As I mentioned elsewhere in the thread, interactivity scoring takes
>> >>> > "voluntary off cpu time" into account. As literally anything but
>> >>> > getting preempted counts as "voluntary sleep", workers get shafted
>> >>> > for
>> >>> > going off cpu while waiting on locks in the kernel.
>> >>> >
>> >>> > If I/O needs to happen and the thread waits for the result, it most
>> >>> > likely does it early in its timeslice and once it's all ready it
>> >>> > waits
>> >>> > for background hogs to get off cpu -- it can't preempt them.
>> >>> >
>> >>> > All that said:
>> >>> > 1. "interactivity scoring" (see sched_interact_score)
>> >>> >
>> >>> > I don't know if it makes any sense to begin with. Even if it does,
>> >>> > it
>> >>> > counts stuff it should not by not differentiating between
>> >>> > deliberately
>> >>> > going off cpu (e.g., actual sleep) vs just waiting for a file being
>> >>> > read. Imagine firefox reading a file from disk and being considered
>> >>> > less interactive for it.
>> >>> >
>> >>> > I don't have a solution for this problem. I *suspect* the way to go
>> >>> > would be to explicitly mark xorg/wayland/whatever as "interactive"
>> >>> > and
>> >>> > have it inherited by its offspring. At the same time it should not
>> >>> > follow to stuff spawned in terminals. Not claiming this is perfect,
>> >>> > but it does eliminate the guessing game.
>> >>> >
>> >>> > Even so, 4BSD does not have any mechanism of the sort and
>> >>> > reportedly
>> >>> > remains usable on a desktop just by providing some degree of
>> >>> > fairness.
>> >>> >
>> >>> > Given that, I suspect the short term solution would whack said
>> >>> > scoring
>> >>> > altogether and focus on fairness (see below).
>> >>> >
>> >>> > 2. fairness
>> >>> >
>> >>> > As explained above doing any offcpu-time inducing work instantly
>> >>> > shafts threads versus cpu hogs, even if said hogs are niced way
>> >>> > above
>> >>> > them.
>> >>> >
>> >>> > Here I *suspect* position to add in the runqueue should be related
>> >>> > to
>> >>> > how much slice was left when the thread went off cpu, while making
>> >>> > sure that hogs get to run eventually. Not that I have a nice way of
>> >>> > implementing this -- maybe a separate queue for known hogs and
>> >>> > picking
>> >>> > them every n turns or similar.
>> >>> >
>> >>>
>> >>> Aight, now that I had a sober look at the code I think I cracked the
>> >>> case.
>> >>>
>> >>> The runq mechanism used by both 4BSD and ULE provides 64(!) queues,
>> >>> where the priority is divided by said number and that's how you know
>> >>> in which queue to land the thread.
>> >>>
>> >>> When deciding what to run, 4BSD uses runq_choose which iterates all
>> >>> queues from the beginning. This means threads of lower priority keep
>> >>> executing before the rest. In particular cpu hog lands with a high
>> >>> priority, looking worse than make -j 8 buildkernel and only running
>> >>> when there is nothing else ready to get the cpu. While this may sound
>> >>> decent, it is bad -- in principle a steady stream of lower priority
>> >>> threads can starve the hogs indefinitely.
>> >>>
>> >>> The problem was recognized when writing ULE, but improperly fixed --
>> >>> it ends up distributing all threads within given priority range
>> >>> across
>> >>> the queues and then performing a lookup in a given queue. Here the
>> >>> problem is that while technically everyone does get a chance to run,
>> >>> the threads not using full slices are hosed for the time period as
>> >>> they wait for the hog *a lot*.
>> >>>
>> >>> A hack patch to induce the bogus-but-better 4BSD behavior of draining
>> >>> all runqs before running higher prio threads drops down build time to
>> >>> ~9 minutes, which is shorter than 4BSD.
>> >>>
>> >>> However, the right fix would achieve that *without* introducing
>> >>> starvation potential.
>> >>>
>> >>> I also note the runqs are a massive waste of memory and computing
>> >>> power. I'm going to have to sleep on what to do here.
>> >>>
>> >>> For interested here is the hackery:
>> >>> https://people.freebsd.org/~mjg/.junk/ule-poc-hacks-dont-use.diff
>> >>>
>> >>> sysctl kern.sched.slice_nice=0
>> >>> sysctl kern.sched.preempt_thresh=400 # arbitrary number higher than any prio
>> >>>
>> >>> --
>> >>> Mateusz Guzik
>> >>
>> >> Thanks for the patch.
>> >> Applied on top of main, amd64 at commit
>> >> 9d33a9d96f5a2cd88d0955b5b56ef5058b1706c1, setup 2 sysctls as you
>> >> mentioned and tested as below
>> >>
>> >> *Play flac files by multimedia/audacious via audio/virtual_oss
>> >> *Running www/firefox (not touched while testing)
>> >> *Forcibly build lang/rust
>> >> *Play games/aisleriot
>> >>
>> >> at the same time.
>> >> games/aisleriot runs slower than the situation lang/rust is not in
>> >> build, but didn't "freeze" and audacious normally played next music on
>> >> playlist, even on lang/rust is building codes written in rust.
>> >>
>> >> This is GREAT advance!
>> >> Without the patch, compiling rust codes eats up almost 100% of ALL
>> >> cores, and games/aisleriot often FREEZES SEVERAL MINUTES, and
>> >> multimedia/audacious needs to wait for, at worst, next music for FEW
>> >> MINUTES. (Once playback starts, the music is played normally until it
>> >> ends.)
>> >>
>> >> But unfortunately, the patch cannot be applied to stable/13, as some
>> >> prerequisite commits are not MFC'ed.
>> >> Missing commits are at least as below. There should be more, as I
>> >> gave up further tracking and haven't actually merged them to test.
>> >>
>> >> commit 954cffe95de1b9d70ed804daa45b7921f0f5c9da [1]
>> >> ule: Simplistic time-sharing for interrupt threads.
>> >>
>> >> commit fea89a2804ad89f5342268a8546a3f9b515b5e6c [2]
>> >> Add sched_ithread_prio to set the base priority of an interrupt
>> >> thread.
>> >>
>> >> commit 85b46073242d4666e1c9037d52220422449f9584 [3]
>> >> Deduplicate bus_dma bounce code.
>> >>
>> >>
>> >> [1]
>> >> https://cgit.freebsd.org/src/commit/?id=954cffe95de1b9d70ed804daa45b7921f0f5c9da
>> >>
>> >> [2]
>> >> https://cgit.freebsd.org/src/commit/?id=fea89a2804ad89f5342268a8546a3f9b515b5e6c
>> >>
>> >> [3]
>> >> https://cgit.freebsd.org/src/commit/?id=85b46073242d4666e1c9037d52220422449f9584
>> >>
>> >> --
>> >> Tomoaki AOKI
>> >>
>> >>
>> >
>> > Hello everyone.
>> >
>> > I sorted out a patch I consider comittable for the time being. IT IS
>> > NOT A PANACEA by any means, but it does sort out the most acute
>> > problem and should be a win for most people. It also comes with a knob
>> > to turn it off.
>> >
>> > That said, can you test this please:
>> > https://people.freebsd.org/~mjg/ule_pickshort.diff
>> >
>> > works against fresh main. if you are worried about recent zfs woes,
>> > just make sure you don't zpool upgrade and will be fine.
>> >
>>
>> Here is an updated patch:
>> https://people.freebsd.org/~mjg/ule_pickshortv2.diff
>>
>> if you are getting bad results, do:
>> sysctl kern.sched.preempt_bottom=0
>>
>> and try again.
>>
>> thank you.
>>
>> --
>> Mateusz Guzik
>
> Tried just a bit, and turned out my workload requires
> kern.sched.preempt_bottom=0.
>
> Tested with previous patch (ule-poc-hacks-dont-use.diff) backed out.
> At commit d713e0891ff9ab8246245c3206851d486ecfdd37, amd64.
>
> What I did for tests:
> While building lang/rust, in massive rustc phase,
> *Play FLAC music files using multimedia/audacious
>
> *Play youtube movies in www/firefox
>
> *edit some junk text on editors/leafpad
>
> multimedia/audacious plays via audio/virtual_oss.
> www/firefox plays sounds via audio/pulseaudio
> (Backed with audio/virtual_oss.)
>
> What happened:
> Without kern.sched.preempt_bottom=0 (was 135), all sounds
> are chopping unless any keytypes or mouse actions are done.
> Text editing was mostly smooth, but always slow.
> These are regardless kern.sched.preempt_thresh settings.
>
> With kern.sched.preempt_bottom=0,
> *Playing FLAC music files using multimedia/audacious is fine.
>
> *Playing youtube movies in www/firefox depends on
> kern.sched.preempt_thresh setting. The larger the value, the
> smoother the playback is. Tested with values 21, 80, 121, 224, 400.
> Note that 224 and 400 looked almost the same with my eyeballs.
>
> *Editing texts is faster, but just sometimes cursor movements
> and editing was not smooth (randomly).
>
> Sorry, as I'm basically working on stable/13, I cannot take enough time
> for testing on main.
>
>

thanks for testing

I posted the patch for review:
https://reviews.freebsd.org/D40045

it will be suitable for MFC to stable/13

-- 
Mateusz Guzik
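
For readers following the runq discussion quoted in this thread, here is a
rough toy model of the two selection strategies it contrasts: always scanning
the 64 queues from the beginning, versus starting the scan at a moving index
so that every queue eventually gets a turn. This is only a sketch, not the
kernel code. The RunQ class, its add/choose methods, the simulate helper, the
priorities 120 and 200, and the four-priorities-per-queue mapping are all
assumptions made for illustration; the rotating start index is a deliberate
simplification of what ULE does.

```python
from collections import deque

RQ_NQS = 64  # number of queues, as stated in the quoted message
RQ_PPQ = 4   # assumption: 256 priority levels spread four per queue


class RunQ:
    """Toy run queue: RQ_NQS FIFO buckets indexed by priority band."""

    def __init__(self):
        self.queues = [deque() for _ in range(RQ_NQS)]

    def add(self, name, pri):
        # Lower numeric priority lands in a lower bucket index.
        self.queues[pri // RQ_PPQ].append((name, pri))

    def choose(self, start=0):
        # Scan the buckets circularly beginning at `start` and pop the
        # first runnable thread found; return None if all are empty.
        for step in range(RQ_NQS):
            bucket = self.queues[(start + step) % RQ_NQS]
            if bucket:
                return bucket.popleft()
        return None


def simulate(rotating, ticks=RQ_NQS):
    """Return the names of the threads that ran, one per tick."""
    rq = RunQ()
    rq.add("worker", 120)  # bucket 30
    rq.add("hog", 200)     # bucket 50, numerically worse priority
    ran = []
    for tick in range(ticks):
        # Hypothetical policies: fixed scan from bucket 0, or a scan
        # whose starting bucket advances by one every tick.
        start = tick % RQ_NQS if rotating else 0
        name, pri = rq.choose(start)
        ran.append(name)
        rq.add(name, pri)  # both threads are always runnable again
    return ran


if __name__ == "__main__":
    print("scan from start:", simulate(False).count("hog"), "hog turns")
    print("rotating start: ", simulate(True).count("hog"), "hog turns")
```

In this model the hog gets 0 of 64 turns when the scan always starts at the
first queue, which is the starvation potential described above, and 20 of 64
turns with the rotating start. The second case shows the other side of the
complaint: a thread that does not use its full slice still has to wait behind
the hog for a sizeable share of the time.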