From: Mateusz Guzik <mjguzik@gmail.com>
Date: Sat, 25 Mar 2023 02:46:43 +0100
Subject: Re: Periodic rant about SCHED_ULE
To: sgk@troutmask.apl.washington.edu
Cc: Matthias Andree, freebsd-hackers@freebsd.org
List: freebsd-hackers (archive: https://lists.freebsd.org/archives/freebsd-hackers)
On 3/23/23, Mateusz Guzik wrote:
> On 3/22/23, Mateusz Guzik wrote:
>> On 3/22/23, Steve Kargl wrote:
>>> On Wed, Mar 22, 2023 at 07:31:57PM +0100, Matthias Andree wrote:
>>>>
>>>> Yes, there are reports that FreeBSD is not responsive by default --
>>>> but this may make it get overall better throughput at the expense
>>>> of responsiveness, because it might be doing fewer context
>>>> switches. So just complaining about a longer buildworld without
>>>> seeing how much dnetc did in the same wallclock time period is
>>>> useless. Periodic rants don't fix this lack of information.
>>>>
>>>
>>> I reported the issue with ULE some 15 to 20 years ago.
>>> I gave up reporting the issue. The individuals with the
>>> requisite skills to hack on ULE did not; and yes, I lack
>>> those skills. The path of least resistance is to use
>>> 4BSD.
>>>
>>> % cat a.f90
>>> !
>>> ! Silly numerically intensive computation.
>>> !
>>> program foo
>>>    implicit none
>>>    integer, parameter :: m = 200, n = 1000, dp = kind(1.d0)
>>>    integer i
>>>    real(dp) x
>>>    real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
>>>    call random_init(.true., .true.)
>>>    allocate(a(n,n), b(n,n))
>>>    do i = 1, m
>>>       call random_number(a)
>>>       call random_number(b)
>>>       c = matmul(a,b)
>>>       x = sum(c)
>>>       if (x < 0) stop 'Whoops'
>>>    end do
>>> end program foo
>>> % gfortran11 -o z -O3 -march=native a.f90
>>> % time ./z
>>>       42.16 real        42.04 user         0.09 sys
>>> % cat foo
>>> #! /bin/csh
>>> #
>>> # Launch NCPU+1 images with a 1 second delay
>>> #
>>> foreach i (1 2 3 4 5 6 7 8 9)
>>>    ./z &
>>>    sleep 1
>>> end
>>> % ./foo
>>>
>>> In another xterm, you can watch the 9 images.
>>>
>>> % top
>>> last pid:  1709;  load averages:  4.90,  1.61,  0.79  up 0+00:56:46  11:43:01
>>> 74 processes:  10 running, 64 sleeping
>>> CPU: 99.9% user,  0.0% nice,  0.1% system,  0.0% interrupt,  0.0% idle
>>> Mem: 369M Active, 187M Inact, 240K Laundry, 889M Wired, 546M Buf, 14G Free
>>> Swap: 16G Total, 16G Free
>>>
>>>   PID USERNAME  THR PRI NICE   SIZE    RES STATE    C   TIME     CPU COMMAND
>>>  1699 kargl       1  56    0    68M    35M RUN      3   0:41  92.60% z
>>>  1701 kargl       1  56    0    68M    35M RUN      0   0:41  92.33% z
>>>  1689 kargl       1  56    0    68M    35M CPU5     5   0:47  91.63% z
>>>  1691 kargl       1  56    0    68M    35M CPU0     0   0:45  89.91% z
>>>  1695 kargl       1  56    0    68M    35M CPU2     2   0:43  88.56% z
>>>  1697 kargl       1  56    0    68M    35M CPU6     6   0:42  88.48% z
>>>  1705 kargl       1  55    0    68M    35M CPU1     1   0:39  88.12% z
>>>  1703 kargl       1  56    0    68M    35M CPU4     4   0:39  87.86% z
>>>  1693 kargl       1  56    0    68M    35M CPU7     7   0:45  78.12% z
>>>
>>> With 4BSD, you see the ./z's with 80% or greater CPU.
>>> All the ./z's exit after 55-ish seconds. If you try this experiment
>>> on ULE, you'll get NCPU-1 ./z's with nearly 99% CPU and 2 ./z's with
>>> something like 45-ish% as the two images ping-pong on one cpu. Back
>>> when I was testing ULE vs 4BSD, this was/is due to ULE's cpu
>>> affinity, where processes never migrate to another cpu. Admittedly,
>>> this was several years ago. Maybe ULE has gotten better, but
>>> George's rant seems to suggest otherwise.
>>>
>>
>> While I have not tried openmpi yet, I can confirm there is a problem
>> here, but the situation is not as clear cut as one might think.
>>
>> sched_4bsd *round robins* all workers across all CPUs, which comes at
>> a performance *hit* compared to ule if the number of workers is <=
>> the CPU count -- here ule maintains affinity, avoiding cache busting.
>> If you slap in $cpu_count + 1 workers, 4bsd keeps the round robin,
>> penalizing everyone equally, while ule mostly penalizes a select
>> victim. By the end of it, you get lower total cpu time spent, but
>> higher total real time. See below for an example.
>>
>> Two massive problems with 4bsd, apart from the mandatory round robin,
>> which also happens to help by accident:
>> 1. it has one *global* lock, meaning the scheduler itself just does
>> not scale, and this is visible at modest contemporary scales
>> 2. it does not understand topology -- no accounting is done for HT
>> nor NUMA, but I concede the latter won't be a factor for most people
>>
>> That said, ULE definitely has performance bugs which need to be
>> fixed. At least for the case below, 4BSD just "lucked" into sucking
>> less simply because it is so primitive.
>>
>> I wrote a cpu burning program (memset 1 MB in a loop, with enough
>> iterations to take ~20 seconds on its own).
>>
>> I booted an 8 core bhyve vm, where I made sure to cpuset it to 8
>> distinct cores.
>>
>> The test runs *9* workers; here is a sample run:
>>
>> 4bsd:
>>        23.18 real        20.81 user         0.00 sys
>>        23.26 real        20.81 user         0.00 sys
>>        23.30 real        20.81 user         0.00 sys
>>        23.34 real        20.82 user         0.00 sys
>>        23.41 real        20.81 user         0.00 sys
>>        23.41 real        20.80 user         0.00 sys
>>        23.42 real        20.80 user         0.00 sys
>>        23.53 real        20.81 user         0.00 sys
>>        23.60 real        20.80 user         0.00 sys
>> 187.31s user 0.02s system 793% cpu 23.606 total
>>
>> ule:
>>        20.67 real        20.04 user         0.00 sys
>>        20.97 real        20.00 user         0.00 sys
>>        21.45 real        20.29 user         0.00 sys
>>        21.51 real        20.22 user         0.00 sys
>>        22.77 real        20.04 user         0.00 sys
>>        22.78 real        20.26 user         0.00 sys
>>        23.42 real        20.04 user         0.00 sys
>>        24.07 real        20.30 user         0.00 sys
>>        24.46 real        20.16 user         0.00 sys
>> 181.41s user 0.07s system 741% cpu 24.465 total
>>
>> It reliably uses 187s of user time on 4BSD and 181s on ULE. At the
>> same time, there is reliably a massive imbalance in total real time
>> between the fastest and slowest workers, *and* ULE reliably takes
>> more total real time to finish the entire thing.
>>
>> iow this is a tradeoff, but most likely a bad one
>>
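[For reference, the cpu burner mentioned above boils down to something
like the sketch below. This is a from-memory reconstruction, not the
exact program: the 1 MB memset matches the description, while the
default iteration count, the argument handling and the pid/ops report
at exit are illustrative.]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUFSIZE	(1024 * 1024)	/* memset 1 MB per iteration */

int
main(int argc, char **argv)
{
	/* sized so that a solo run takes ~20 seconds; tune per machine */
	long iters = (argc > 1) ? strtol(argv[1], NULL, 10) : 100000;
	volatile char sink;
	char *buf;
	long i;

	buf = malloc(BUFSIZE);
	if (buf == NULL) {
		perror("malloc");
		return (1);
	}
	for (i = 0; i < iters; i++)
		memset(buf, (int)(i & 0xff), BUFSIZE);
	/* read the buffer back so the compiler cannot elide the work */
	sink = buf[0];
	(void)sink;
	/* the nice -n 20 hoggers print pid/ops like this at the end */
	printf("%d: %ld\n", (int)getpid(), i);
	free(buf);
	return (0);
}

[Compile with cc -O2 and launch one copy per cpu plus one, or nice -n
20 copies for the hogger variant below.]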
> So I also ran the following setup: an 8 core vm doing -j 8
> buildkernel, while 8 nice -n 20 processes are cpu-bound. After the
> build ends, the workers report how many ops they did in that time.
>
> ule is way off the reservation here.
>
> An unimpeded build takes 441.39 real, 3135.63 user, 266.92 sys --
> similar on both schedulers.
>
> With cpu hoggers:
> 4bsd:  657.22 real 3152.02 user 253.30 sys [+49%]
> ule:  4427.69 real 3225.33 user 194.86 sys [+903%]
>
> ule spends less time in the kernel, but real time blows up -- over
> 10x the baseline. This is clearly a total non-starter.
>
> Full stats:
>
> 4bsd:
> hogger pid/ops
> 58315: 5546013
> 58322: 5557294
> 58321: 5545052
> 58313: 5546347
> 58318: 5537874
> 58317: 5555303
> 58323: 5545116
> 58320: 5548530
>
> runtimes:
>       657.23 real       230.02 user         0.02 sys
>       657.23 real       229.83 user         0.00 sys
>       657.23 real       230.50 user         0.00 sys
>       657.23 real       230.53 user         0.00 sys
>       657.23 real       230.14 user         0.01 sys
>       657.23 real       230.19 user         0.00 sys
>       657.23 real       230.09 user         0.00 sys
>       657.23 real       230.10 user         0.03 sys
>
> kernel build:
>       657.22 real      3152.02 user       253.30 sys
>
> ule:
> hogger pid/ops
> 77794: 95916836
> 77792: 95909794
> 77789: 96454064
> 77796: 95813778
> 77791: 96728121
> 77795: 96678291
> 77798: 97258060
> 77797: 96347984
>
> runtimes:
>      4427.70 real      4001.94 user         0.10 sys
>      4427.70 real      3980.68 user         0.16 sys
>      4427.70 real      3973.96 user         0.10 sys
>      4427.70 real      3980.11 user         0.13 sys
>      4427.70 real      4012.32 user         0.07 sys
>      4427.70 real      4008.79 user         0.12 sys
>      4427.70 real      4034.77 user         0.09 sys
>      4427.70 real      3995.40 user         0.08 sys
>
> kernel build:
>      4427.69 real      3225.33 user       194.86 sys

I added a probe to runq_add*:

diff --git a/sys/kern/kern_switch.c b/sys/kern/kern_switch.c
index aee0c9cbd1ae..db73226547c5 100644
--- a/sys/kern/kern_switch.c
+++ b/sys/kern/kern_switch.c
@@ -357,6 +357,9 @@ runq_setbit(struct runq *rq, int pri)
 	rqb->rqb_bits[RQB_WORD(pri)] |= RQB_BIT(pri);
 }
 
+SDT_PROVIDER_DECLARE(sched);
+SDT_PROBE_DEFINE3(sched, , , runq_add, "struct thread *", "int", "int");
+
 /*
  * Add the thread to the queue specified by its priority, and set the
  * corresponding status bit.
@@ -378,6 +381,7 @@ runq_add(struct runq *rq, struct thread *td, int flags)
 	} else {
 		TAILQ_INSERT_TAIL(rqh, td, td_runq);
 	}
+	SDT_PROBE3(sched, , , runq_add, td, flags, pri);
 }
 
 void
@@ -396,6 +400,7 @@ runq_add_pri(struct runq *rq, struct thread *td, u_char pri, int flags)
 	} else {
 		TAILQ_INSERT_TAIL(rqh, td, td_runq);
 	}
+	SDT_PROBE3(sched, , , runq_add, td, flags, pri);
 }
 
 /*
  * Return true if there are runnable processes of any priority on the run

and used this one-liner (the D body is wrapped here for readability):

dtrace -b 16M -x aggsize=32M -x dynvarsize=32M -n '
sched:::runq_add
/args[0]->td_ucred->cr_uid == 2000/
{
	self->runq_t = timestamp;
}

sched:::on-cpu
/self->runq_t/
{
	@runqlat["runq", execname == "cpuburner-long" ? "cpuburner" : "rest"] =
	    quantize(timestamp - self->runq_t);
	self->runq_t = 0;
}

sched:::on-cpu
/curthread->td_ucred->cr_uid == 2000/
{
	self->oncpu_t = timestamp;
}

sched:::off-cpu
/self->oncpu_t/
{
	@oncpu["oncpu", execname == "cpuburner-long" ? "cpuburner" : "rest"] =
	    quantize(timestamp - self->oncpu_t);
}

tick-300s
{
	exit(0);
}'

It is capped at 5 minutes because past that dtrace starts running into
aggregation drops.
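[A note for anyone replaying this: the sched:::runq_add probe only
exists on a kernel booted with the patch above. Before starting the
capture I verify it registered with a plain probe listing:

dtrace -l -n sched:::runq_add

which should print one line per runq_add/runq_add_pri site.]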
Key takeaway: 4bsd lets the cpu hog stay on cpu for longer than ule,
but when the hog is taken off, it waits a long time to get back on. In
contrast, with ule it gets back on very quickly and it is everyone
else who waits big time.

Times in nanoseconds:

4bsd:

  runq cpuburner
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         2
            8192 |@@@@@@@@@@                               4343
           16384 |@@@@@                                    2159
           32768 |@                                        363
           65536 |@@@                                      1359
          131072 |                                         215
          262144 |                                         129
          524288 |                                         101
         1048576 |                                         132
         2097152 |                                         185
         4194304 |@                                        390
         8388608 |@                                        318
        16777216 |@                                        398
        33554432 |@@                                       838
        67108864 |@@@@@@@                                  3025
       134217728 |@@@@@                                    2474
       268435456 |@@@                                      1552
       536870912 |                                         153
      1073741824 |                                         0

  runq rest
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         637
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@                 364832
           16384 |@@@                                      52982
           32768 |@                                        11613
           65536 |@@                                       34286
          131072 |@@@                                      39734
          262144 |@@                                       23261
          524288 |@                                        21985
         1048576 |@                                        18999
         2097152 |@                                        10789
         4194304 |                                         6239
         8388608 |                                         4725
        16777216 |                                         4598
        33554432 |                                         4050
        67108864 |                                         5857
       134217728 |                                         3979
       268435456 |                                         2024
       536870912 |                                         2011
      1073741824 |                                         1105
      2147483648 |                                         841
      4294967296 |                                         519
      8589934592 |                                         372
     17179869184 |                                         133
     34359738368 |                                         37
     68719476736 |                                         30
    137438953472 |                                         1
    274877906944 |                                         0

  oncpu cpuburner
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         20
            8192 |                                         10
           16384 |                                         19
           32768 |                                         161
           65536 |                                         137
          131072 |                                         77
          262144 |                                         104
          524288 |                                         147
         1048576 |@                                        301
         2097152 |@                                        482
         4194304 |@@@                                      1178
         8388608 |@@@@@@@@@@@@@@@                          6971
        16777216 |@                                        474
        33554432 |@                                        669
        67108864 |@@@@@@@@@@@@@@@@                         7299
       134217728 |                                         14
       268435456 |                                         38
       536870912 |                                         38
      1073741824 |                                         2
      2147483648 |                                         2
      4294967296 |                                         0

  oncpu rest
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |                                         102
            2048 |@@@@@@@@@@@                              146555
            4096 |@@                                       23373
            8192 |@@@                                      45165
           16384 |@                                        14531
           32768 |@@@@@                                    67664
           65536 |@@@@@@                                   80155
          131072 |@@@@@                                    64609
          262144 |@@                                       21509
          524288 |@@                                       23393
         1048576 |@@                                       21058
         2097152 |@                                        7854
         4194304 |@                                        8788
         8388608 |@                                        20242
        16777216 |                                         2352
        33554432 |                                         2084
        67108864 |                                         4388
       134217728 |                                         1123
       268435456 |                                         755
       536870912 |                                         135
      1073741824 |                                         2
      2147483648 |                                         0

ule:

  runq cpuburner
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         2
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@                 37193
           16384 |@@@@@                                    8612
           32768 |@                                        1481
           65536 |@@@@@                                    8430
          131072 |@@                                       3102
          262144 |@                                        938
          524288 |                                         457
         1048576 |@                                        2063
         2097152 |                                         257
         4194304 |                                         41
         8388608 |                                         428
        16777216 |                                         12
        33554432 |                                         1
        67108864 |                                         2
       134217728 |                                         0

  runq rest
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@                                        649
           16384 |@@@@                                     1953
           32768 |@                                        539
           65536 |@@@@@                                    2369
          131072 |@@@@                                     1904
          262144 |@                                        471
          524288 |                                         131
         1048576 |@                                        458
         2097152 |@                                        443
         4194304 |@                                        547
         8388608 |@                                        632
        16777216 |@@                                       984
        33554432 |@@@                                      1606
        67108864 |@@@@@@@@                                 3894
       134217728 |@                                        511
       268435456 |@                                        489
       536870912 |@                                        445
      1073741824 |@                                        475
      2147483648 |@                                        314
      4294967296 |                                         188
      8589934592 |                                         136
     17179869184 |                                         100
     34359738368 |                                         81
     68719476736 |                                         39
    137438953472 |                                         3
    274877906944 |                                         0

  oncpu cpuburner
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         13
            8192 |                                         311
           16384 |@@                                       2369
           32768 |@                                        853
           65536 |@                                        2302
          131072 |@                                        1302
          262144 |                                         518
          524288 |@                                        872
         1048576 |@                                        1172
         2097152 |@                                        1435
         4194304 |@@                                       2706
         8388608 |@@@@@@@@@@@@@@@@@@@                      29669
        16777216 |@@                                       2733
        33554432 |@@                                       3910
        67108864 |@@@@@@                                   9991
       134217728 |@                                        1800
       268435456 |@                                        840
       536870912 |                                         200
      1073741824 |                                         15
      2147483648 |                                         7
      4294967296 |                                         1
      8589934592 |                                         0

  oncpu rest
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@                                        419
            2048 |@@@@@@@@@                                5730
            4096 |@                                        778
            8192 |@@@@@                                    3233
           16384 |@@                                       1400
           32768 |@@@@                                     2739
           65536 |@@@@@@@@@@@@                             7550
          131072 |@@@                                      1720
          262144 |                                         151
          524288 |                                         112
         1048576 |@@@                                      1688
         2097152 |                                         174
         4194304 |                                         86
         8388608 |@                                        411
        16777216 |                                         7
        33554432 |                                         3
        67108864 |                                         0

-- 
Mateusz Guzik