Re: Periodic rant about SCHED_ULE
Date: Sat, 25 Mar 2023 01:46:43 UTC
On 3/23/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 3/22/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> On 3/22/23, Steve Kargl <sgk@troutmask.apl.washington.edu> wrote:
>>> On Wed, Mar 22, 2023 at 07:31:57PM +0100, Matthias Andree wrote:
>>>>
>>>> Yes, there are reports that FreeBSD is not responsive by default - but
>>>> this
>>>> may make it get overall better throughput at the expense of
>>>> responsiveness,
>>>> because it might be doing fewer context switches. So just complaining
>>>> about
>>>> a longer buildworld without seeing how much dnetc did in the same
>>>> wallclock
>>>> time period is useless. Periodic rant's don't fix this lack of
>>>> information.
>>>>
>>>
>>> I reported the issue with ULE some 15 to 20 years ago.
>>> I gave up reporting the issue. The individuals with the
>>> requisite skills to hack on ULE did not; and yes, I lack
>>> those skills. The path of least resistance is to use
>>> 4BSD.
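
For anyone who wants to try the same comparison: the scheduler is a
kernel-config choice, so switching to 4BSD means rebuilding the kernel.
Roughly, in a custom config cloned from GENERIC (the comments here are
illustrative, the option names are the stock ones):

#options 	SCHED_ULE		# ULE scheduler
options 	SCHED_4BSD		# 4BSD scheduler

then build/install the kernel, reboot, and check which scheduler is
active with "sysctl kern.sched.name".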
>>>
>>> % cat a.f90
>>> !
>>> ! Silly numerically intensive computation.
>>> !
>>> program foo
>>>    implicit none
>>>    integer, parameter :: m = 200, n = 1000, dp = kind(1.d0)
>>>    integer i
>>>    real(dp) x
>>>    real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
>>>    call random_init(.true., .true.)
>>>    allocate(a(n,n), b(n,n))
>>>    do i = 1, m
>>>       call random_number(a)
>>>       call random_number(b)
>>>       c = matmul(a,b)
>>>       x = sum(c)
>>>       if (x < 0) stop 'Whoops'
>>>    end do
>>> end program foo
>>> % gfortran11 -o z -O3 -march=native a.f90
>>> % time ./z
>>> 42.16 real 42.04 user 0.09 sys
>>> % cat foo
>>> #! /bin/csh
>>> #
>>> # Launch NCPU+1 images with a 1 second delay
>>> #
>>> foreach i (1 2 3 4 5 6 7 8 9)
>>>    ./z &
>>>    sleep 1
>>> end
>>> % ./foo
>>>
>>> In another xterm, you can watch the 9 images.
>>>
>>> % top
>>> last pid:  1709;  load averages:  4.90,  1.61,  0.79   up 0+00:56:46  11:43:01
>>> 74 processes: 10 running, 64 sleeping
>>> CPU: 99.9% user, 0.0% nice, 0.1% system, 0.0% interrupt, 0.0% idle
>>> Mem: 369M Active, 187M Inact, 240K Laundry, 889M Wired, 546M Buf, 14G Free
>>> Swap: 16G Total, 16G Free
>>>
>>> PID USERNAME THR PRI NICE SIZE RES STATE C TIME CPU COMMAND
>>> 1699 kargl 1 56 0 68M 35M RUN 3 0:41 92.60% z
>>> 1701 kargl 1 56 0 68M 35M RUN 0 0:41 92.33% z
>>> 1689 kargl 1 56 0 68M 35M CPU5 5 0:47 91.63% z
>>> 1691 kargl 1 56 0 68M 35M CPU0 0 0:45 89.91% z
>>> 1695 kargl 1 56 0 68M 35M CPU2 2 0:43 88.56% z
>>> 1697 kargl 1 56 0 68M 35M CPU6 6 0:42 88.48% z
>>> 1705 kargl 1 55 0 68M 35M CPU1 1 0:39 88.12% z
>>> 1703 kargl 1 56 0 68M 35M CPU4 4 0:39 87.86% z
>>> 1693 kargl 1 56 0 68M 35M CPU7 7 0:45 78.12% z
>>>
>>> With 4BSD, you see the ./z's with 80% or greater CPU. All the ./z's
>>> exit after 55-ish seconds. If you try this experiment on ULE, you'll
>>> get NCPU-1 ./z's with nearly 99% CPU and 2 ./z's with something like
>>> 45-ish% as the two images ping-pong on one cpu. Back when I was
>>> testing ULE vs 4BSD, this was/is due to ULE's cpu affinity, where
>>> processes never migrate to another cpu. Admittedly, this was several
>>> years ago. Maybe ULE has gotten better, but George's rant seems to
>>> suggest otherwise.
>>>
>>
>> While I have not tried openmpi yet, I can confirm there is a problem
>> here, but the situation is not as clear-cut as one might think.
>>
>> sched_4bsd *round robins* all workers across all CPUs, which comes
>> with a performance *hit* compared to ule if the number of workers is
>> <= the CPU count -- here ule maintains affinity, avoiding cache
>> busting. If you slap in $cpu_count + 1 workers, 4bsd keeps the round
>> robin, equally penalizing everyone, while ule mostly penalizes a
>> select victim. By the end of it, you get lower total cpu time spent,
>> but higher total real time. See below for an example.
>>
>> Two massive problems with 4bsd, apart from the mandatory round robin,
>> which also happens to help by accident:
>> 1. it has one *global* lock, meaning the scheduler itself just does
>> not scale, and this is visible at modest contemporary scales
>> 2. it does not understand topology -- no accounting is done for HT or
>> NUMA, but I concede the latter won't be a factor for most people
>>
>> That said, ULE definitely has performance bugs which need to be fixed.
>> At least for the case below, 4BSD just "lucked" into sucking less
>> simply because it is so primitive.
>>
>> I wrote a cpu burning program (memset 1 MB in a loop, with enough
>> iterations to take ~20 seconds on its own).
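
For reference, a minimal sketch of the kind of burner meant above; the
iteration count, the names and the volatile sink are illustrative, not
the exact program used:

/* cpuburner sketch: memset a 1 MB buffer in a loop, fixed iteration count. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFSZ	(1024 * 1024)

static volatile unsigned char sink;	/* keeps the stores from being optimized away */

int
main(int argc, char **argv)
{
	/* default picked by hand so one instance runs ~20 seconds; adjust as needed */
	long iters = (argc > 1) ? atol(argv[1]) : 100000;
	unsigned char *buf = malloc(BUFSZ);
	long i;

	if (buf == NULL)
		return (1);
	for (i = 0; i < iters; i++) {
		memset(buf, (int)(i & 0xff), BUFSZ);
		sink = buf[i & (BUFSZ - 1)];
	}
	printf("done, sink=%u\n", (unsigned)sink);
	free(buf);
	return (0);
}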
>>
>> I booted an 8 core bhyve vm, where I made sure to cpuset it to 8
>> distinct cores.
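
(The host-side pinning can be done with cpuset(1), along the lines of
"cpuset -l 0-7 -p <bhyve pid>"; the cpu list and the pid placeholder
are of course illustrative.)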
>>
>> The test runs *9* workers; here is a sample run:
>>
>> 4bsd:
>> 23.18 real 20.81 user 0.00 sys
>> 23.26 real 20.81 user 0.00 sys
>> 23.30 real 20.81 user 0.00 sys
>> 23.34 real 20.82 user 0.00 sys
>> 23.41 real 20.81 user 0.00 sys
>> 23.41 real 20.80 user 0.00 sys
>> 23.42 real 20.80 user 0.00 sys
>> 23.53 real 20.81 user 0.00 sys
>> 23.60 real 20.80 user 0.00 sys
>> 187.31s user 0.02s system 793% cpu 23.606 total
>>
>> ule:
>> 20.67 real 20.04 user 0.00 sys
>> 20.97 real 20.00 user 0.00 sys
>> 21.45 real 20.29 user 0.00 sys
>> 21.51 real 20.22 user 0.00 sys
>> 22.77 real 20.04 user 0.00 sys
>> 22.78 real 20.26 user 0.00 sys
>> 23.42 real 20.04 user 0.00 sys
>> 24.07 real 20.30 user 0.00 sys
>> 24.46 real 20.16 user 0.00 sys
>> 181.41s user 0.07s system 741% cpu 24.465 total
>>
>> It reliably uses 187s of user time on 4BSD and 181s on ULE. At the
>> same time there is reliably a massive imbalance in total real time
>> between the fastest and slowest workers, *and* ULE reliably uses more
>> total real time to finish the entire thing.
>>
>> iow this is a tradeoff, but most likely a bad one
>>
>
> So I also ran the following setup: an 8 core vm doing -j 8
> buildkernel, while 8 nice -n 20 processes are cpu-bound. After the
> build ends, the workers report how many ops they did in that time.
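
Again for reference, a sketch of the kind of nice -n 20 hogger used in
this run: it burns cpu until it gets SIGTERM, then prints its pid and
op count. The 1 MB memset as the unit of work and all names are
illustrative, not the exact program:

/* cpu hogger sketch: start under nice -n 20, kill with SIGTERM when the build ends. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ	(1024 * 1024)

static volatile sig_atomic_t stop;
static volatile unsigned char sink;	/* defeats dead-store elimination */

static void
onterm(int sig)
{
	(void)sig;
	stop = 1;
}

int
main(void)
{
	unsigned char *buf = malloc(BUFSZ);
	unsigned long ops = 0;

	if (buf == NULL)
		return (1);
	signal(SIGTERM, onterm);
	while (!stop) {
		memset(buf, (int)(ops & 0xff), BUFSZ);	/* one "op" */
		sink = buf[ops & (BUFSZ - 1)];
		ops++;
	}
	printf("%d: %lu\n", (int)getpid(), ops);
	free(buf);
	return (0);
}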
>
> ule is way off the reservation here.
>
> unimpeded build takes: 441.39 real 3135.63 user 266.92 sys, similar
> on both schedulers
>
> with cpu hoggers:
> 4bsd: 657.22 real 3152.02 user 253.30 sys [+49%]
> ule: 4427.69 real 3225.33 user 194.86 sys [+903%]
>
> ule spends less time in the kernel, but the real time blows up -- over
> 10x the baseline.
> this is clearly a total non-starter.
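
(The percentages above are the real-time overhead of the build relative
to the unimpeded 441.39s baseline: 657.22 / 441.39 is about 1.49, i.e.
+49%, and 4427.69 / 441.39 is about 10.0, i.e. +903%.)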
>
> full stats:
>
> 4bsd:
> hogger pid/ops
> 58315: 5546013
> 58322: 5557294
> 58321: 5545052
> 58313: 5546347
> 58318: 5537874
> 58317: 5555303
> 58323: 5545116
> 58320: 5548530
>
> runtimes:
>
> 657.23 real 230.02 user 0.02 sys
> 657.23 real 229.83 user 0.00 sys
> 657.23 real 230.50 user 0.00 sys
> 657.23 real 230.53 user 0.00 sys
> 657.23 real 230.14 user 0.01 sys
> 657.23 real 230.19 user 0.00 sys
> 657.23 real 230.09 user 0.00 sys
> 657.23 real 230.10 user 0.03 sys
>
> kernel build:
> 657.22 real 3152.02 user 253.30 sys
>
> ule:
> hogger pid/ops
> 77794: 95916836
> 77792: 95909794
> 77789: 96454064
> 77796: 95813778
> 77791: 96728121
> 77795: 96678291
> 77798: 97258060
> 77797: 96347984
>
> 4427.70 real 4001.94 user 0.10 sys
> 4427.70 real 3980.68 user 0.16 sys
> 4427.70 real 3973.96 user 0.10 sys
> 4427.70 real 3980.11 user 0.13 sys
> 4427.70 real 4012.32 user 0.07 sys
> 4427.70 real 4008.79 user 0.12 sys
> 4427.70 real 4034.77 user 0.09 sys
> 4427.70 real 3995.40 user 0.08 sys
>
> kernel build:
> 4427.69 real 3225.33 user 194.86 sys
>
added a probe to runq_add*:
diff --git a/sys/kern/kern_switch.c b/sys/kern/kern_switch.c
index aee0c9cbd1ae..db73226547c5 100644
--- a/sys/kern/kern_switch.c
+++ b/sys/kern/kern_switch.c
@@ -357,6 +357,9 @@ runq_setbit(struct runq *rq, int pri)
 	rqb->rqb_bits[RQB_WORD(pri)] |= RQB_BIT(pri);
 }
 
+SDT_PROVIDER_DECLARE(sched);
+SDT_PROBE_DEFINE3(sched, , , runq_add, "struct thread *", "int", "int");
+
 /*
  * Add the thread to the queue specified by its priority, and set the
  * corresponding status bit.
@@ -378,6 +381,7 @@ runq_add(struct runq *rq, struct thread *td, int flags)
 	} else {
 		TAILQ_INSERT_TAIL(rqh, td, td_runq);
 	}
+	SDT_PROBE3(sched, , , runq_add, td, flags, pri);
 }
 
 void
@@ -396,6 +400,7 @@ runq_add_pri(struct runq *rq, struct thread *td, u_char pri, int flags)
 	} else {
 		TAILQ_INSERT_TAIL(rqh, td, td_runq);
 	}
+	SDT_PROBE3(sched, , , runq_add, td, flags, pri);
 }
 
 /*
  * Return true if there are runnable processes of any priority on the run
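
Once a kernel with this patch is running, the new probe can be
sanity-checked before doing anything heavier, e.g.:

dtrace -l -n 'sched:::runq_add'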
and used the following dtrace invocation:

dtrace -b 16M -x aggsize=32M -x dynvarsize=32M -n '
sched:::runq_add
/args[0]->td_ucred->cr_uid == 2000/
{
	self->runq_t = timestamp;
}

sched:::on-cpu
/self->runq_t/
{
	@runqlat["runq", execname == "cpuburner-long" ? "cpuburner" : "rest"] =
	    quantize(timestamp - self->runq_t);
	self->runq_t = 0;
}

sched:::on-cpu
/curthread->td_ucred->cr_uid == 2000/
{
	self->oncpu_t = timestamp;
}

sched:::off-cpu
/self->oncpu_t/
{
	@oncpu["oncpu", execname == "cpuburner-long" ? "cpuburner" : "rest"] =
	    quantize(timestamp - self->oncpu_t);
}

tick-300s
{
	exit(0);
}'
Capped at 5 minutes because dtrace starts running into aggregation drops.

Key takeaway: 4bsd lets the cpu hog stay on cpu for longer than ule,
but when it is taken off, it waits a long time to get back on. In
contrast, with ule the hog gets back on very quickly and it is everyone
else who waits big time.

Times are in nanoseconds.
4bsd:
runq cpuburner
value ------------- Distribution ------------- count
2048 | 0
4096 | 2
8192 |@@@@@@@@@@ 4343
16384 |@@@@@ 2159
32768 |@ 363
65536 |@@@ 1359
131072 | 215
262144 | 129
524288 | 101
1048576 | 132
2097152 | 185
4194304 |@ 390
8388608 |@ 318
16777216 |@ 398
33554432 |@@ 838
67108864 |@@@@@@@ 3025
134217728 |@@@@@ 2474
268435456 |@@@ 1552
536870912 | 153
1073741824 | 0
runq rest
value ------------- Distribution ------------- count
2048 | 0
4096 | 637
8192 |@@@@@@@@@@@@@@@@@@@@@@@@ 364832
16384 |@@@ 52982
32768 |@ 11613
65536 |@@ 34286
131072 |@@@ 39734
262144 |@@ 23261
524288 |@ 21985
1048576 |@ 18999
2097152 |@ 10789
4194304 | 6239
8388608 | 4725
16777216 | 4598
33554432 | 4050
67108864 | 5857
134217728 | 3979
268435456 | 2024
536870912 | 2011
1073741824 | 1105
2147483648 | 841
4294967296 | 519
8589934592 | 372
17179869184 | 133
34359738368 | 37
68719476736 | 30
137438953472 | 1
274877906944 | 0
oncpu cpuburner
value ------------- Distribution ------------- count
2048 | 0
4096 | 20
8192 | 10
16384 | 19
32768 | 161
65536 | 137
131072 | 77
262144 | 104
524288 | 147
1048576 |@ 301
2097152 |@ 482
4194304 |@@@ 1178
8388608 |@@@@@@@@@@@@@@@ 6971
16777216 |@ 474
33554432 |@ 669
67108864 |@@@@@@@@@@@@@@@@ 7299
134217728 | 14
268435456 | 38
536870912 | 38
1073741824 | 2
2147483648 | 2
4294967296 | 0
oncpu rest
value ------------- Distribution ------------- count
512 | 0
1024 | 102
2048 |@@@@@@@@@@@ 146555
4096 |@@ 23373
8192 |@@@ 45165
16384 |@ 14531
32768 |@@@@@ 67664
65536 |@@@@@@ 80155
131072 |@@@@@ 64609
262144 |@@ 21509
524288 |@@ 23393
1048576 |@@ 21058
2097152 |@ 7854
4194304 |@ 8788
8388608 |@ 20242
16777216 | 2352
33554432 | 2084
67108864 | 4388
134217728 | 1123
268435456 | 755
536870912 | 135
1073741824 | 2
2147483648 | 0
ule:
runq cpuburner
value ------------- Distribution ------------- count
2048 | 0
4096 | 2
8192 |@@@@@@@@@@@@@@@@@@@@@@@@ 37193
16384 |@@@@@ 8612
32768 |@ 1481
65536 |@@@@@ 8430
131072 |@@ 3102
262144 |@ 938
524288 | 457
1048576 |@ 2063
2097152 | 257
4194304 | 41
8388608 | 428
16777216 | 12
33554432 | 1
67108864 | 2
134217728 | 0
runq rest
value ------------- Distribution ------------- count
4096 | 0
8192 |@ 649
16384 |@@@@ 1953
32768 |@ 539
65536 |@@@@@ 2369
131072 |@@@@ 1904
262144 |@ 471
524288 | 131
1048576 |@ 458
2097152 |@ 443
4194304 |@ 547
8388608 |@ 632
16777216 |@@ 984
33554432 |@@@ 1606
67108864 |@@@@@@@@ 3894
134217728 |@ 511
268435456 |@ 489
536870912 |@ 445
1073741824 |@ 475
2147483648 |@ 314
4294967296 | 188
8589934592 | 136
17179869184 | 100
34359738368 | 81
68719476736 | 39
137438953472 | 3
274877906944 | 0
oncpu cpuburner
value ------------- Distribution ------------- count
2048 | 0
4096 | 13
8192 | 311
16384 |@@ 2369
32768 |@ 853
65536 |@ 2302
131072 |@ 1302
262144 | 518
524288 |@ 872
1048576 |@ 1172
2097152 |@ 1435
4194304 |@@ 2706
8388608 |@@@@@@@@@@@@@@@@@@@ 29669
16777216 |@@ 2733
33554432 |@@ 3910
67108864 |@@@@@@ 9991
134217728 |@ 1800
268435456 |@ 840
536870912 | 200
1073741824 | 15
2147483648 | 7
4294967296 | 1
8589934592 | 0
oncpu rest
value ------------- Distribution ------------- count
512 | 0
1024 |@ 419
2048 |@@@@@@@@@ 5730
4096 |@ 778
8192 |@@@@@ 3233
16384 |@@ 1400
32768 |@@@@ 2739
65536 |@@@@@@@@@@@@ 7550
131072 |@@@ 1720
262144 | 151
524288 | 112
1048576 |@@@ 1688
2097152 | 174
4194304 | 86
8388608 |@ 411
16777216 | 7
33554432 | 3
67108864 | 0
--
Mateusz Guzik <mjguzik gmail.com>