Re: Periodic rant about SCHED_ULE

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Sat, 25 Mar 2023 01:46:43 UTC
On 3/23/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 3/22/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> On 3/22/23, Steve Kargl <sgk@troutmask.apl.washington.edu> wrote:
>>> On Wed, Mar 22, 2023 at 07:31:57PM +0100, Matthias Andree wrote:
>>>>
>>>> Yes, there are reports that FreeBSD is not responsive by default - but
>>>> this may make it get overall better throughput at the expense of
>>>> responsiveness, because it might be doing fewer context switches.  So
>>>> just complaining about a longer buildworld without seeing how much
>>>> dnetc did in the same wallclock time period is useless.  Periodic
>>>> rants don't fix this lack of information.
>>>>
>>>
>>> I reported the issue with ULE some 15 to 20 years ago.
>>> I gave up reporting the issue.  The individuals with the
>>> requisite skills to hack on ULE did not; and yes, I lack
>>> those skills.  The path of least resistance is to use
>>> 4BSD.
>>>
>>> %  cat a.f90
>>> !
>>> ! Silly numerically intensive computation.
>>> !
>>> program foo
>>>    implicit none
>>>    integer, parameter :: m = 200, n = 1000, dp = kind(1.d0)
>>>    integer i
>>>    real(dp) x
>>>    real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
>>>    call random_init(.true., .true.)
>>>    allocate(a(n,n), b(n,n))
>>>    do i = 1, m
>>>       call random_number(a)
>>>       call random_number(b)
>>>       c = matmul(a,b)
>>>       x = sum(c)
>>>       if (x < 0) stop 'Whoops'
>>>    end do
>>> end program foo
>>> % gfortran11 -o z -O3 -march=native a.f90
>>> % time ./z
>>>        42.16 real        42.04 user         0.09 sys
>>> % cat foo
>>> #! /bin/csh
>>> #
>>> # Launch NCPU+1 images with a 1 second delay
>>> #
>>> foreach i (1 2 3 4 5 6 7 8 9)
>>>    ./z &
>>>    sleep 1
>>> end
>>> % ./foo
>>>
>>> In another xterm, you can watch the 9 images.
>>>
>>> % top
>>> last pid:  1709;  load averages:  4.90,  1.61,  0.79   up 0+00:56:46  11:43:01
>>> 74 processes:  10 running, 64 sleeping
>>> CPU: 99.9% user,  0.0% nice,  0.1% system,  0.0% interrupt,  0.0% idle
>>> Mem: 369M Active, 187M Inact, 240K Laundry, 889M Wired, 546M Buf, 14G Free
>>> Swap: 16G Total, 16G Free
>>>
>>>   PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME     CPU COMMAND
>>>  1699 kargl         1  56    0    68M    35M RUN      3   0:41  92.60% z
>>>  1701 kargl         1  56    0    68M    35M RUN      0   0:41  92.33% z
>>>  1689 kargl         1  56    0    68M    35M CPU5     5   0:47  91.63% z
>>>  1691 kargl         1  56    0    68M    35M CPU0     0   0:45  89.91% z
>>>  1695 kargl         1  56    0    68M    35M CPU2     2   0:43  88.56% z
>>>  1697 kargl         1  56    0    68M    35M CPU6     6   0:42  88.48% z
>>>  1705 kargl         1  55    0    68M    35M CPU1     1   0:39  88.12% z
>>>  1703 kargl         1  56    0    68M    35M CPU4     4   0:39  87.86% z
>>>  1693 kargl         1  56    0    68M    35M CPU7     7   0:45  78.12% z
>>>
>>> With 4BSD, you see the ./z's with 80% or greater CPU.  All the ./z's
>>> exit after 55-ish seconds.  If you try this experiment on ULE, you'll
>>> get NCPU-1 ./z's with nearly 99% CPU and 2 ./z's with something like
>>> 45-ish% as the two images ping-pong on one cpu.  Back when I was
>>> testing ULE vs 4BSD, this was/is due to ULE's cpu affinity, where
>>> processes never migrate to another cpu.  Admittedly, this was several
>>> years ago.  Maybe ULE has gotten better, but George's rant seems to
>>> suggest otherwise.
>>>
>>
>> While I have not tried openmpi yet, I can confirm there is a problem
>> here, but the situation is not as clear-cut as one might think.
>>
>> sched_4bsd *round robins* all workers across all CPUs, which comes with
>> a performance *hit* compared to ule if the number of workers is <= the
>> CPU count -- here ule maintains affinity, avoiding cache busting.  If
>> you slap in $cpu_count + 1 workers, 4bsd keeps round-robining and
>> penalizes everyone equally, while ule mostly penalizes a select victim.
>> By the end of it, you get lower total cpu time spent, but higher total
>> real time.  See below for an example.
>>
>> Two massive problems with 4bsd, apart from mandatory round robin which
>> also happens to help by accident:
>> 1. it has one *global* lock, meaning the scheduler itself just does
>> not scale and this is visible at modest contemporary scales
>> 2. it does not understand topology -- no accounting is done for HT or
>> NUMA, but I concede the latter won't be a factor for most people
>>
>> That said, ULE definitely has performance bugs which need to be fixed.
>> At least for the case below, 4BSD just "lucked" into sucking less
>> simply because it is so primitive.
>>
>> I wrote a cpu burning program (memset 1 MB in a loop, with enough
>> iterations to take ~20 seconds on its own).
>>
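>> Roughly what the burner looks like -- a minimal sketch, not the exact
>> source; the only knob that matters is the iteration count, picked so
>> that a solo run takes ~20 seconds on the machine at hand:
>>
>> #include <stdlib.h>
>> #include <string.h>
>>
>> int
>> main(int argc, char **argv)
>> {
>>         /*
>>          * Burn CPU: memset a 1 MB buffer in a loop.  The iteration
>>          * count comes from the command line; tune it so that a solo
>>          * run takes ~20 seconds.
>>          */
>>         const size_t bufsz = 1024 * 1024;
>>         long iters, i;
>>         char *buf;
>>
>>         iters = (argc > 1) ? atol(argv[1]) : 100000;
>>         buf = malloc(bufsz);
>>         if (buf == NULL)
>>                 return (1);
>>         for (i = 0; i < iters; i++) {
>>                 memset(buf, (int)(i & 0xff), bufsz);
>>                 /* Compiler barrier so the memset is not optimized out. */
>>                 __asm __volatile("" : : "r" (buf) : "memory");
>>         }
>>         free(buf);
>>         return (0);
>> }
>>
>> Each worker is run under time(1), which produces the per-worker lines
>> below.
>>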
>> I booted an 8-core bhyve vm, where I made sure to cpuset it to 8
>> distinct cores.
>>
>> The test runs *9* workers; here is a sample run:
>>
>> 4bsd:
>>        23.18 real        20.81 user         0.00 sys
>>        23.26 real        20.81 user         0.00 sys
>>        23.30 real        20.81 user         0.00 sys
>>        23.34 real        20.82 user         0.00 sys
>>        23.41 real        20.81 user         0.00 sys
>>        23.41 real        20.80 user         0.00 sys
>>        23.42 real        20.80 user         0.00 sys
>>        23.53 real        20.81 user         0.00 sys
>>        23.60 real        20.80 user         0.00 sys
>> 187.31s user 0.02s system 793% cpu 23.606 total
>>
>> ule:
>>        20.67 real        20.04 user         0.00 sys
>>        20.97 real        20.00 user         0.00 sys
>>        21.45 real        20.29 user         0.00 sys
>>        21.51 real        20.22 user         0.00 sys
>>        22.77 real        20.04 user         0.00 sys
>>        22.78 real        20.26 user         0.00 sys
>>        23.42 real        20.04 user         0.00 sys
>>        24.07 real        20.30 user         0.00 sys
>>        24.46 real        20.16 user         0.00 sys
>> 181.41s user 0.07s system 741% cpu 24.465 total
>>
>> It reliably uses 187s of user time on 4BSD and 181s on ULE.  At the
>> same time, ULE reliably shows a massive imbalance in total real time
>> between the fastest and slowest workers *and* reliably takes more
>> total real time to finish the entire thing.
>>
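>> A back-of-the-envelope check: with 9 workers on 8 CPUs a perfectly
>> fair split gives each worker 8/9 of a CPU, so ~20.8s of CPU time
>> stretches to about 20.8 * 9/8 ~= 23.4s of real time -- almost exactly
>> what every worker gets under 4BSD.  ULE instead gives most workers a
>> whole CPU (~21s real) and makes one or two victims absorb most of the
>> overcommit (~24.5s).
>>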
>> IOW this is a tradeoff, but most likely a bad one.
>>
>
> So I also ran the following setup: an 8-core vm doing -j 8 buildkernel
> while 8 nice -n 20 cpu-bound processes run.  After the build ends, the
> workers report how many ops they managed in that time.
>
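> The hoggers are a variant of the burner sketched earlier -- again just
> a rough sketch, with the details (signal choice, output format)
> illustrative: the same 1 MB memset, but each runs until killed and
> then reports its pid and op count.
>
> #include <signal.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> static volatile sig_atomic_t done;
>
> static void
> handler(int sig)
> {
>         (void)sig;
>         done = 1;
> }
>
> int
> main(void)
> {
>         /* One "op" is a 1 MB memset; count ops until told to stop. */
>         const size_t bufsz = 1024 * 1024;
>         unsigned long ops;
>         char *buf;
>
>         buf = malloc(bufsz);
>         if (buf == NULL)
>                 return (1);
>         signal(SIGTERM, handler);
>         for (ops = 0; !done; ops++) {
>                 memset(buf, (int)(ops & 0xff), bufsz);
>                 /* Compiler barrier so the memset is not optimized out. */
>                 __asm __volatile("" : : "r" (buf) : "memory");
>         }
>         printf("%d: %lu\n", (int)getpid(), ops);
>         free(buf);
>         return (0);
> }
>
> These are started with nice -n 20 before the build and killed when it
> finishes; the pid/ops lines below come from that report.
>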
> ule is way off the reservation here.
>
> an unimpeded build takes: 441.39 real, 3135.63 user, 266.92 sys --
> similar on both schedulers
>
> with cpu hoggers:
> 4bsd:       657.22 real      3152.02 user       253.30 sys [+49%]
> ule:        4427.69 real      3225.33 user       194.86 sys [+903%]
>
> ule spends less time in the kernel, but real time blows up to over 10x
> the baseline.  This is clearly a total non-starter.
>
> full stats:
>
> 4bsd:
> hogger pid/ops
> 58315: 5546013
> 58322: 5557294
> 58321: 5545052
> 58313: 5546347
> 58318: 5537874
> 58317: 5555303
> 58323: 5545116
> 58320: 5548530
>
> runtimes:
>
>       657.23 real       230.02 user         0.02 sys
>       657.23 real       229.83 user         0.00 sys
>       657.23 real       230.50 user         0.00 sys
>       657.23 real       230.53 user         0.00 sys
>       657.23 real       230.14 user         0.01 sys
>       657.23 real       230.19 user         0.00 sys
>       657.23 real       230.09 user         0.00 sys
>       657.23 real       230.10 user         0.03 sys
>
> kernel build:
>       657.22 real      3152.02 user       253.30 sys
>
> ule:
> hogger pid/ops
> 77794: 95916836
> 77792: 95909794
> 77789: 96454064
> 77796: 95813778
> 77791: 96728121
> 77795: 96678291
> 77798: 97258060
> 77797: 96347984
>
>      4427.70 real      4001.94 user         0.10 sys
>      4427.70 real      3980.68 user         0.16 sys
>      4427.70 real      3973.96 user         0.10 sys
>      4427.70 real      3980.11 user         0.13 sys
>      4427.70 real      4012.32 user         0.07 sys
>      4427.70 real      4008.79 user         0.12 sys
>      4427.70 real      4034.77 user         0.09 sys
>      4427.70 real      3995.40 user         0.08 sys
>
> kernel build:
>      4427.69 real      3225.33 user       194.86 sys
>

I added a probe to runq_add*:
diff --git a/sys/kern/kern_switch.c b/sys/kern/kern_switch.c
index aee0c9cbd1ae..db73226547c5 100644
--- a/sys/kern/kern_switch.c
+++ b/sys/kern/kern_switch.c
@@ -357,6 +357,9 @@ runq_setbit(struct runq *rq, int pri)
        rqb->rqb_bits[RQB_WORD(pri)] |= RQB_BIT(pri);
 }

+SDT_PROVIDER_DECLARE(sched);
+SDT_PROBE_DEFINE3(sched, , , runq_add, "struct thread *", "int", "int");
+
 /*
  * Add the thread to the queue specified by its priority, and set the
  * corresponding status bit.
@@ -378,6 +381,7 @@ runq_add(struct runq *rq, struct thread *td, int flags)
        } else {
                TAILQ_INSERT_TAIL(rqh, td, td_runq);
        }
+       SDT_PROBE3(sched, , , runq_add, td, flags, pri);
 }

 void
@@ -396,6 +400,7 @@ runq_add_pri(struct runq *rq, struct thread *td, u_char pri, int flags)
        } else {
                TAILQ_INSERT_TAIL(rqh, td, td_runq);
        }
+       SDT_PROBE3(sched, , , runq_add, td, flags, pri);
 }
 /*
  * Return true if there are runnable processes of any priority on the run

and used the following dtrace invocation:

dtrace -b 16M -x aggsize=32M -x dynvarsize=32M -n '
sched:::runq_add
/args[0]->td_ucred->cr_uid == 2000/
{
        self->runq_t = timestamp;
}

sched:::on-cpu
/self->runq_t/
{
        @runqlat["runq", execname == "cpuburner-long" ? "cpuburner" : "rest"] =
            quantize(timestamp - self->runq_t);
        self->runq_t = 0;
}

sched:::on-cpu
/curthread->td_ucred->cr_uid == 2000/
{
        self->oncpu_t = timestamp;
}

sched:::off-cpu
/self->oncpu_t/
{
        @oncpu["oncpu", execname == "cpuburner-long" ? "cpuburner" : "rest"] =
            quantize(timestamp - self->oncpu_t);
}

tick-300s
{
        exit(0);
}'

Capped at 5 minutes because dtrace starts running into aggregation drops.

Key takeaway: 4bsd lets the cpu hog stay on cpu for longer than ule,
but once the hog is taken off, it waits a long time to get back on.  In
contrast, with ule the hog gets back on very quickly and it is everyone
else that ends up waiting big time.

times in nanoseconds

4bsd:
  runq                                                cpuburner
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         2
            8192 |@@@@@@@@@@                               4343
           16384 |@@@@@                                    2159
           32768 |@                                        363
           65536 |@@@                                      1359
          131072 |                                         215
          262144 |                                         129
          524288 |                                         101
         1048576 |                                         132
         2097152 |                                         185
         4194304 |@                                        390
         8388608 |@                                        318
        16777216 |@                                        398
        33554432 |@@                                       838
        67108864 |@@@@@@@                                  3025
       134217728 |@@@@@                                    2474
       268435456 |@@@                                      1552
       536870912 |                                         153
      1073741824 |                                         0

  runq                                                rest
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         637
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@                 364832
           16384 |@@@                                      52982
           32768 |@                                        11613
           65536 |@@                                       34286
          131072 |@@@                                      39734
          262144 |@@                                       23261
          524288 |@                                        21985
         1048576 |@                                        18999
         2097152 |@                                        10789
         4194304 |                                         6239
         8388608 |                                         4725
        16777216 |                                         4598
        33554432 |                                         4050
        67108864 |                                         5857
       134217728 |                                         3979
       268435456 |                                         2024
       536870912 |                                         2011
      1073741824 |                                         1105
      2147483648 |                                         841
      4294967296 |                                         519
      8589934592 |                                         372
     17179869184 |                                         133
     34359738368 |                                         37
     68719476736 |                                         30
    137438953472 |                                         1
    274877906944 |                                         0

  oncpu                                               cpuburner
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         20
            8192 |                                         10
           16384 |                                         19
           32768 |                                         161
           65536 |                                         137
          131072 |                                         77
          262144 |                                         104
          524288 |                                         147
         1048576 |@                                        301
         2097152 |@                                        482
         4194304 |@@@                                      1178
         8388608 |@@@@@@@@@@@@@@@                          6971
        16777216 |@                                        474
        33554432 |@                                        669
        67108864 |@@@@@@@@@@@@@@@@                         7299
       134217728 |                                         14
       268435456 |                                         38
       536870912 |                                         38
      1073741824 |                                         2
      2147483648 |                                         2
      4294967296 |                                         0

  oncpu                                               rest
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |                                         102
            2048 |@@@@@@@@@@@                              146555
            4096 |@@                                       23373
            8192 |@@@                                      45165
           16384 |@                                        14531
           32768 |@@@@@                                    67664
           65536 |@@@@@@                                   80155
          131072 |@@@@@                                    64609
          262144 |@@                                       21509
          524288 |@@                                       23393
         1048576 |@@                                       21058
         2097152 |@                                        7854
         4194304 |@                                        8788
         8388608 |@                                        20242
        16777216 |                                         2352
        33554432 |                                         2084
        67108864 |                                         4388
       134217728 |                                         1123
       268435456 |                                         755
       536870912 |                                         135
      1073741824 |                                         2
      2147483648 |                                         0

ule:
  runq                                                cpuburner
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         2
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@                 37193
           16384 |@@@@@                                    8612
           32768 |@                                        1481
           65536 |@@@@@                                    8430
          131072 |@@                                       3102
          262144 |@                                        938
          524288 |                                         457
         1048576 |@                                        2063
         2097152 |                                         257
         4194304 |                                         41
         8388608 |                                         428
        16777216 |                                         12
        33554432 |                                         1
        67108864 |                                         2
       134217728 |                                         0

  runq                                                rest
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@                                        649
           16384 |@@@@                                     1953
           32768 |@                                        539
           65536 |@@@@@                                    2369
          131072 |@@@@                                     1904
          262144 |@                                        471
          524288 |                                         131
         1048576 |@                                        458
         2097152 |@                                        443
         4194304 |@                                        547
         8388608 |@                                        632
        16777216 |@@                                       984
        33554432 |@@@                                      1606
        67108864 |@@@@@@@@                                 3894
       134217728 |@                                        511
       268435456 |@                                        489
       536870912 |@                                        445
      1073741824 |@                                        475
      2147483648 |@                                        314
      4294967296 |                                         188
      8589934592 |                                         136
     17179869184 |                                         100
     34359738368 |                                         81
     68719476736 |                                         39
    137438953472 |                                         3
    274877906944 |                                         0

  oncpu                                               cpuburner
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         13
            8192 |                                         311
           16384 |@@                                       2369
           32768 |@                                        853
           65536 |@                                        2302
          131072 |@                                        1302
          262144 |                                         518
          524288 |@                                        872
         1048576 |@                                        1172
         2097152 |@                                        1435
         4194304 |@@                                       2706
         8388608 |@@@@@@@@@@@@@@@@@@@                      29669
        16777216 |@@                                       2733
        33554432 |@@                                       3910
        67108864 |@@@@@@                                   9991
       134217728 |@                                        1800
       268435456 |@                                        840
       536870912 |                                         200
      1073741824 |                                         15
      2147483648 |                                         7
      4294967296 |                                         1
      8589934592 |                                         0

  oncpu                                               rest
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@                                        419
            2048 |@@@@@@@@@                                5730
            4096 |@                                        778
            8192 |@@@@@                                    3233
           16384 |@@                                       1400
           32768 |@@@@                                     2739
           65536 |@@@@@@@@@@@@                             7550
          131072 |@@@                                      1720
          262144 |                                         151
          524288 |                                         112
         1048576 |@@@                                      1688
         2097152 |                                         174
         4194304 |                                         86
         8388608 |@                                        411
        16777216 |                                         7
        33554432 |                                         3
        67108864 |                                         0

-- 
Mateusz Guzik <mjguzik gmail.com>