Re: Periodic rant about SCHED_ULE

From: Zaphod Beeblebrox <zbeeble_at_gmail.com>
Date: Tue, 13 Jul 2021 22:09:27 UTC
I opened https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=257160 regarding
the following:

SCHED_4BSD seems subject to a bit of rot at this point.  To wit, my 4-core
riscv64 platform recently showed this top output while doing a make -j4 of
my own code.  Note that each of the processes using more than 1000% CPU is
single-threaded.

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
  604 dgilbert      1  45    0   109M    66M CPU3     3   0:02 1039.89% c++
  605 dgilbert      1  45    0   109M    66M CPU1     1   0:02 1031.29% c++
  606 dgilbert      1  45    0   109M    66M RUN      2   0:02 1020.32% c++
  603 dgilbert      1  44    0   109M    66M CPU0     0   0:02 1011.41% c++
  854 root          1  40    0    17M  4764K select   1   3:04   0.17% tmux
  425 root          1  40    0    14M  4040K CPU2     2   0:03   0.15% top

As I said there, I don't believe that this is RISCV64 related --- it seems
to me that the data that top is pulling is either incorrect or top is
interpreting it incorrectly.  The WCPU value seems to asymptotically
approach 100%, but I'm not sure of that --- I can only watch it for so
long.  The same behaviour is seen if you launch (while true; do true; done)
& in the background.
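
Since a single-threaded process can use at most one CPU, any such row above
100% WCPU is suspect on its face.  A rough sketch for picking those rows out
of a saved top listing follows.  It assumes the column layout shown above
(THR in column 3, WCPU in column 11), and the function name
flag_impossible_wcpu is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read top output on stdin and print the PID of every
# single-threaded process whose WCPU exceeds 100%.
# Assumes the column layout shown above: PID=$1, THR=$3, WCPU=$11.
# The header line is skipped because its first field is not a number.
flag_impossible_wcpu() {
    awk '$1 ~ /^[0-9]+$/ && $3 == 1 && ($11 + 0) > 100 { print $1 }'
}
```

Fed the listing above, this prints the four c++ PIDs and skips tmux and top.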

But OTOH, if you are running SCHED_ULE, and you launch two of those while
true loops under nice -20 (niceness 20 in the NICE column below) for each
cpu ... then launch one at nice 0 ... you'll find that the nice 0 process
fails to get 100% cpu.  To my mind, this is a failure of the scheduler to
read my intentions of nice -20.  In fact, at times, the processor share of
the un-nice process will fall below some of the nice processes for a few
dozen samples at a time.  Here is a top displaying that brokenness...

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
36410 root          1  89    0    14M   796K RUN      3   0:18  54.31% bash
36370 root          1 106   20    14M   800K RUN      1   0:58  49.86% bash
36372 root          1 105   20    14M   800K CPU1     1   0:56  49.69% bash
36375 root          1 106   20    14M   800K RUN      0   0:57  46.37% bash
36373 root          1 103   20    14M   800K RUN      3   0:56  44.94% bash
36371 root          1 105   20    14M   800K CPU0     0   0:57  43.51% bash
36376 root          1 105   20    14M   800K RUN      2   0:59  38.76% bash
36369 root          1 104   20    14M   920K CPU2     2   0:57  37.61% bash
36374 root          1 104   20    14M   800K RUN      2   0:57  32.66% bash
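
For anyone wanting to repeat the experiment, here is a minimal sketch of the
setup described above: two busy loops at nice 20 per CPU plus one at nice 0.
The function name plan_spinners is invented here, and it only prints the
commands rather than running them, so the plan can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper: print the shell commands for the experiment.
# $1 is the number of CPUs; two nice-20 busy loops are emitted per CPU,
# followed by a single busy loop at the default nice value of 0.
plan_spinners() {
    ncpu=$1
    i=0
    while [ "$i" -lt $((ncpu * 2)) ]; do
        echo "nice -n 20 sh -c 'while true; do true; done' &"
        i=$((i + 1))
    done
    echo "sh -c 'while true; do true; done' &"
}
```

For a 4-CPU machine, plan_spinners 4 gives nine lines, matching the eight
niced loops and one un-niced loop in the listing.  Piping the output into sh
should start them for real, after which they have to be killed by hand.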

TBH, I think SCHED_ULE is a failure and the only reason more people don't
think so is that processors are now largely too fast for people to care.
Most people don't notice the scheduler because they almost never have more
tasks than processor threads, so even really dumb schedulers would work out
"OK" 98% of the time.

I know we don't have guiding principles for nice, but I would toss out the
+/- five rule for it --- that any process more than 5 nice levels lower
than a cpu-busy process shouldn't preempt the higher process.  I realize we
have rtprio, but it's a pain to use.  Anyways, don't let this last comment
distract.
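
To make the suggested rule concrete, here is one possible reading of it as a
predicate.  This only illustrates the proposal and is not anything the
scheduler does today.  The function name may_preempt and the choice to
compare plain nice values are assumptions.

```shell
#!/bin/sh
# Hypothetical predicate for the proposed +/- five rule.
# $1 = nice value of the thread that wants the CPU
# $2 = nice value of the CPU-busy thread currently running
# Succeeds (exit status 0) only if the candidate is no more than 5 nice
# levels lower in priority (numerically higher) than the running thread.
may_preempt() {
    candidate_nice=$1
    running_nice=$2
    [ $((candidate_nice - running_nice)) -le 5 ]
}
```

Under this reading, may_preempt 20 0 fails, so a nice 20 busy loop would
never displace a nice 0 one, while may_preempt 5 0 still succeeds.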



On Thu, Jul 8, 2021 at 3:20 AM Rozhuk Ivan <rozhuk.im@gmail.com> wrote:

> On Wed, 7 Jul 2021 13:47:47 -0400
> George Mitchell <george+freebsd@m5p.com> wrote:
>
> > CPU: AMD Ryzen 5 2600X Six-Core Processor (3600.10-MHz K8-class CPU)
> > (12 threads).
> >
> > FreeBSD 12.2-RELEASE-p7 r369865 GENERIC  amd64 (SCHED_ULE) vs
> > FreeBSD 12.2-RELEASE-p7 r369865 M5P  amd64 (SCHED_4BSD).
> >
> > Comparing "make buildworld" time with misc/dnetc running vs not
> > running. (misc/dnetc is your basic 100% compute-bound task, running
> > at nice 20.)
> >
> > Three out of the four combinations build in roughly four hours, but
> > SCHED_ULE with dnetc running takes close to twelve!  (And that was
> > overnight with basically nothing else running.)  This is an even
> > worse disparity than I have seen in previous releases.
>
> I do not use dnetc, but SCHED_ULE on a 2700 compiles world faster than 4 hours.
> With ccache it takes ~10 minutes: world+kernel build and install and
> update loaders.
>
>
> # Make an SMP-capable kernel by default
> options         SMP                     #b Symmetric MultiProcessor Kernel
> options         NUMA                    #o Non-Uniform Memory Architecture support
> options         EARLY_AP_STARTUP        #o
>
> device          cpufreq                 #m for non-ACPI CPU frequency control
> device          cpuctl                  #m Provides access to MSRs, CPUID info and microcode update feature.
>
>
> # Kernel base
> options         SCHED_ULE               #b 4BSD/ULE scheduler
> options         _KPOSIX_PRIORITY_SCHEDULING #b POSIX P1003_1B real-time extensions
> options         PREEMPTION              #b Enable kernel thread preemption
>
>
> and sysctl tunings on desktop only:
> # SCHEDULER
> kern.sched.steal_thresh=1               # Minimum load on remote CPU before we'll steal // workaround for freezes
> kern.sched.balance=0                    # Enables the long-term load balancer
> kern.sched.balance_interval=1000        # Average period in stathz ticks to run the long-term balancer
> kern.sched.affinity=10000               # Number of hz ticks to keep thread affinity for
>
>
>
>