Re: Periodic rant about SCHED_ULE

From: Ian Lepore <ian_at_freebsd.org>
Date: Tue, 13 Jul 2021 22:22:05 UTC
On Tue, 2021-07-13 at 18:09 -0400, Zaphod Beeblebrox wrote:
> I opened https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=257160
> regarding
> the following:
> 
> SCHED_4BSD seems subject to a bit of rot at this point.  To wit, my
> 4-core riscv64 platform recently showed this top while doing a make -j4
> of my own code.  Note that each of the processes using more than 1000%
> CPU is single-threaded.
> 
>   PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
>   604 dgilbert      1  45    0   109M    66M CPU3     3   0:02 1039.89% c++
>   605 dgilbert      1  45    0   109M    66M CPU1     1   0:02 1031.29% c++
>   606 dgilbert      1  45    0   109M    66M RUN      2   0:02 1020.32% c++
>   603 dgilbert      1  44    0   109M    66M CPU0     0   0:02 1011.41% c++
>   854 root          1  40    0    17M  4764K select   1   3:04   0.17% tmux
>   425 root          1  40    0    14M  4040K CPU2     2   0:03   0.15% top
> 
> As I said there, I don't believe that this is RISCV64 related --- it
> seems
> to me that the data that top is pulling is either incorrect or top is
> interpreting it incorrectly.  The WCPU value seems to asymptotically
> approach 100%, but I'm not sure of that --- I can only watch it for
> so
> long.  The same behaviour is seen if you launch (while true; do true;
> done)
> & in the background.
> 
> But OTOH, if you are running SCHED_ULE, and you launch two of those
> while
> true's at nice -20 for each cpu ... then launch one at nice '0' ...
> you'll
> find that the nice 0 process fails to get 100% cpu.  To my mind, this
> is a
> failure of the scheduler to read my intentions of nice -20.  In fact,
> at
> times, the processor share of the un-nice process will fall below
> some of
> the nice processes for a few dozen samples at a time.  Here is a top
> displaying that brokenness...
> 
>   PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
> 36410 root          1  89    0    14M   796K RUN      3   0:18  54.31% bash
> 36370 root          1 106   20    14M   800K RUN      1   0:58  49.86% bash
> 36372 root          1 105   20    14M   800K CPU1     1   0:56  49.69% bash
> 36375 root          1 106   20    14M   800K RUN      0   0:57  46.37% bash
> 36373 root          1 103   20    14M   800K RUN      3   0:56  44.94% bash
> 36371 root          1 105   20    14M   800K CPU0     0   0:57  43.51% bash
> 36376 root          1 105   20    14M   800K RUN      2   0:59  38.76% bash
> 36369 root          1 104   20    14M   920K CPU2     2   0:57  37.61% bash
> 36374 root          1 104   20    14M   800K RUN      2   0:57  32.66% bash
> 
> TBH, I think SCHED_ULE is a failure and the only reason more people
> don't
> think so is that processors are now largely too fast for people to
> care.
> Most people don't notice the scheduler because they almost never have
> more
> tasks than processor threads, so even really dumb schedulers would
> work out
> "OK" 98% of the time.
> 
> I know we don't have guiding principles for nice, but I would toss
> out the
> +/- five rule for it --- that any process more than 5 nice levels
> lower
> from a cpu-busy process shouldn't preempt the higher process.  I
> realize we
> have rtprio, but it's a pain to use.  Anyways, don't let this last
> comment
> distract.
> 
> 
> 
> On Thu, Jul 8, 2021 at 3:20 AM Rozhuk Ivan <rozhuk.im@gmail.com>
> wrote:
> 
> > On Wed, 7 Jul 2021 13:47:47 -0400
> > George Mitchell <george+freebsd@m5p.com> wrote:
> > 
> > > CPU: AMD Ryzen 5 2600X Six-Core Processor (3600.10-MHz K8-class
> > > CPU)
> > > (12 threads).
> > > 
> > > FreeBSD 12.2-RELEASE-p7 r369865 GENERIC  amd64 (SCHED_ULE) vs
> > > FreeBSD 12.2-RELEASE-p7 r369865 M5P  amd64 (SCHED_4BSD).
> > > 
> > > Comparing "make buildworld" time with misc/dnetc running vs not
> > > running. (misc/dnetc is your basic 100% compute-bound task,
> > > running
> > > at nice 20.)
> > > 
> > > Three out of the four combinations build in roughly four hours,
> > > but
> > > SCHED_ULE with dnetc running takes close to twelve!  (And that
> > > was
> > > overnight with basically nothing else running.)  This is an even
> > > worse disparity than I have seen in previous releases.
> > 
> > I do not use dnetc, but SCHED_ULE on a 2700 compiles world in under 4
> > hours.
> > With ccache it takes ~10 minutes: world+kernel build and install
> > and
> > update loaders.
> > 
> > 
> > # Make an SMP-capable kernel by default
> > options         SMP                     #b Symmetric MultiProcessor Kernel
> > options         NUMA                    #o Non-Uniform Memory Architecture support
> > options         EARLY_AP_STARTUP        #o
> > 
> > device          cpufreq                 #m for non-ACPI CPU frequency control
> > device          cpuctl                  #m Provides access to MSRs, CPUID info and microcode update feature.
> > 
> > 
> > # Kernel base
> > options         SCHED_ULE               #b 4BSD/ULE scheduler
> > options         _KPOSIX_PRIORITY_SCHEDULING #b POSIX P1003_1B real-time extensions
> > options         PREEMPTION              #b Enable kernel thread preemption
> > 
> > and sysctl tunings on desktop only:
> > # SCHEDULER
> > kern.sched.steal_thresh=1               # Minimum load on remote CPU before we'll steal // workaround for freezes
> > kern.sched.balance=0                    # Enables the long-term load balancer
> > kern.sched.balance_interval=1000        # Average period in stathz ticks to run the long-term balancer
> > kern.sched.affinity=10000               # Number of hz ticks to keep thread affinity for
> > 
> > 
> > 
> > 

top has been showing bad values for CPU% with SCHED_4BSD for many years,
on all architectures.  I remember Bruce Evans once commenting that it
had something to do with changes to clock handling in the kernel (maybe
related to when eventtimers first came in, but I might be misremembering
that detail).  If you ask top to display straight cpu instead of wcpu,
the results are much more sane.
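
As a rough illustration of how a weighted figure can overshoot (not a
diagnosis of the bug): assume top derives WCPU by dividing the kernel's
raw percentage by 1 - ccpu^t, where t is the process's residency time in
seconds and ccpu is SCHED_4BSD's decay constant exp(-1/20).  Both the
formula and the constant are recalled from memory rather than taken from
this thread, and the wcpu helper below is hypothetical.

```sh
# Hypothetical helper: weighted CPU for a raw percentage and a residency
# time in seconds.  ASSUMPTION: wcpu = pct / (1 - ccpu^t), ccpu = exp(-1/20),
# so ccpu^t = exp(-t/20).
wcpu() {
    awk -v pct="$1" -v t="$2" \
        'BEGIN { printf "%.0f\n", pct / (1 - exp(-t / 20)) }'
}

wcpu 100 2     # a process about two seconds old, already reported at 100%
wcpu 100 60    # the same raw figure after a minute
```

Under those assumptions a 2-second-old process at a raw 100% comes out
at roughly 1050%, in the same range as the quoted output, and the figure
decays toward 100% as t grows.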

I too wish that nice made a bigger difference, but that problem isn't
limited to SCHED_ULE; nice is little more than a vague hint even when
using SCHED_4BSD.  I eventually concluded that there's just no way to
run a compute-heavy workload (such as buildworld -j<ncpu>) using nice
and keep the machine responsive enough for interactive use.  I switched
to running builds with idprio, which isn't really onerous if you set
sysctl security.bsd.unprivileged_idprio=1 in /etc/sysctl.conf.
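
For reference, a minimal sketch of that setup.  The sysctl name is the
one given above; the priority value 31 (the lowest idle priority) and
the use of hw.ncpu for the job count are illustrative choices, not
something stated in this thread.

```sh
# One-time, as root: apply the setting now.  Also add the line
# security.bsd.unprivileged_idprio=1 to /etc/sysctl.conf so it persists.
sysctl security.bsd.unprivileged_idprio=1

# As a regular user: run the build at idle priority 31, one job per CPU.
idprio 31 make -j"$(sysctl -n hw.ncpu)" buildworld
```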

-- Ian