Re: ULE process to resolution

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Tue, 04 Apr 2023 19:24:42 UTC
Hello,

On 3/31/23, Jeff Roberson <jroberson@jroberson.net> wrote:
> As I read these threads I can state with a high degree of confidence that
> many of these tests worked with superior results with ULE at one time.
> It may be that tradeoffs have changed or exposed weaknesses, it may also
> be that it's simply been broken over time.  I see a large number of
> commits intended to address point issues and wonder whether we adequately
> explored the consequences.  Indeed I see solutions involving tunables
> proposed here that will definitively break other cases.
>

One of the reporters claims the bug they complain about has been there
since the early days. This made me curious how many of the problems
reproduce on something like 7.1 (dated 2009). To that end I created an
8 core vm and ran a bunch of tests on it, in addition to main. All 3
problems reported below reproduced there, though with no X testing :)

Bugs (one not reported in the other thread):
1. threads walking around the machine when spending little time off
cpu, all while the machine is otherwise idle

The problem with this on bare metal is that the victim cpu may be
partially powered off, so on top of whatever other migration costs
there are, there is latency stemming from poking it back up.
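
A trivial way to watch this happen: a process which is busy most of
the time, goes off cpu only briefly and reports whenever it wakes up
on a different cpu. A sketch, assuming sched_getcpu(3) is available
(13.0+, or Linux):

/*
 * walk.c -- report when a mostly-busy process changes cpu.
 * Sketch only. Build: cc -O0 -o walk walk.c
 */
#define _GNU_SOURCE     /* sched_getcpu() on Linux; harmless on FreeBSD */
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        static char buf[1024 * 1024];
        int i, cpu, prev = -1;

        for (;;) {
                /* stay on cpu for a while... */
                for (i = 0; i < 1000; i++)
                        memset(buf, i, sizeof(buf));
                /* ...then go off cpu only briefly */
                usleep(1000);
                cpu = sched_getcpu();
                if (cpu != prev) {
                        printf("now on cpu %d\n", cpu);
                        prev = cpu;
                }
        }
}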

I noticed this a few years back when looking at postgres -- both the
server and pgbench would walk around everywhere, reducing perf. I
checked that this still reproduces on fresh main. The box at hand has
2 sockets * 10 cores * 2 threads.

I *suspect* this is adequately modeled with a microbenchmark from
https://github.com/antonblanchard/will-it-scale/ named
context_switch1_processes -- it too experiences the all-machine walk
unless explicitly bound (pass -n to *not* bind it). I verified its
workers walk all around on 7.1 as well, but I don't know if postgres
also would.
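
For reference, the gist of that benchmark is a pair of processes
ping-ponging between each other -- each spends only a moment off cpu
per iteration. A minimal sketch in the same spirit (not the actual
will-it-scale code, which may use a different mechanism):

/*
 * pingpong.c -- 2 processes bouncing a byte over a pair of pipes.
 * Sketch only. Build: cc -O0 -o pingpong pingpong.c
 */
#include <err.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        int ab[2], ba[2];
        unsigned long i;
        char c = 'x';

        if (pipe(ab) == -1 || pipe(ba) == -1)
                err(1, "pipe");

        switch (fork()) {
        case -1:
                err(1, "fork");
        case 0:
                /* child: wait for a byte, send it back */
                for (;;) {
                        if (read(ab[0], &c, 1) != 1)
                                _exit(0);
                        if (write(ba[1], &c, 1) != 1)
                                _exit(0);
                }
        default:
                /* parent: drive the ping-pong, report progress */
                for (i = 1;; i++) {
                        if (write(ab[1], &c, 1) != 1)
                                err(1, "write");
                        if (read(ba[0], &c, 1) != 1)
                                err(1, "read");
                        if ((i % 1000000) == 0)
                                printf("%lu round trips\n", i);
                }
        }
}

Run it unbound and watch in top where the pair ends up over time, then
compare with cpuset -l pinning.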

how to bench:
su - postgres
/usr/local/bin/pg_ctl -D /var/db/postgres/data15 -l logfile start
pgbench -i -s 10
pgbench -M prepared -S -T 800000 -c 1 -j 1 -P1 postgres

... and you are in.

2. unfairness when oversubscribing with cpu hogs

Steve Kargl claims he has reported this one numerous times since the
early days of ULE; I confirmed it was a problem on 7.1 and is still a
problem today.

Say an 8 core vm (making sure its vcpus are pinned to distinct cores
on the host).

I'm going to copy paste my other message here:
I wrote a cpu burning program (memset 1 MB in a loop, with enough
iterations to take ~20 seconds on its own).

I booted an 8 core bhyve vm, where I made sure to cpuset it to 8 distinct cores.

The test runs *9* workers, here is a sample run:
[paste]
4bsd:
       23.18 real        20.81 user         0.00 sys
       23.26 real        20.81 user         0.00 sys
       23.30 real        20.81 user         0.00 sys
       23.34 real        20.82 user         0.00 sys
       23.41 real        20.81 user         0.00 sys
       23.41 real        20.80 user         0.00 sys
       23.42 real        20.80 user         0.00 sys
       23.53 real        20.81 user         0.00 sys
       23.60 real        20.80 user         0.00 sys
187.31s user 0.02s system 793% cpu 23.606 total

ule:
       20.67 real        20.04 user         0.00 sys
       20.97 real        20.00 user         0.00 sys
       21.45 real        20.29 user         0.00 sys
       21.51 real        20.22 user         0.00 sys
       22.77 real        20.04 user         0.00 sys
       22.78 real        20.26 user         0.00 sys
       23.42 real        20.04 user         0.00 sys
       24.07 real        20.30 user         0.00 sys
       24.46 real        20.16 user         0.00 sys
181.41s user 0.07s system 741% cpu 24.465 total
[/paste]

While ule spends fewer *cycles* in total, the run takes longer in real
time and the spread between the workers is much bigger, which is
*probably* bad.

you can repro with:
https://people.freebsd.org/~mjg/.junk/cpuburner1.c
cc -O0 -o cpuburner1 cpuburner1.c
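
In case the link goes stale: per the description above it boils down
to memsetting a buffer of a given size a given number of times, so
roughly the sketch below (the file at the URL is the one the numbers
were collected with):

/*
 * cpuburner1.c, roughly: memset a buffer of argv[1] bytes, argv[2] times.
 * Sketch based on the description above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char **argv)
{
        char *buf;
        size_t size;
        unsigned long i, iters;

        if (argc != 3) {
                fprintf(stderr, "usage: %s size iterations\n", argv[0]);
                return (1);
        }
        size = strtoul(argv[1], NULL, 10);
        iters = strtoul(argv[2], NULL, 10);
        if ((buf = malloc(size)) == NULL) {
                fprintf(stderr, "malloc failed\n");
                return (1);
        }
        for (i = 0; i < iters; i++)
                memset(buf, i, size);
        return (0);
}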

and a magic script:
#!/bin/sh
# usage: burn.sh instances size iterations

ins=$1

shift

while [ $ins -ne 0 ]; do
        time ./cpuburner1 $1 $2 &
        ins=$((ins-1))
done

wait

run like this (the 9 workers from above), picking the last number so
that a single instance takes 20-ish seconds on your cpu:
sh burn.sh 9 1048576 500000

3. threads struggling to get back on cpu against nice -n 20 hogs

This acutely affects buildkernel.

I once more played around; the bug was already there in 7.1, extending
total buildkernel time from ~4 minutes to ~30.

The problem is introduced by the machinery which attempts to provide
fairness for pri <= PRI_MAX_BATCH. I verified that by straight up
removing all of it: buildkernel then managed to finish in sensible
time, but the cpu hogs were overly negatively affected -- they got
little cpu time and it was distributed very unfairly between them. The
key point though is that buildkernel *can* stay close to its base time.

I had seen the patch from https://reviews.freebsd.org/D15985 ; it does
not fix the problem, but it does alleviate it to some extent. It is
weirdly hacky and seems to target just the testcase you had rather
than the more general problem.

I applied it to a 2018-ish tree so that there are no woes from rebasing.
stock:          290.95 real 2048.22 user 247.967 sys
stock+hogs:     883.81 real 2111.34 user 189.42 sys
patched+hogs:   460.84 real 2055.63 user 232.00 sys

Interestingly, the stock kernel from that period is less affected by
the general problem, but it is still pretty bad. With the patch things
improve markedly, but there is still over a 50% increase in real time
(460.84 vs 290.95), which is way too much for being paired against
nice -n 20 hogs.

https://people.freebsd.org/~mjg/.junk/cpuburner2.c
cc -O0 -o cpuburner cpuburner2.c
(note the binary name: the script below runs ./cpuburner)
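
Judging from the script below it takes just a size and burns cpu until
pkill'ed, so presumably the endless variant of cpuburner1, along these
lines (again, the file at the URL is the authoritative version):

/*
 * cpuburner2.c, roughly: memset a buffer of argv[1] bytes in an
 * endless loop, until killed. Sketch only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char **argv)
{
        char *buf;
        size_t size;
        unsigned long i;

        if (argc != 2) {
                fprintf(stderr, "usage: %s size\n", argv[0]);
                return (1);
        }
        size = strtoul(argv[1], NULL, 10);
        if ((buf = malloc(size)) == NULL) {
                fprintf(stderr, "malloc failed\n");
                return (1);
        }
        for (i = 0;; i++)
                memset(buf, i, size);
}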

magic script:
#!/bin/sh
# usage: burn+bk.sh workers nice size buildkernel-jobs

workers=$1
n=$2
size=$3
bkw=$4

echo workers $workers nice $n buildkernel $bkw

shift

while [ $workers -ne 0 ]; do
        time nice -n $n ./cpuburner $size &
        workers=$((workers-1))
done

time make -C /usr/src -ssss -j $bkw buildkernel > /dev/null

# XXX webdev-style
pkill cpuburner

wait

sample use: time sh burn+bk.sh 8 20 1048576 8

I figured there would be a regression test suite available, with tests
checking what happens for known cases with possibly contradictory
requirements. I got nothing; instead I found people use hackbench (:S)
or just a workload.

All that said, I'm buggering off the subject. My interest in it was
limited to the nice problem, since I have pretty good reasons to
suspect it is what causes the instances of pathologically long total
real time seen with package builds.

Have fun,
-- 
Mateusz Guzik <mjguzik gmail.com>