Re: Periodic rant about SCHED_ULE

From: Mark Millard <marklmi_at_yahoo.com>
Date: Sat, 25 Mar 2023 22:35:36 UTC
> On Mar 25, 2023, at 14:51, Peter <pmc@citylink.dinoex.sub.org> wrote:
> 
> On Sat, Mar 25, 2023 at 01:41:16PM -0700, Mark Millard wrote:
> ! On Mar 25, 2023, at 11:58, Peter <pmc@citylink.dinoex.sub.org> wrote:
> 
> ! > ! 
> ! > ! At which point I get the likes of:
> ! > ! 
> ! > ! 17129 root          1  68    0  14192Ki    3628Ki RUN     13   0:20   3.95% gzip -9
> ! > ! 17128 root          1  20    0  58300Ki   13880Ki pipdwt  18   0:00   0.27% tar cvf - / (bsdtar)
> ! > ! 17097 root          1 133    0  13364Ki    3060Ki CPU13   13   8:05  95.93% sh -c while true; do :; done
> ! > ! 
> ! > ! up front.
> ! > 
> ! > Ah. So? To me this doesn't look good. If both jobs are runnable, they
> ! > should each get ~50%.
> ! > 
> ! > ! For reference, I also see the likes of the following from
> ! > ! "gstat -spod" (it is a root on ZFS context with PCIe Optane media):
> ! > 
> ! > So we might assume that indeed both jobs are runnable, and the only
> ! > significant difference is that one does system calls while the other
> ! > doesn't.
> ! > 
> ! > The point of all this is: identify the malfunction with the
> ! > simplest use case. (And for me, there is a malfunction here.)
> ! > And then, obviously, fix it.
> ! 
> ! I tried the following, which still involves pipe I/O but avoids
> ! file system I/O (so: simplifying even more):
> ! 
> ! cat /dev/random | cpuset -l 13 gzip -9 >/dev/null 2>&1
> ! 
> ! mixed with:
> ! 
> ! cpuset -l 13 sh -c "while true; do :; done" &
> ! 
> ! So far what I've observed is just the likes of:
> ! 
> ! 17736 root          1 112    0  13364Ki    3048Ki RUN     13   2:03  53.15% sh -c while true; do :; done
> ! 17735 root          1 111    0  14192Ki    3676Ki CPU13   13   2:20  46.84% gzip -9
> ! 17734 root          1  23    0  12704Ki    2364Ki pipewr  24   0:14   4.81% cat /dev/random
> ! 
> ! Simplifying this much seems to get a different result.
> 
> Okay, then you have simplified too much and the malfunction is not
> visible anymore.
> 
> ! Pipe I/O by itself does not appear to lead to the
> ! behavior you are worried about.
> 
> How many bytes does /dev/random deliver in a single read()?
> 
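
(As an aside, a sketch only, not something tried as part of the above:
one could check what read sizes cat actually issues against /dev/random
by tracing it, using a finite consumer so that cat, and truss, exit on
their own. The /tmp/cat.trace name and the ~1 MiB cutoff are arbitrary
choices for the sketch.)

# trace cat's system calls; dd stops after about 1 MiB, cat then
# exits on SIGPIPE and truss finishes writing the trace
truss -o /tmp/cat.trace cat /dev/random | dd of=/dev/null bs=64k count=16 2>/dev/null
# each read( line shows the buffer size requested and the bytes returned
grep 'read(' /tmp/cat.trace | head
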
> ! Trying cat /dev/zero instead ends up similar:
> ! 
> ! 17778 root          1 111    0  14192Ki    3672Ki CPU13   13   0:20  51.11% gzip -9
> ! 17777 root          1  24    0  12704Ki    2364Ki pipewr  30   0:02   5.77% cat /dev/zero
> ! 17736 root          1 112    0  13364Ki    3048Ki RUN     13   6:36  48.89% sh -c while true; do :; done
> ! 
> ! It seems that, compared to using tar and a file system, some
> ! significant difference in context leads to the behavioral
> ! difference. It would probably be of interest to know what the
> ! distinction(s) are, in order to have a clue how to interpret
> ! the results.
> 
> I can tell you:
> With tar, tar likely cannot output data from more than one input
> file in a single output write(). So, when reading big files, we
> probably get 16k or more per system call over the pipe. But if the
> files are significantly smaller than that (e.g. in /usr/include),
> then gzip ends up doing more system calls per unit of time. And
> that makes a difference, because a system call goes into the
> scheduler and reschedules the thread.
> 
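
(Not measured as part of this, but a sketch of how that syscall-rate
difference could be quantified: /usr/bin/time -l reports the child's
voluntary context switches, i.e. how often gzip blocked and went back
through the scheduler. Feeding roughly the same total number of bytes
at two very different write sizes should show the contrast; the count=
figures below are only picked to make the totals comparable, about
64 MB each.)

# same total bytes, 128-byte writes vs. 16 KiB writes into the pipe
dd if=/dev/zero bs=128 count=500000 2>/dev/null | \
    /usr/bin/time -l cpuset -l 13 gzip -9 >/dev/null
dd if=/dev/zero bs=16k count=4000 2>/dev/null | \
    /usr/bin/time -l cpuset -l 13 gzip -9 >/dev/null
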
> This 95% vs. 5% imbalance is the actual problem that has to be
> addressed, because this is not suitable for me: I cannot have my
> tasks starving along at a tenth of the expected compute only because
> some number crunching also runs on the core.
> 
> Now, reading from /dev/random cannot reproduce it. Reading from
> tar can reproduce it under certain conditions - and that is all that
> is needed.

The suggestion that the size of the transfers into the
first pipe matters is backed up by experiments with the
likes of:

dd if=/dev/zero bs=128 | cpuset -l 13 gzip -9 >/dev/null 2>&1 &
vs.
dd if=/dev/zero bs=132 | cpuset -l 13 gzip -9 >/dev/null 2>&1 &
vs.
dd if=/dev/zero bs=133 | cpuset -l 13 gzip -9 >/dev/null 2>&1 &
vs.
dd if=/dev/zero bs=192 | cpuset -l 13 gzip -9 >/dev/null 2>&1 &
vs.
dd if=/dev/zero bs=1k | cpuset -l 13 gzip -9 >/dev/null 2>&1 &
vs.
dd if=/dev/zero bs=4k | cpuset -l 13 gzip -9 >/dev/null 2>&1 &
vs.
dd if=/dev/zero bs=16k | cpuset -l 13 gzip -9 >/dev/null 2>&1 &

(just examples), each paired up with:

cpuset -l 13 sh -c "while true; do :; done" &

This avoids the uncontrolled variability of using tar
against a file system.
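
For stepping through such a list of bs= values in one go, something
like the following sketch could be used (it assumes the "while true"
hog is already running pinned to CPU 13; the 30 second settle time and
the ps sampling are arbitrary choices):

for bs in 128 132 133 192 1k 4k 16k; do
    dd if=/dev/zero bs=$bs 2>/dev/null | cpuset -l 13 gzip -9 >/dev/null 2>&1 &
    gzpid=$!              # $! of a background pipeline is its last stage, i.e. gzip
    sleep 30              # let the scheduler settle
    echo "bs=$bs:"
    ps -o pid,pri,%cpu,command -p $gzpid
    kill $gzpid           # dd then exits via SIGPIPE
    wait
done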

But an interesting comparison/contrast results from, for
example:

dd if=/dev/zero bs=128 | cpuset -l 13 gzip -9 >/dev/null 2>&1 &
vs.
dd if=/dev/random bs=128 | cpuset -l 13 gzip -9 >/dev/null 2>&1 &

each paired with: cpuset -l 13 sh -c "while true; do :; done" &

At least in my context, the /dev/zero one ends up with:

18251 root          1  68    0  14192Ki    3676Ki RUN     13   0:02   1.07% gzip -9
18250 root          1  20    0  12820Ki    2484Ki pipewr  29   0:02   1.00% dd if=/dev/zero bs=128
18177 root          1 135    0  13364Ki    3048Ki CPU13   13  14:47  98.93% sh -c while true; do :; done

but the /dev/random one ends up with:

18253 root          1 108    0  14192Ki    3676Ki CPU13   13   0:09  50.74% gzip -9
18252 root          1  36    0  12820Ki    2488Ki pipewr  30   0:03  16.96% dd if=/dev/random bs=128
18177 root          1 115    0  13364Ki    3048Ki RUN     13  15:45  49.26% sh -c while true; do :; done

It appears that the CPU time used by the dd feeding the first
pipe (and possibly other characteristics of it) matters for the
overall result, not just the bs= value used.
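
For what it is worth, the standalone CPU cost of each dd source can be
checked directly (a sketch, not part of the figures above), with the
same byte count from each and no gzip or sh hog involved:

# 128 MB from each source; time -l reports user/sys time and rusage
/usr/bin/time -l dd if=/dev/zero of=/dev/null bs=128 count=1000000
/usr/bin/time -l dd if=/dev/random of=/dev/null bs=128 count=1000000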


===
Mark Millard
marklmi at yahoo.com