Re: Periodic rant about SCHED_ULE

From: Mark Millard <>
Date: Sat, 25 Mar 2023 18:23:04 UTC
On Mar 25, 2023, at 11:14, Mark Millard <> wrote:

> Peter <> wrote on
> Date: Sat, 25 Mar 2023 15:47:42 UTC :
>> Quoting George Mitchell <>:
>>> Thank you! -- George
>> You're welcome. Can I get a success/failure report?
>> ---------------------------------------------------------------------
>>>> On 3/22/23, Steve Kargl <> wrote:
>>>>> I reported the issue with ULE some 15 to 20 years ago.
>> Can I get the PR number, please?
>> ---------------------------------------------------------------------
>> Test usecase:
>> =============
>> Create two compute tasks competing for the same -otherwise unused- core, 
>> one without, one with syscalls: 
>> # cpuset -l 13 sh -c "while true; do :; done" & 
>> # tar cvf - / | cpuset -l 13 gzip -9 > /dev/null 
>> Within a few seconds the two tasks are balanced, running at nearly the
>> same PRI and each using 50% of the core:
>> 5166 root 1 88 0 13M 3264K RUN 13 9:23 51.65% sh 
>> 10675 root 1 87 0 13M 3740K CPU13 13 1:30 48.57% gzip 
>> This changes when the tar reaches /usr/include with its many small
>> files. Now smaller blocks are delivered to gzip, it does more 
>> syscalls, and things get ugly: 
>> 5166 root 1 94 0 13M 3264K RUN 13 18:07 95.10% sh 
>> 19028 root 1 81 0 13M 3740K CPU13 13 1:23 4.87% gzip 
> Why did PID 10675 change to 19028?
>> This does not happen because tar would be slow in moving data to 
>> gzip: tar reads from SSD, or more likely from ARC, and this is 
>> always faster than gzip-9. The imbalance is made by the scheduler.
> When I tried that tar line, I got lots of output to stderr:
> # tar cvf - / | cpuset -l 13 gzip -9 > /dev/null
> tar: Removing leading '/' from member names
> a .
> a root
> a wrkdirs
> a bin
> a usr
> . . .
> Was that an intentional part of the test?
> To avoid this I used:
> # tar cvf - / 2>/dev/null | cpuset -l 13 gzip -9 2>&1 > /dev/null
> At which point I get the likes of:
> 17129 root          1  68    0  14192Ki    3628Ki RUN     13   0:20   3.95% gzip -9
> 17128 root          1  20    0  58300Ki   13880Ki pipdwt  18   0:00   0.27% tar cvf - / (bsdtar)
> 17097 root          1 133    0  13364Ki    3060Ki CPU13   13   8:05  95.93% sh -c while true; do :; done
> early in the run.
> For reference, I also see the likes of the following from
> "gstat -spod" (it is a root on ZFS context with PCIe Optane media):
> dT: 1.063s  w: 1.000s
> L(q)  ops/s    r/s     kB   kBps   ms/r    w/s     kB   kBps   ms/w    d/s     kB   kBps   ms/d    o/s   ms/o   %busy Name
> . . .
>    0     68     68     14    937    0.0      0      0      0    0.0      0      0      0    0.0      0    0.0    0.1| nvd2
> . . .

I left it running and I'm now seeing:

17129 root          1 107    0  14192Ki    3628Ki CPU13   13   3:01  48.10% gzip -9
17128 root          1  21    0  58300Ki   15428Ki pipdwt  20   0:04   2.02% tar cvf - / (bsdtar)
17097 root          1 115    0  13364Ki    3060Ki RUN     13  16:30  51.77% sh -c while true; do :; done

gstat also now shows examples like:

dT: 1.063s  w: 1.000s
L(q)  ops/s    r/s     kB   kBps   ms/r    w/s     kB   kBps   ms/w    d/s     kB   kBps   ms/d    o/s   ms/o   %busy Name
. . .
    0   1213   1213      5   6456    0.0      0      0      0    0.0      0      0      0    0.0      0    0.0    1.2| nvd2
. . .

FYI: this is on a Threadripper 1950X.

It looks like what I see depends heavily on when I look:
gzip had about 4% of the core early on (with sh near 96%),
but later the two were close to 50/50. The details involved
matter.
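
Since the picture changes over the course of the run, one way to
capture it is to sample both processes periodically instead of
glancing at top now and then. The following is only a minimal sketch,
not something that was used in this thread: sample_procs is a
hypothetical helper name, and it assumes a ps(1) that accepts the
pid, pri, pcpu, and comm keywords with -o and a comma-separated PID
list with -p. ps may compute its CPU percentage differently from top,
so the figures need not match top's columns exactly.

```sh
# sample_procs COUNT INTERVAL PID...
# (hypothetical helper, for illustration only)
# Print one timestamped "pid pri pcpu comm" line per PID, COUNT times,
# sleeping INTERVAL seconds between rounds.
sample_procs() {
    count=$1
    interval=$2
    shift 2
    # Join the remaining arguments into the comma-separated form -p takes.
    pids=$(echo "$*" | tr ' ' ',')
    i=0
    while [ "$i" -lt "$count" ]; do
        now=$(date +%H:%M:%S)
        # An empty header text ("keyword=") on every column suppresses
        # the header line, so each round emits only data rows.
        ps -o pid= -o pri= -o pcpu= -o comm= -p "$pids" | sed "s/^/$now /"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

For the run above, something like "sample_procs 60 10 17129 17097"
would record the gzip and sh lines every 10 seconds for about 10
minutes, which makes it easier to see when the split moves between
roughly 4%/96% and roughly 50%/50%.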

Mark Millard
marklmi at