Re: Periodic rant about SCHED_ULE

From: Steve Kargl <sgk_at_troutmask.apl.washington.edu>
Date: Fri, 24 Mar 2023 20:25:16 UTC
On Fri, Mar 24, 2023 at 12:47:08PM -0700, Mark Millard wrote:
> Steve Kargl <sgk_at_troutmask.apl.washington.edu> wrote on
> Wed, 22 Mar 2023 19:04:06 UTC:
> 
> > I reported the issue with ULE some 15 to 20 years ago.
> > I gave up reporting the issue. The individuals with the
> > requisite skills to hack on ULE did not; and yes, I lack
> > those skills. The path of least resistance is to use
> > 4BSD.
> > 
> > % cat a.f90
> > !
> > ! Silly numerically intensive computation.
> > !
> > program foo
> >   implicit none
> >   integer, parameter :: m = 200, n = 1000, dp = kind(1.d0)
> >   integer i
> >   real(dp) x
> >   real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
> >   call random_init(.true., .true.)
> >   allocate(a(n,n), b(n,n))
> >   do i = 1, m
> >     call random_number(a)
> >     call random_number(b)
> >     c = matmul(a,b)
> >     x = sum(c)
> >     if (x < 0) stop 'Whoops'
> >   end do
> > end program foo
> > % gfortran11 -o z -O3 -march=native a.f90 
> > % time ./z
> > 42.16 real 42.04 user 0.09 sys
> > % cat foo
> > #! /bin/csh
> > #
> > # Launch NCPU+1 images with a 1 second delay
> > #
> > foreach i (1 2 3 4 5 6 7 8 9)
> >   ./z &
> >   sleep 1
> > end
> > % ./foo
> > 
> > In another xterm, you can watch the 9 images.
> > 
> > % top
> > last pid: 1709; load averages: 4.90, 1.61, 0.79 up 0+00:56:46 11:43:01
> > 74 processes: 10 running, 64 sleeping
> > CPU: 99.9% user, 0.0% nice, 0.1% system, 0.0% interrupt, 0.0% idle
> > Mem: 369M Active, 187M Inact, 240K Laundry, 889M Wired, 546M Buf, 14G Free
> > Swap: 16G Total, 16G Free
> > 
> >   PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    CPU COMMAND
> >  1699 kargl         1  56    0    68M    35M RUN      3   0:41 92.60% z
> >  1701 kargl         1  56    0    68M    35M RUN      0   0:41 92.33% z
> >  1689 kargl         1  56    0    68M    35M CPU5     5   0:47 91.63% z
> >  1691 kargl         1  56    0    68M    35M CPU0     0   0:45 89.91% z
> >  1695 kargl         1  56    0    68M    35M CPU2     2   0:43 88.56% z
> >  1697 kargl         1  56    0    68M    35M CPU6     6   0:42 88.48% z
> >  1705 kargl         1  55    0    68M    35M CPU1     1   0:39 88.12% z
> >  1703 kargl         1  56    0    68M    35M CPU4     4   0:39 87.86% z
> >  1693 kargl         1  56    0    68M    35M CPU7     7   0:45 78.12% z
> > 
> > With 4BSD, you see all the ./z's with 80% or greater CPU. All the ./z's exit
> > after 55-ish seconds. If you try this experiment on ULE, you'll get NCPU-1
> > ./z's with nearly 99% CPU and 2 ./z's with 45-ish% CPU as the
> > two images ping-pong on one cpu. Back when I was testing ULE vs 4BSD,
> > this was due to ULE's cpu affinity, where processes never migrate to
> > another cpu. Admittedly, that was several years ago. Maybe ULE has
> > gotten better, but George's rant seems to suggest otherwise.
> 
> Note: I'm only beginning to explore your report/case.
> 
> There is a significant difference between your report and
> George's report: his is tied to nice use (and I've
> replicated there being SCHED_4BSD vs. SCHED_ULE
> consequences in the same direction George reports,
> but with much larger process counts involved). In
> those types of experiments, without the nice use
> I did not find notable differences. But it is a
> rather different context than your examples. Thus
> the below as a start on separate experiments closer
> to what you report using.

Yes, I recognize George's case is different.  However,
the common problem is ULE.  My test case shows/suggests
that ULE is unsuitable for an HPC platform.

> Not (yet) having a Fortran setup, I did some simple
> experiments with stress --cpu N (N processes looping
> sqrt calculations) and top. In top I sorted by pid
> to make it obvious if a fixed process was getting a
> fixed CPU or WCPU. (I tried looking at both CPU and
> WCPU, varying the time between samples as well. I
> also varied stress's --backoff N.) This was on a
> ThreadRipper 1950X (32 hardware threads, so 16 cores)
> that was running:

You only need a numerically intensive program that runs
for 30-45 seconds.  I use Fortran every day and wrote the
silly example in 5 minutes.  The matrix multiplication
of two 1000x1000 double precision matrices has two
benefits for this synthetic benchmark: it takes 40-ish
seconds on my hardware (AMD FX-8350), and it blows out
the cpu cache.
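
If you don't have a Fortran toolchain handy, a rough C
equivalent is sketched below (a sketch only, untested here;
tune M until a run takes 30-45 seconds on your hardware):

/*
 * Rough C analogue of the Fortran test: repeated 1000x1000
 * double-precision matrix multiplies.
 */
#include <stdlib.h>

#define N 1000
#define M 20    /* assumption: tune for a 30-45 second run */

static double a[N][N], b[N][N], c[N][N];

int
main(void)
{
    int i, j, k, iter;
    double x;

    for (iter = 0; iter < M; iter++) {
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                a[i][j] = (double)rand() / RAND_MAX;
                b[i][j] = (double)rand() / RAND_MAX;
            }
        /* Naive triple loop; blows out the cpu cache like matmul(). */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                x = 0.0;
                for (k = 0; k < N; k++)
                    x += a[i][k] * b[k][j];
                c[i][j] = x;
            }
        x = 0.0;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                x += c[i][j];
        if (x < 0)    /* mirrors the Fortran 'Whoops'; keeps the work live */
            return (1);
    }
    return (0);
}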

> This seems at least suggestive that, in my context, the
> specific old behavior that you report does not show up,
> at least on the timescales that I was observing. It
> still might not be something you would find appropriate,
> but it does appear to at least be different.
> 
> There is the possibility that stress --cpu N leads to
> more being involved than I expect and that such is
> contributing to the behavior that I've observed.

I can repeat the openmpi testing, but it will have to
wait a few weeks as I have a pressing deadline.
The openmpi program is a classic boss-worker scenario
(and an almost perfectly parallel application with little
communication overhead).  The boss starts, initializes the
environment, and then launches the numerically intensive
workers.  If boss + n workers > ncpu, you get a boss and
a worker sharing one cpu.  If the boss and that worker
ping-pong, it stalls the entire program.  Roughly, the
structure is as sketched below.
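
(A from-memory sketch, not the actual program; rank 0 is
the boss, and the loop count is arbitrary.)

/*
 * Minimal boss-worker MPI sketch.  Rank 0 is the boss; the
 * other ranks do the numeric work and report one result each.
 */
#include <math.h>
#include <mpi.h>

int
main(int argc, char **argv)
{
    int rank, size, i;
    double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Boss: initialize, then collect one result per worker. */
        for (i = 1; i < size; i++)
            MPI_Recv(&x, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD,
                MPI_STATUS_IGNORE);
    } else {
        /* Worker: numerically intensive, little communication. */
        for (i = 0; i < 100000000; i++)
            x += sqrt((double)i);
        MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return (0);
}

Launched as, say, mpirun -np 9 on an 8-cpu box, the boss
plus one worker end up on the same cpu, which is where the
ping-pong bites.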

Admittedly, I last tested years ago.  ULE may have
improved since then.

-- 
Steve