TTY task group scheduling

Bruce Evans brde at optusnet.com.au
Sat Nov 20 14:31:09 UTC 2010


On Fri, 19 Nov 2010, Kostik Belousov wrote:

> On Fri, Nov 19, 2010 at 11:50:49AM +0200, Andriy Gapon wrote:
>> on 19/11/2010 11:46 Bruce Cran said the following:
>>> [removed current@ and stable@ from the Cc list]
>>>
>>> On Fri, 19 Nov 2010 15:41:29 +1100
>>> Andrew Reilly <areilly at bigpond.net.au> wrote:
>>>
>>>> On Linux.  Have you ever seen those sorts of UI problems on FreeBSD?

Not since FreeBSD-1 or earlier, but I don't run much bloatware.

>>>> I don't watch much video on my systems, but I haven't seen that.
>>>> FreeBSD has always been good at keeping user-interactive processes
>>>> responsive while compiles or what-not are going on in the background.
>>>
>>> I've definitely seen problems when running builds in an xterm. I've
>>> often resorted to canceling it and running it on a syscons console
>>> instead to improve performance.
>>
>> So, what was it a problem with scheduler or with, e.g., "something X"
>> being too slow rendering glyphs? Who can tell...
>
> Probably will pay a lot in negative karma by posting anything in the
> thread. But I can confirm your words, that tty->xterm->X server chain
> of output indeed significantly slows down the build processes.

I just tried a kernel build with -j256 on a 1-core system, to be unreasonable,
and didn't see any sluggishness (and I notice programs taking > 10 msec to
start up), but this was under my version of 5.2 with my version of SCHED_4BSD.

> I usually never start build in the barebone xterm, always running screen
> under xterm. make -j 10 on 4 core/HTT cpu slows up to a half, from my
> unscientific impression, when run in the active screen window. Switching
> to other window in screen significantly speeds it up (note the prudent
> omission of any measured numbers).

For me, make -s -j 256 on 1 core ran at the same speed in an xterm with
another xterm watching it using top.  Without -s it took 5% longer.  The
X output is apparently quite slow.  But I rarely run X.  Syscons output
is much more efficient.

make(1) has interesting problems determining when jobs finish.  It used to
wait 10 msec (?), and that gave a lot of dead time when 10 msec became a
long time relative to a process's runtime.  Maybe X is interfering with its
current mechanism.
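
I haven't looked at make's current mechanism recently, but the dead time
from a fixed poll interval is easy to see in a toy program (this is not
make's job-control code; the 10 msec and 1 msec figures are only
illustrative):

#include <err.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run 100 short "jobs" and poll for their completion every 10 msec.
 * Each job takes about 1 msec, so up to ~9 msec per job is dead time,
 * roughly 1 second wasted in total.  Blocking in waitpid() (dropping
 * the WNOHANG loop) would remove almost all of it.
 */
int
main(void)
{
    pid_t pid;
    int i, status;

    for (i = 0; i < 100; i++) {
        pid = fork();
        if (pid == -1)
            err(1, "fork");
        if (pid == 0) {
            usleep(1000);           /* ~1 msec of "work" */
            _exit(0);
        }
        while (waitpid(pid, &status, WNOHANG) == 0)
            usleep(10000);          /* 10 msec poll */
    }
    printf("done\n");
    return (0);
}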

During the make -j256, the load average went up to about 100 and most
of the cc1's reached a low (numerically high) priority very quickly,
especially on the second run, when the load average was high to start with
(my version of SCHED_4BSD may affect how fast or slowly the priority
ramps up).  An interactive process competing with these cc1's has a very
easy time getting scheduled to run provided it is not a bloated one that
runs enough to gain a high priority itself.  If it runs as much as the
cc1's then it will become just one of 257 processes wanting to run and
it takes a very unfair scheduler to do much better than run 1 every
<quantum> (default 100 msec) and thus take a default of 25.7 seconds to
get back to the interactive one.
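
Spelled out, that worst case is just the length of one round-robin cycle
over all runnable processes:

    257 runnable processes * 100 msec/quantum = 25.7 seconds

between runs of any given process, interactive or not.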

At least old versions of SCHED_4BSD had significant bugs that often
resulted in very unfair scheduling that happened to favour interactive
processes but sometimes went the other way.  The most interesting one
is still there :-( : sched_exit_thread() adds the child td_estcpu to
the parent td_estcpu.  Combined with the inheritance of td_estcpu on
fork(), this results in td_estcpu being exponential in the number of
reaped children, except td_estcpu is clamped to a maximum, so td_estcpu
quickly reaches the maximum td_estcpu (and td_priority quickly reaches
the minimum (numerical maximum) user priority) after a few fork/exit/waits.
The process doing the fork/waits is often a shell, and its interactivity
becomes bad when its priority becomes low.  Between about 1995 and 2000,
this bug was much worse.  Then there was no clamp, so td_estcpu was fully
exponential in the number of children, except after about 32 doublings
it overflowed to a negative value.  But before it became negative, it
became large, so its process gained the maximum priority and therefore
found it hard to run enough to create more children.  This still happens
with the clamp, but "large" is now quite small and decays after a few
seconds or minutes.  Without the clamp, the decay took minutes or hours
if not days.  The doubling is fixed in my version by setting the parent
td_estcpu to the maximum of the parent and child estcpu's on exit.  This
risks not inheriting enough (I now see a simple better method: add only
the part of the child's td_estcpu that was actually due to child activity
and is not just virtual cpu created on fork).  The doubling was originally
implemented to improve interactivity, and it "worked" bogusly by inhibiting
forks.  E.g., for -j 256, it would probably stop make forking long before
it created 256 jobs.  Now, with the clamp, make will just take longer to
create the 256 jobs once reaping the first few of them has increased its
td_estcpu above that of the jobs it started.
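
To make the doubling concrete, here is a throwaway simulation of just the
fork-inherit/exit-add arithmetic (it is not the kernel code; the clamp
value, the starting estcpu, the function names and the assumption that the
child does no work of its own are all made up for illustration; reap_add
models the old sched_exit_thread() behaviour and reap_max models the max()
fix described above):

#include <stdio.h>

#define ESTCPU_MAX  100     /* stand-in for the real clamp */

/*
 * Old behaviour: the child inherits the parent's td_estcpu on fork, and
 * the whole child td_estcpu is added back to the parent on exit, so each
 * fork/exit/wait cycle roughly doubles the parent's value until the
 * clamp is hit.
 */
static int
reap_add(int parent, int child)
{
    int e = parent + child;

    return (e > ESTCPU_MAX ? ESTCPU_MAX : e);
}

/* The fix: take the maximum, so the cycles no longer compound. */
static int
reap_max(int parent, int child)
{
    return (parent > child ? parent : child);
}

int
main(void)
{
    int old = 1, fixed = 1, i;

    for (i = 1; i <= 10; i++) {
        /*
         * The child starts with the parent's estcpu (fork
         * inheritance) and, in this model, does no work of its
         * own before exiting.
         */
        old = reap_add(old, old);
        fixed = reap_max(fixed, fixed);
        printf("after %2d fork/exit/waits: add=%3d max=%3d\n",
            i, old, fixed);
    }
    return (0);
}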

Well, I tried this under -current, but only have SCHED_ULE handy to test
(on a FreeBSD cluster machine).  -j256 didn't seem to be enough to cause
latency (even over the Pacific link).  Interactivity remained perfect with
-j1512.  The only noticeable difference (apart from 8 cores) in top was that
the load average barely reached 15 (instead of 100) with a couple of hundred
sleeping processes.  8 cores might do that.

Bruce

