cvs commit: src/sys/kern sched_ule.c

Jeff Roberson jroberson at chesapeake.net
Tue Oct 2 13:37:15 PDT 2007


On Tue, 2 Oct 2007, Bruce Evans wrote:

> On Mon, 1 Oct 2007, Jeff Roberson wrote:
>
>> On Tue, 2 Oct 2007, Bruce Evans wrote:
>
>>> Further testing of my ~4BSD scheduler in ~5.2 indicates that when a
>>> process wants less than about 1/loadavg of the CPU on average, it
>>> usually just gets it, with no scheduling delays, since it usually has
>>> higher priority than all other user processes.  Otherwise, the worst-case
>>> scheduling delays increase from ~10 msec to ~2 seconds.  It is easy
>>> to reduce the scheduling quantum from its default of 100 msec by a
>>> factor of 100, but this doesn't seem to work right.  So the behaviour
>>> is very dependent on the load and on the amount of CPU wanted by the
>>> interactive process.
>
> [Read the middle of this bloated mail, about debugging ULE, first.]
>
> This is only for my ~5.2 etc. with the queuing hack backed out.  I
> think real 5.2 and 4.x act similarly, except at least 4.x has a bad
> policy for priority inheritance on fork/exit which can cause the
> priority to grow exponentially in the number of descendants (except
> it is clamped to a maximum, so the growth is just nonlinear and breaks
> various things when the limit is reached).  I tested a 4.10 kernel a bit
> today but didn't have enough 4.x utilities in my userland to see what
> it is doing.
>
> -current with 4BSD is much worse than this.  I observed a worst-case
> scheduling delay of > 26 seconds.  Mouse movements are jerky.
>
> -current with ULE, after debugging the configuration, is slightly worse
> than this.  Mouse movements aren't jerky.  But ULE seems to often
> mispredict when a process is interactive, and it sometimes gets into
> a state where one process (not an interactive one) is given 100% CPU
> for too long while many other processes are runnable.

Bruce,

Sorry I don't have time for a point-by-point reply on this one.  Thank you 
for your interesting analysis.  From this I'm taking away a few things:

1)  I've noticed that ULE has relied on PREEMPTION for a long time and 
lost the NEEDRESCHED setting in cases where it doesn't set owepreempt. 
Restoring this should improve some of the !PREEMPTION behavior and perhaps 
even responsiveness in your nice tests.
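
To make that concrete, here is a minimal self-contained sketch of the 
idea (simplified stand-ins only, not the actual sched_ule.c code or its 
structures; in FreeBSD a lower numeric priority is more important):

%%%
#include <stdio.h>

#define	TDF_NEEDRESCHED	0x01	/* stand-in for the real flag */

struct fake_thread {
	int	td_priority;	/* lower value = more important */
	int	td_owepreempt;
	int	td_flags;
};

static int	preemption = 1;		/* stand-in for the PREEMPTION option */
static int	preempt_thresh = 64;	/* as in kern.sched.preempt_thresh */

/*
 * A more important thread became runnable: either preempt the running
 * thread right away, or at least flag it for rescheduling so the hint
 * is not lost on !PREEMPTION kernels or above the preemption threshold.
 */
static void
maybe_preempt(struct fake_thread *running, int newpri)
{

	if (newpri >= running->td_priority)
		return;
	if (preemption && newpri < preempt_thresh)
		running->td_owepreempt = 1;
	else
		running->td_flags |= TDF_NEEDRESCHED;
}

int
main(void)
{
	struct fake_thread td = { 120, 0, 0 };

	preemption = 0;			/* model a !PREEMPTION kernel */
	maybe_preempt(&td, 80);
	printf("owepreempt=%d needresched=%d\n",
	    td.td_owepreempt, (td.td_flags & TDF_NEEDRESCHED) != 0);
	return (0);
}
%%%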

2)  I need to try running with hz = 100 and see if there are some scaling 
problems.  I have heard reports that ULE scales better than 4BSD up to 
higher hz values, but I haven't investigated this much.  It should work 
with lower values as well.  Everything important to relative priorities 
and time slice allotment runs off of stathz.

3)  The code which adjusts priorities for fork may need some more fine 
tuning.  ULE aggressively penalizes parents for forking expensive 
children.  This helps us learn that make should not create interactive 
children, for example.
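
As a rough illustration of the idea only (a toy model, not ULE's actual 
interactivity or fork/exit code): the child starts with the parent's 
CPU-usage history, and the child's run time is charged back to the parent 
when it exits, so a parent such as make that keeps forking CPU-bound 
children stops looking interactive.

%%%
#include <stdio.h>

struct hist {
	long	runtime;	/* ticks spent running */
	long	sleeptime;	/* ticks spent sleeping */
};

/* On fork the child starts with the parent's history. */
static void
model_fork(const struct hist *parent, struct hist *child)
{

	*child = *parent;
}

/* On exit the child's run time is charged back to the parent. */
static void
model_exit(struct hist *parent, const struct hist *child)
{

	parent->runtime += child->runtime;
}

/* Toy score: 0 is fully interactive, 100 is a pure CPU hog. */
static int
score(const struct hist *h)
{

	if (h->runtime + h->sleeptime == 0)
		return (0);
	return ((int)(100 * h->runtime / (h->runtime + h->sleeptime)));
}

int
main(void)
{
	struct hist make = { 10, 90 };	/* mostly sleeping so far */
	struct hist cc;

	printf("parent score before: %d\n", score(&make));
	model_fork(&make, &cc);
	cc.runtime += 500;		/* the child burns a lot of CPU */
	model_exit(&make, &cc);
	printf("parent score after:  %d\n", score(&make));
	return (0);
}
%%%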

4)  I don't think you're losing interrupts when you ctrl+c.  It's just 
taking too long for the interrupted task to run.  ctrl+z takes effect 
immediately when the signal is delivered.  This may be related to hz = 100 
or running without preemption.  I am not able to reproduce this problem 
with a standard GENERIC kernel + ULE.

I will look into these issues soon.

Thanks,
Jeff

>
>>> ...
>>> 
>>> I now have more experience with ULE.  A version built today gave
>>> dramatically worse interactivity, so much so that I think it must have
>>> been broken recently.  A simple shell loop hangs the rest of the system
>>> in some cases, and a background build has similar bad effects, probably
>>> limited mainly by useful loops not being endless.
>> 
>> I'm not able to reproduce this and no one else has reported it.
>
> This always happens with hz = 100.  Reducing preempt_thresh to below
> about 50 mostly fixes the problem, and reducing the threshold to 0
> fixes the problem a bit more.  The shell loop processes still take too
> long to start up (often several seconds for just 20), but the second
> process starts within a second, instead of showing signs of taking
> forever to start up.  Apparently, in the broken case, an IPI to stop
> the first process is never delivered.  ^Z works to stop the whole
> process group, and then two %'s to usually result in proceeding to
> the next process.  Having to use two %'s is strange but may be just
> a shell bug.
>
> -current with 4BSD also takes too long to start all the processes,
> while ~5.2 restarts them all apparently-instantly.  In fact it starts
> them too fast and runs into the old exec resource shortage bug after
> 16 processes and 3 or 4 of the starts fail in exec.
>
> With hz = 1000 and ULE, the default preempt_thresh of 64 works but
> reducing it to 0 works better.  Startup is still too slow.
>
> Apparently, either there is a scaling bug for hz or the extra interrupts
> from the larger hz help, and the default preempt_thresh is not best.
>
> I saw this behaviour for 2 different kernels:
> - SMP kernel (all this is running on an A64 UP in i386 mode) built on
>  Aug 5.  Timer interrupts were via the APIC.  hz was set to 100 at
>  boot time.  stathz was always 100 and in perfect sync with hz.
>  (Plain current with APIC timer interrupts gives a broken stathz of
>  13 when hz is 100, and stathz in bogus sync with hz.)
> - UP kernel built today.  Timer interrupts were via the i8254 and the
>  RTC.  hz was set to 100 or 1000 at boot time.  stathz was always
>  128.  The different interrupt configuration and timing (except for
>  increasing hz for ULE) made little difference.
>
> The SMP kernel got a bit further in the shell loop startup when hz = 100
> but otherwise behaved similarly.
>
>> This may be the result of some incompatibility between bdebsd and ULE.
>
> Nah, I don't use ULE in bdebsd (except all userland is bdebsd), and
> I don't touch schedulers in -current (I mainly touch filesystems and
> network drivers).  Current kernels are remarkably compatible with
> old userlands.
>
>> Is this a SMP machine?  Do you have PREEMPTION enabled?  ULE recently 
>> started honoring preemption.  Try setting:
>
> See above.  Always PREEMPTION for UP, since without it problems like the
> above are almost to be expected.  I think 5.2 has them.  ~5.2 preempts
> a lot as a side effect of switching context for clock interrupt handlers
> and then (without the queueing hack) rescheduling on switching back.
>
>> kern.sched.preempt_thresh: 64
>
> But this setting is part of the problem.
>
>> if it is not already.  I know you deal with hardclock differently. Without 
>> PREEMPTION it may not work correctly.
>
> No, the difference for hardclock is not in ULE kernels.
>
>>> First I tried an old regression test for nice[1-2]:
>>> 
>>> %%%
>>> for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>>> do
>>>    nice -$i sh -c "while :; do echo -n;done" &
>>> done
>>> top -o time
>>> %%%
>> 
>> I use this:
>> for i in -20 -16 -12 -8 -4 0 4 8 12 16 20
>> do
>>        nice -$i sh -c "while :; do echo -n;done" &
>> done
>> top -o time
>> 
>> I like to verify that the distribution doesn't get out of whack.  It takes
>
> Then non-multiple-of-4 entries in my list are almost useless.  I mostly
> use the [0-20] list because it is in the first file in a test directory
> and doesn't have any negative values so it doesn't need privilege to run.
>
>> some time to settle before the higher nice threads get enough runtime to 
>> sort properly.  My results are as so:
>
> The settling time/inertia is both a bug and a feature.  It's good to have
> inertia for long-running processes, but makeworld can start several hundred
> processes per second and finish many of them, so there is nowhere near
> enough settling time for these processes and their behaviour is hard to
> predict.
>
>>  868 root          1  81  -20  3492K  1404K RUN      0:28 23.58% sh
>>  869 root          1  83  -16  3492K  1404K RUN      0:20 15.09% sh
>>  870 root          1  86  -12  3492K  1404K RUN      0:16 12.16% sh
>>  871 root          1  90   -8  3492K  1404K RUN      0:12  8.89% sh
>>  872 root          1  93   -4  3492K  1404K RUN      0:11  7.96% sh
>>  873 root          1  97    0  3492K  1404K RUN      0:09  6.59% sh
>>  874 root          1 101    4  3492K  1404K RUN      0:08  4.88% sh
>>  875 root          1 105    8  3492K  1404K RUN      0:07  5.37% sh
>>  876 root          1 109   12  3492K  1404K RUN      0:06  3.37% sh
>>  877 root          1 113   16  3492K  1404K RUN      0:06  4.05% sh
>>  878 root          1 116   20  3492K  1404K RUN      0:05  3.96% sh
>> 
>> There really might not be enough difference with positive nice values.  I've 
>> never had a good feeling about how nice should really behave, but 
>> this mostly seems reasonable.  It would be possible to tweak the algorithm 
>> to further penalize nice.
>
> I still use a table-driven algorithm with weights 2**(nice_value/4).  This
> gives a dynamic range of a factor 1024.
>
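
As a concrete reading of that: weights of 2**(nice/4) run from
2**-5 = 1/32 at one end of the nice range to 2**5 = 32 at the other,
which is where the factor of 1024 comes from.  A tiny illustration of
such a table (illustrative only, not the actual scheduler code; compile
with -lm):

%%%
#include <math.h>
#include <stdio.h>

int
main(void)
{
	int nice;

	for (nice = -20; nice <= 20; nice += 4)
		printf("nice %3d  weight %9.5f\n", nice, pow(2.0, nice / 4.0));
	return (0);
}
%%%
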
>>> This hung after starting only about one of the shell processes.  After
>>> cutting the list down to just one process with nice -20, it still hung.
>>> Shells on other syscons terminals running at rtprio 0 could not compete
>>> with the nice -20 process:
>>> - they could not start top to look at what was happening
>>> - an already-running top could not display anything new
>>> - they could not start killall.
>>> With the list cut down to about 6 processes, ps in ddb showed evidence of 
>>> all the processes starting, and I was able to kill them all using
>>> kill in ddb.
>
> Fixed using larger hz and/or smaller preempt_thresh; ddb wasn't necessary
> since ^Z worked (if I hit it before ^C?) -- see above.
>
>>> [hz = 100 case not so bad]
>
> Other strange behaviour with preempt_thresh = 64, at least with hz = 100:
> start two identical CPU hogs, each with a runtime of 2.5 seconds, on
> separate consoles.  Then one is given 100% of the CPU until it completes,
> and it is always the second one started that gets 100% CPU first.  Thus
> the first one started takes about 5.0 seconds to complete and the second
> one started takes about 2.5 seconds to complete.
>
>>> Running makeworld with just -j4 in the background gives similar symptoms.
>>> When a new process is started, it sometimes gets too many cycles to
>>> begin with, and apparently completely stops all processes in the
>>> makeworld (but not the top displaying things) for several seconds.
>>> After a while (I guess when the interactivity score decreases), this
>>> behaviour changes to giving the new process very few cycles even if
>>> it is semi-interactive (a foreground process started from a shell).
>
> ~5.2 behaves similarly, but I think a little better.  In ~5.2 (and
> maybe in all schedulers), the initial priority is just a function of
> the parent's priority (I use a simple function that might be slightly
> different from 5.2's; I forget what it is).  If neither the parent nor
> the child runs for long, then new processes tend to get almost all the
> CPU until they run for too long.  When the children exit, the parent
> inherits some priority according to another simple function.  ~5.2
> works best here since it uses better functions than 5.2 does (much
> better than the exponential functions in 4.x), and it keeps track of
> history better than ULE can.
>
> I tested this mainly using:
>
> 	time /tmp/q1 & time /tmp/q1 & acroread *pdf   # type ^q to exit acroread
>
> where /tmp/q1 measures latency by calling clock_gettime() in a loop and
> there are 12 pdf files of total size 4.75MB.  acroread is sufficiently
> bloated and hoggish to have very bad behaviour here.  The results when
> this is run on an xterm that has initially been idle for some time (or
> is in some more magic state for ULE interactivity?) at loadavg 20 are
> approximately:
>
> 	all: acroread starts fast for the first few runs (would be ~ 1
> 	    second with no load; this only increases by a second or two)
>
> 	    /tmp/q1 runs for ~2.5 seconds self time and shows low max
> 	    latency (would be ~ 200 usec with no load; this increases to
> 	    ~10 msec; both high variance)
> 	~5.2-4BSD: after a few runs, the parent priority becomes near the
> 	    max so further runs take 5-10 seconds to start.  20 seconds at
> 	    a load avg of 20 would be fairer, but the parent priority
> 	    doesn't get as near the max as the background hogs' priorities.
>
> 	    After a few runs, max latency is usually 100-500 msec and was
> 	    once 2 seconds.
>
> 	    Latency in mouse movements is not noticeable
> 	current-4BSD: further runs don't take much longer to start.
> 	    Apparently the parent doesn't inherit enough priority.
> 	    (In 4.2 it inherited far too much.)
>
> 	    After a few runs, max latency is usually 1-2 seconds and was
> 	    once 27 seconds.
>
> 	    The latency of 1-2 seconds is often noticeable for mouse movements and
> 	    even for echo in xterms.
> 	current-ULE: further runs sometimes take _much_ longer, a minute
> 	    or so, and there is a high variance in the length.
>
> 	    After a few runs, max latency is usually a few hundred msec
> 	    larger than for ~5.2.
>
> 	    Latency in mouse movements is not noticeable
>
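
A latency probe along the lines of the /tmp/q1 described above can be as
small as the sketch below (a reconstruction of the idea, not the actual
program): it spins calling clock_gettime() and reports the largest gap
seen between consecutive samples, which approximates the worst scheduling
delay during the run.

%%%
#include <stdio.h>
#include <time.h>

#define	RUN_SECONDS	2.5	/* matches the ~2.5 second self time above */

static double
now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (ts.tv_sec + ts.tv_nsec / 1e9);
}

int
main(void)
{
	double start, prev, cur, maxgap;

	maxgap = 0.0;
	start = prev = now();
	while ((cur = now()) - start < RUN_SECONDS) {
		if (cur - prev > maxgap)
			maxgap = cur - prev;
		prev = cur;
	}
	printf("max latency: %.6f seconds\n", maxgap);
	return (0);
}
%%%
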
>
>>> In at least this phase, ^C to kill processes doesn't work, but ^Z to
>>> suspend them and then kill from the shell works normally, and 
>>> interactivity
>>> in not-very-bloated mail programs and editors is very bad.  A
>
> ^C fails only in the phase where hz is small, preempt_thresh is larger,
> and (?) the parent hasn't gained much priority and/or (negative?)
> interactivity.
>
>>> Other behaviour with 4BSD schedulers and various kernels:
>>> - the max scheduling delay is almost independent of the CPU speed.
>
> This may be because it is just a function of the priorities which are
> mainly a function of the algorithm.
>
>>> - the max scheduling delay is slightly worse for -current with 4BSD
>>>  than with my ~5.2.
>
> Actually, it is much worse.
>
>>> - -current has anomalous behaviour relative to ~5.2 for background
>>>  makeworld -j16: many fewer runnable processes, a much smaller max
>>>  load average, and many more zombies visible when top looks.
>
> This may be related to the slow startup of the shell loops and caused by
> the priority inheritance for fork/exit.
>
>>> - [queue hack]
>>>  ...
>>>  essentially roundrobin scheduling under loads that generate lots
>>>  of interrupts.  Interactivity is still poor because makeworld
>>>  sometimes generates a few hundred processes per second and cycling
>>>  through that many takes a long time even with a tiny quantum.
>
> makeworld actually generates remarkably few interrupts when run on
> disk file systems (an average of only about 30 non-clock interrupts per
> second in my config).
>
>>> - reducing kern.sched.quantum never had much effect.  Same for
>>>  increasing HZ in -current with 4BSD.
>
> Bruce
>

