svn commit: r208585 - head/sys/mips/mips

Thu May 27 12:22:54 UTC 2010

On Thu, 27 May 2010, Alexander Motin wrote:

> Neel Natu wrote:
>> However it is not immediately obvious why we prefer to run the
>> statistics timer at (or very close to) 128Hz. Any pointers?
>
> I haven't looked myself, but sources report that some legacy code depend
> on it:
> http://lists.freebsd.org/pipermail/freebsd-arch/2009-December/009731.html

That's a good reference for newer scheduler problems.  The following from
cvs history is better for the 128:

% RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v

History in sched_4bsd.c was broken by not repo-copying.

% Working file: kern_synch.c
% head: 1.249
% ...
% ----------------------------
% revision 1.83
% date: 1999/11/28 12:12:13;  author: bde;  state: Exp;  lines: +11 -13
% Scheduler fixes equivalent to the ones logged in the following NetBSD
% commit to kern_synch.c:
% 
%   ----------------------------
%   revision 1.55
%   date: 1999/02/23 02:56:03;  author: ross;  state: Exp;  lines: +39 -10
%   Scheduler bug fixes and reorganization
%   * fix the ancient nice(1) bug, where nice +20 processes incorrectly
%     steal 10 - 20% of the CPU, (or even more depending on load average)
%   * provide a new schedclk() mechanism at a new clock at schedhz, so high
%     platform hz values don't cause nice +0 processes to look like they are
%     niced
%   * change the algorithm slightly, and reorganize the code a lot
%   * fix percent-CPU calculation bugs, and eliminate some no-op code
% 
%   === nice bug === Correctly divide the scheduler queues between niced and
%   compute-bound processes. The current nice weight of two (sort of, see

2 or 4 was the historical value.

%   `algorithm change' below) neatly divides the USRPRI queues in half; this
%   should have been used to clip p_estcpu, instead of UCHAR_MAX.  Besides
%   being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
%   and it was done after decay_cpu() which can only _reduce_ the value.  It
%   has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
%   scheduler-penalize themselves onto the same queue as nice +20 processes.
%   (Or even a higher one.)
% 
%   === New schedclk() mechanism === Some platforms should be cutting down
%   stathz before hitting the scheduler, since the scheduler algorithm only
%   works right in the vicinity of 64 Hz. Rather than prescale hz, then scale

The historical value was probably 60.

%   back and forth by 4 every time p_estcpu is touched (each occurrence an
%   abstraction violation), use p_estcpu without scaling and require schedhz
%   to be generated directly at the right frequency. Use a default stathz (well,
%   actually, profhz) / 4, so nothing changes unless a platform defines schedhz
%   and a new clock.  Define these for alpha, where hz==1024, and nice was
%   totally broke.
% 
%   === Algorithm change === The nice value used to be added to the
%   exponentially-decayed scheduler history value p_estcpu, in _addition_ to
%   being incorporated directly (with greater weight) into the priority calculation.
%   At first glance, it appears to be a pointless increase of 1/8 the nice

Perhaps I am confused about where the above factor of 2 or 4 was, and the 8
came directly from this 1/8.  Anyway, the final version attempts to fold
the factors together if possible.

%   effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
%   because it will ramp up linearly but be decayed only exponentially, thus
%   converging to an additional .75 nice for a loadaverage of one. I killed
%   this, it makes the behavior hard to control, almost impossible to analyze,
%   and the effect (~~nothing at all for the first second, then somewhat increased
%   niceness after three seconds or more, depending on load average) pointless.
% 
%   === Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
%   Collect scheduler functionality. Try to put each abstraction in just one
%   place.
%   ----------------------------
% 
% The details are a little different in FreeBSD:
% 
% === nice bug ===   Fixing this is the main point of this commit.  We use
% essentially the same clipping rule as NetBSD (our limit on p_estcpu
% differs by a scale factor).  However, clipping at all is fundamentally
% bad.  It gives free CPU to the hoggiest hogs once they reach the limit, and
% reaching the limit is normal for long-running hogs.  This will be fixed
% later.
% 
% === New schedclk() mechanism ===  We don't use the NetBSD schedclk()
% (now schedclock()) mechanism.  We require (real)stathz to be about 128
% and scale by an extra factor of 2 compared with NetBSD's statclock().

Later another factor of 2 was added, giving a factor of 8.

Later still, another factor of smp_ncpus was added.

These factors reduce overflow/clamping.

% We scale p_estcpu instead of scaling the clock.  This is more accurate
% and flexible.
% 
% === Algorithm change ===  Same change.
% 
% === Other bugs ===  The p_pctcpu bug was fixed long ago.  We don't try as
% hard to abstract functionality yet.
% 
% Related changes: the new limit on p_estcpu must be exported to kern_exit.c
% for clipping in wait1().
% 
% Agreed with by:		dufault
% ----------------------------
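
To make the clipping rule in the quoted log concrete, here is a rough
sketch of it in Python.  It is a simplified model, not the kernel code:
NICE_WEIGHT = 2 is taken from the log ("current nice weight of two"),
while PRIO_MAX = 20, PPQ = 4 and the queue_index() helper are assumptions
made for illustration, and p_estcpu is used unscaled as in the NetBSD
variant.

```python
NICE_WEIGHT = 2    # from the quoted log: "current nice weight of two"
PRIO_MAX = 20      # assumption: largest nice value
PPQ = 4            # assumption: priorities per run queue
UCHAR_MAX = 255    # the old, ineffective clip value

# Limit named in the log: NICE_WEIGHT * PRIO_MAX - PPQ
ESTCPU_LIMIT = NICE_WEIGHT * PRIO_MAX - PPQ


def queue_index(estcpu, nice, limit):
    """Hypothetical run-queue index; a larger index means a worse priority."""
    clipped = min(estcpu, limit)
    return (clipped + NICE_WEIGHT * nice) // PPQ


# A long-running hog at nice 0 versus an idle process at nice +20.
hog_clipped = queue_index(255, 0, ESTCPU_LIMIT)
hog_unclipped = queue_index(255, 0, UCHAR_MAX)
niced_idle = queue_index(0, 20, ESTCPU_LIMIT)

print(ESTCPU_LIMIT, hog_clipped, hog_unclipped, niced_idle)
```

In this model the clipped hog stays one queue better than the idle
nice +20 process, while clipping at UCHAR_MAX lets it penalize itself far
past that queue, which is the failure the log describes.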

% > In any case it should not be equal to hz whenever possible.

More precisely, stathz should not be a divisor of hz.

I think that requirement is mostly a hack that helps with independent
hardware clocks.  If the clocks had identical frequencies, then they
would be mostly out of sync, but occasionally they would get in sync,
and then their identical frequencies would keep them in sync for a
long time determined by how closely matched and stable their
frequencies are.

Using different frequencies significantly reduces the frequency of
perfect synchronization -- once the clocks are in perfect sync,
their next interrupts are at times separated by (1/hz - 1/stathz).
These times are still too predictable, but the difference is far
from 0.
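
As a rough illustration of the above, the following Python sketch counts
how often two ideal periodic clocks fire at exactly the same instant.  It
assumes perfectly stable clocks that start in sync at t = 0, which real
independent hardware clocks are not; the function names and the example
frequencies (hz = 1000 with stathz = 100, 127, 128 or 1000) are chosen
only for illustration.

```python
from fractions import Fraction
from math import gcd


def coincidences_per_second(hz, stathz):
    """Count instants in [0, 1 s) at which both ideal clocks tick together."""
    hard = {Fraction(i, hz) for i in range(hz)}
    stat = {Fraction(j, stathz) for j in range(stathz)}
    return len(hard & stat)


def first_gap_after_sync(hz, stathz):
    """Magnitude of (1/hz - 1/stathz): spacing of the next two ticks after sync."""
    return abs(Fraction(1, hz) - Fraction(1, stathz))


for stathz in (1000, 100, 128, 127):
    n = coincidences_per_second(1000, stathz)
    gap = first_gap_after_sync(1000, stathz)
    print(stathz, n, float(gap))
```

In this idealized model the number of coincidences per second is
gcd(hz, stathz), so a stathz that divides hz makes every statclock tick
coincide with a hardclock tick, while 128 against 1000 coincides only 8
times per second.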

With a single higher frequency clock divided down into various sub-clocks,
it can be arranged that the differences for the sub-clocks are even
further from 0.  The times of all the pseudo-interrupts for all the
sub-clocks would be even more predictable, but I think this is not a problem
iff stathz is much larger than hz.  (hz much larger than stathz is a
problem even with an independent aperiodic statclock.  Then it is easy
for a malicious program to observe statclock activity using timeouts
at the much larger hardclock frequency, and possible to predict future
statclock activity since "hz much larger than stathz" means that any
randomness in statclock is not so large as to significantly change the
average time until the next statclock interrupt.)

Bruce