[Bug 236096] top shows WCPU numbers greater than 100 percent when using SCHED_BSD

Thu Feb 28 05:27:38 UTC 2019

On Thu, 28 Feb 2019 a bug that doesn't want replies at freebsd.org wrote:

> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236096
>
> After switching from SCHED_ULE to SCHED_4BSD I immediately noticed that top

Congratulations on the switch.  SCHED_ULE is slightly better, but I use my
version of SCHED_4BSD and have fixed it to work slightly better than SCHED_ULE
in cases that I care about.  Scheduling is unimportant in most cases, since
under light loads on SMP systems it is easy to find a spare CPU and under
heavy loads it is impossible to find a spare CPU and hard to do better than
choose a non-spare one at random.

> displays wildly inaccurate numbers for WCPU.  If you switch the display to
> un-weighted CPU the numbers are mostly right (rarely you'll see a 101% type
> number).  This is on a 6-core/12-thread system running make universe:

This was broken almost 5 years ago in r266906.  r267685 is supposed to
fix the percentages going over 100%, but the percentages are still
garbage for 4BSD and not too good for ULE and often go over 100% for at
least 4BSD.

> last pid: 93675;  load averages: 11.55, 12.01, 11.34    up 0+09:47:22  18:07:54
> 277 processes: 12 running, 265 sleeping
> CPU:  0.3% user, 71.7% nice,  5.8% system,  0.1% interrupt, 22.2% idle
> Mem: 2727M Active, 6622M Inact, 144M Laundry, 1978M Wired, 1174M Buf, 533M Free
> Swap: 32G Total, 32G Free
>
>  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
> 93198 ilepore       1  95   20   101M    71M CPU6     6   0:02 1669.79% cc
> 93380 ilepore       1  95   20    80M    50M RUN      8   0:01 1092.95% cc
> 93366 ilepore       1  95   20    80M    51M CPU2     2   0:01 1090.72% cc
> 93381 ilepore       1  95   20    80M    50M CPU0     0   0:01 1075.67% cc
> 93365 ilepore       1  95   20    80M    50M CPU10   10   0:01 1075.13% cc
> 93367 ilepore       1  95   20    80M    50M CPU5     5   0:01 1031.28% cc
> 93378 ilepore       1  95   20    80M    50M CPU9     9   0:01 1027.59% cc
> 93379 ilepore       1  95   20    80M    50M CPU7     7   0:01 1026.13% cc

The bug is essentially division by 0.  More precisely, it is division by
(1 - exp(k*t)), where k is not Boltzmann's constant (it is log(ccpu)) and
t is time.  This division corrects from raw CPU to weighted CPU.

This is not really valid for ULE, and ULE mis-emulates it by setting
ccpu to 0.  log(0) is -Infinity, so when t is 0 the result is NaN and
the percentage is displayed as something like "nan", but for other t
exp(k*t) is 0 so the divisor is 1 and the conversion is null.  WCPU is
just worse than useless for ULE, since it is the same as CPU except
when it is NaN.  This gave the original bug of WCPU taking a long time
to ramp up to 100% for a thread that uses 100%.

For 4BSD, ccpu is about 0.95 represented as an integer.  This corresponds
to a 95% in 60 seconds decay rate for raw CPU.  This is not broken, but
raw CPU is broken by making it almost the same as the correct WCPU using
a bad method, so that applying the correct conversion makes it start at
nearly infinity and stay above 100% for a long time (for 100% actual use,
about 105% after 1 minute by dividing by 0.95).

Further details for 4BSD: when t is 0, (1 - exp(k*t)) is 0, so there must
be a test somewhere to avoid dividing by this.  This test apparently
doesn't or didn't work for avoiding NaNs for ULE.  I observed the NaNs
but didn't check the code.  A special case for t = 0 would work for both,
but the apparently-more robust check for (1 - exp(k*t)) != 0 fails when
the LHS is NaN.  When t is small, CPU must also be small so that division
by nearly 0 doesn't give much larger percentages than 100%.  E.g, if

The main breakage in r266906 was to ignore the kernel's long-term average
("raw") %CPU (except initially) and use the average over the last top
update interval.  This gives the following observable bugs:
- %CPU is broken for its reason for existence of showing the raw kernel
   %CPU
- ps doesn't have this bug, so %CPU in top is inconsistent with %CPU in ps
- there is little documentation about this, and what there is is misleading.
   ps says that -C gives a "raw" CPU that ignores "resident" time and that
   this normally has no effect.  Actually, the "resident" time has almost
   no effect, but -C gives a very large difference by _not_ dividing by
   (1 - exp(k*t)).  Except for ULE, the details are different.  Then -C
   really does normally have no effect, since k is broken.
- NaNs sometimes.  I might have only seen them for ps.
- garbage %WCPU for 4BSD.  You can almost recover by never using %WCPU.
   Use only %CPU.  It is similar to what %WCPU should be.
- lots of jitter in %CPU and %WCPU.  It is impossible for it to be very
   accurate since it is measured over a short interval.  For a long-lived
   thread taking 100% CPU, the displayed %*CPU is often off by +- a few
   percent, while the kernel's long-term average is stable at nearly 100%
   (a bit below that for 4BSD since 4BSD often reduces it by 5%).
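
The jitter in the last item follows from how a per-interval percentage
is formed.  The following is a minimal sketch rather than top's code; it
assumes that both the runtime and the interval are in microseconds, and
transient_pctcpu() is a hypothetical name:

```c
#include <stdint.h>

/*
 * Transient %CPU over one update interval: the change in accumulated
 * runtime divided by the interval length.  Both are taken to be in
 * microseconds, which is an assumption of this sketch.
 */
double
transient_pctcpu(uint64_t runtime_now, uint64_t runtime_old,
    uint64_t interval_us)
{
	if (interval_us == 0)
		return (0.0);	/* a caller would fall back to the kernel's value */
	return (100.0 * (double)(runtime_now - runtime_old) /
	    (double)interval_us);
}
```

If the two samples are skewed by a hypothetical 20 ms relative to a
nominal 1 second interval, a thread using 100% shows up as 102% or 98%,
and a shorter interval makes the same skew proportionally larger.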

It might be useful to display transients, but this should be on a separate
display named something like %TCPU.  Transients are bad for most purposes,
especially for sorting on %*CPU, since they change the values and/or order
a lot.  The result would be:

- %CPU shows kernel CPU
- %WCPU shows weighted CPU.  This needs to be fixed for ULE.  For ULE, the
   raw CPU is mis-emulated and is more like WCPU, but it doesn't ramp up
   as fast as it should for WCPU, giving the original bug.  A correct
   emulation would emulate 4BSD's raw CPU, with ccpu probably different
   but not 0, but it would be better for ULE's CPU to be fully raw and
   do better conversions to WCPU in userland (don't use the (1 - exp(k*t))
   factor for ULE).  ULE doesn't keep as much history as 4BSD, but it keeps
   some, so raw CPU is small initially when WCPU should be 100%.
- %TCPU shows transient CPU.

The following quick fix is enough for 4BSD:

XX Index: machine.c
XX ===================================================================
XX --- machine.c	(revision 341138)
XX +++ machine.c	(working copy)
XX @@ -661,7 +661,7 @@
XX  {
XX  	const struct kinfo_proc *oldp;
XX 
XX -	if (previous_interval != 0) {
XX +	if (0 && previous_interval != 0) {
XX  		oldp = get_old_proc(pp);
XX  		if (oldp != NULL)
XX  			return ((double)(pp->ki_runtime - oldp->ki_runtime)

This just never uses the transient CPU for any scheduler.  So %WCPU and
%CPU work correctly as in old versions for 4BSD, and %CPU works almost
correctly as in old versions for ULE (it is just not raw enough, and
most users don't understand that it is raw), and %WCPU is broken as in
old versions for ULE (it is just the same as %CPU, so is not what users
should expect).

I asked the author of the bug to fix it about a year ago, and provided
a different long explanation than the above and a less-quick fix:

XX Index: machine.c
XX ===================================================================
XX --- machine.c	(revision 331608)
XX +++ machine.c	(working copy)
XX @@ -89,6 +89,7 @@
XX 
XX  /* define what weighted cpu is.  */
XX  #define weighted_cpu(pct, pp) ((pp)->ki_swtime == 0 ? 0.0 : \
XX +			 sched_ule ? (pct) : \
XX  			 ((pct) / (1.0 - exp((pp)->ki_swtime * logcpu))))
XX 
XX  /* what we consider to be process size: */

Also, don't waste time calculating 1 using exp(-Inf) for the WCPU &&
SCHED_ULE case.

XX @@ -147,6 +148,7 @@
XX  /* these are retrieved from the kernel in _init */
XX 
XX  static load_avg  ccpu;
XX +static int sched_ule;
XX 
XX  /* these are used in the get_ functions */
XX 
XX @@ -331,6 +333,7 @@
XX  	boolean_t carc_en;
XX  	size_t size;
XX  	struct passwd *pw;
XX +	char name[4];
XX 
XX  	size = sizeof(smpmode);
XX  	if ((sysctlbyname("machdep.smp_active", &smpmode, &size,
XX @@ -365,6 +368,10 @@
XX  	if (kd == NULL)
XX  		return (-1);
XX 
XX +	size = sizeof(name);
XX +	sched_ule = (sysctlbyname("kern.sched.name", &name[0], &size,
XX +	    NULL, 0) == 0 && strcmp(name, "ULE") == 0);
XX +
XX  	GETSYSCTL("kern.ccpu", ccpu);
XX 
XX  	/* this is used in calculating WCPU -- calculate it ahead of time */
XX @@ -715,6 +722,13 @@
XX   * If there was a previous update, use the delta in ki_runtime over
XX   * the previous interval to calculate pctcpu.  Otherwise, fall back
XX   * to using the kernel's ki_pctcpu.
XX + *
XX + * XXX: the kernel's ki_pctcpu is the correct one, but we don't know
XX + * how to scale it to WCPU for SCHED_ULE (we used to scale by SCHED_4BSD's
XX + * factor or 1/(1-exp(k*t) where k = log(ccpu) in all cases.  For
XX + * SCHED_ULE ccpu is 0 so k is -infinity and the factor is 1 which
XX + * doesn't do too much damage).  Actually only clobber the kernel's
XX + * value for SCHED_ULE && WCPU.
XX   */
XX  static double
XX  proc_calc_pctcpu(struct kinfo_proc *pp)
XX @@ -721,7 +735,7 @@
XX  {
XX  	const struct kinfo_proc *oldp;
XX 
XX -	if (previous_interval != 0) {
XX +	if (sched_ule && ps.wcpu && previous_interval != 0) {
XX  		oldp = get_old_proc(pp);
XX  		if (oldp != NULL)
XX  			return ((double)(pp->ki_runtime - oldp->ki_runtime)

This doesn't go as far as adding %TCPU or fixing %WCPU properly for ULE,
or fixing ps and other utilities, or fixing the documentation.  top(1)
misdocuments WCPU by saying that it is the weighted CPU and is the same
as ps displays.  It doesn't say what weighting is, or document that this
is scheduler-dependent with different bugs or that the bugs make this not
the same as ps displays.

ps should have an option to display transient %CPU too, and should have
keywords to select cpu, wcpu and tcpu.  It is otherwise more programmable
than top, so can display all of these in narrow displays by omitting
other columns.  Its -C option switches from WCPU (keyword %cpu; column
header %CPU) to CPU without even changing the name in the header.  ps
already supports the confusingly similar keyword "cpu" (column header
CPU).  This is even rawer than actual %CPU, so it shouldn't be in any
standard displays, but it is in "ps l" output while %[W]CPU is not.

CPU is 0 for all processes on freefall now.  That is another bug in ULE.
ULE doesn't use the ts_estcpu variable or logic -- that is 4BSD-specific.
ULE doesn't emulate this either, but always returns 0 in sched_estcpu().

ULE does emulate %CPU.  Its internal state for this is just ts_tick, which
is similar to ts_estcpu in 4BSD (but rawer and averaged over a default of
only 10 seconds instead of retaining 5% after 1 minute for 4BSD).
Returning this in sched_estcpu() would make sense, but users wouldn't
know how to scale it and it doesn't provide any more information than
%CPU, so this would mainly confuse users by putting strange nonzero values
in ps l output.

I now remember trying to fix %CPU for ULE without knowing or wanting to
know much about ULE.  I tried using scaling by (1 - exp(k*t)) to adjust
%CPU down and %WCPU up.  Nothing worked.  The reasons are now clearer.
Since ts_tick is only an average over 10 seconds, exponential scaling
just doesn't apply to it.  But it seems to ramp up much like %CPU in
4BSD.  That must be just that after 1 second out of 10, 100% of CPU is
seen as only 10% CPU.  Userland should see this 10% and scale it to 100%.
But the API (mostly struct kinfo) is still specified for 4BSD, so it
doesn't contain information needed for this scaling.  The kernel may
as well do it.  Then its value of 0 for ccpu would be correct (it means
that %CPU is already scaled).
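
The scaling described here can be sketched as follows.  This is an
illustration only, not ULE's code: times are in seconds as doubles, the
10 second window is the default mentioned above, and both function names
are hypothetical:

```c
/*
 * Percentage of a fixed averaging window spent running.  Averaging
 * over the whole window is what makes a new thread ramp up slowly.
 */
double
windowed_pct(double ran_s, double window_s)
{
	return (100.0 * ran_s / window_s);
}

/*
 * Rescale for a thread that has existed for less than the full window,
 * so that it is judged only against the time it has been alive.
 */
double
rescaled_pct(double ran_s, double alive_s, double window_s)
{
	double span = alive_s < window_s ? alive_s : window_s;

	if (span <= 0.0)
		return (0.0);
	return (100.0 * ran_s / span);
}
```

A thread that has run for all of its 1 second of life gets 10% from
windowed_pct(1.0, 10.0) but 100% from rescaled_pct(1.0, 1.0, 10.0).  The
second form needs the thread's age, which is the kind of information the
paragraph above says the current API does not carry for this purpose.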

Averaging over 10 seconds in the kernel gives lots of jitter too.  It
gives even more jitter by using tick counts instead of precise runtimes.
Yet somehow it seems to give less jitter than top with an interval of
1 or 2.

Try top with an interval of 0 to see enormous jitter.  The actual interval
is slightly larger than 0, so top rarely falls back to the old method to
avoid division by 0.  Instead it sees sharp transients.  top should support
more useful short intervals like 0.1, but it misparses 0.1 as 0.  The
interval of 0 is restricted to root.  This is bogus, since anyone can use
even more resources by execing top in a loop.
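
Accepting fractional intervals is mostly a matter of parsing the
argument as a floating-point number.  An integer parse such as atoi()
turns "0.1" into 0; whether that is what top actually does is an
assumption.  A minimal sketch with a hypothetical helper, parse_delay():

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a delay given in seconds, allowing fractions such as "0.1".
 * Returns 0 on success and -1 on malformed, out-of-range or negative
 * input.  parse_delay() is a hypothetical helper, not top's parser.
 */
int
parse_delay(const char *arg, double *delayp)
{
	char *end;
	double d;

	errno = 0;
	d = strtod(arg, &end);
	if (end == arg || *end != '\0' || errno == ERANGE || d < 0.0)
		return (-1);
	*delayp = d;
	return (0);
}
```

The caller would still have to convert the result to whatever unit its
timer uses and decide separately how small a delay to allow.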

Bruce