kern/76972: 64-bit integer overflow computing user cpu time in calcru() in kern_resource.c

Tue Feb 1 09:30:19 PST 2005

>Number:         76972
>Category:       kern
>Synopsis:       64-bit integer overflow computing user cpu time in calcru() in kern_resource.c
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Feb 01 17:30:17 GMT 2005
>Closed-Date:
>Last-Modified:
>Originator:     Chris Landauer
>Release:        FreeBSD 4.x <and 5.x, and all other i386>
>Organization:
aerospace integration science center, the aerospace corporation
>Environment:
System: FreeBSD all i386 <at least; probly on all others, too>

>Description:

the line numbers below are for 5.1RELEASE - for other releases, only the line
numbers are different - the code is the same, so the problem is the same (i've
observed it on 5.1R, 5.2.1R, and 5.3R systems, seen it in the code on 4.10R
systems, and i have been told that the same problem occurs on all FreeBSD
systems since BSD4.4-Lite (!))

in FreeBSD 5.1-RELEASE (and 5.2.1R, and 5.3R, and 4.10R), in file
/usr/src/sys/kern/kern_resource.c, lines 657-750 define a function calcru()
("calculate resource usage"), and there is a bug in that function - line 713
in that file is

	uu = (tu * ut) / tt;

which is trying to compute user cpu time (uu) as a proportion (since ut<=tt)
of total cpu time (tu), based on the fraction of statclock ticks (ut / tt)
that occur in user time - the problem is that for my application, the values
of tt and ut (which are counts of statclock ticks) are very large and the
value of the intermediate product overflows, even with 64-bit arithmetic,
leading to incorrect values (i reported the symptoms to freebsd-hackers and a
very helpful description and localization of the problem to calcru() was
provided by peter jeremy, who also wrote that it is a very old problem, but
only partly known because it is so rarely encountered)

>How-To-Repeat:

use time in csh or /usr/bin/time or getrusage() to time any program that takes
more than a week or so of user or system cpu time to run (they all exhibit the
same error of "losing" the user or system time, whichever is larger, and
therefore incorrectly reporting the utilization percentage)

>Fix:
<mostly>

it turns out that this problem can be (partly) fixed easily, by simply
re-ordering the computations (this idea is due to Doug Ambrisko), replacing
lines 711-714 (sorry i don't speak "patch" very well yet):

	/* Subdivide tu. */
	uu = (tu * ut) / tt;
	su = (tu * st) / tt;
	iu = tu - uu - su;

of the file with the following lines (tt is defined to be ut+st+it earlier in
the file, so we want to insist that tu = uu+su+iu):

	/* Subdivide tu. */
	su = (tu * st) / tt;
	iu = (tu * it) / tt;
	uu = tu - (su + iu);

since st and it are expected to be much smaller than ut for most long
programs, the overflow likelihood is greatly reduced - but i think that we can
actually do much better - see the full mathematical analysis at

	http://www.cs.umd.edu/~cal/math/overflow-analysis.txt

which also explains the value of the constant TOOBIG (yup, there is a proof
there that shows how well the suggested fix *should* work 8-))

this problem can be fixed (more completely) by replacing those same lines with
the following lines:

	/* Subdivide tu. be careful of overflow */
	/* we know 0 <= st, it, ut <= tt and 1 <= tt = st+it+ut */
#define TOOBIG 48592008 /* 2^35 * sqrt(2) / 10^3 */
	if (tt >= TOOBIG && tu >= tt)
		{
		u_int64_t q, r;
		q = tu / tt;
		r = tu % tt;
		su = (q * st) + (r * st) / tt;
		iu = (q * it) + (r * it) / tt;
		}
	else	{
		su = (tu * st) / tt;
		iu = (tu * it) / tt;
		}
	uu = tu - (su + iu);

the constant TOOBIG is designed so the calcru() function does not use the
complicated form of the equations when the numbers are guaranteed to be small
enough not to overflow in the intermediate products

this change does not solve all the arithmetic problems (especially when either
st or it is very large), but this much suffices for my needs -

my assumptions for the analysis are as follows (if any of them is incorrect,
please tell me):
	tu is total time in microseconds
	tt is total 128hz ticks (tt = ut + st + it)
	st is system 128hz ticks
	it is interrupt 128hz ticks
	ut is user 128hz ticks
i assume that all of these counters are non-negative and increasing throughout
the life of the process -
i assume therefore that (i use "~" to mean "very roughly near")
	tu ~ tt * 10^6/128
strictly because of the clock rates (machines with other kinds of clock rate
ratios will need a different analysis, but the same kind of idea should work
there, too, possibly with different constant values for TOOBIG) -

in my case ut ~ tt (i have utilizations for this particular program
usually well above 90%, often 99%, for weeks at a time), so we see
overflow in the old computation when
	tu * ut >= 2^64,
which happens when tt is about
		~ 3.8*10^5 seconds
		~ 105.45 hours
this is the phenomenon i observed

visible effect of making the change

for long programs (that is, programs that take more than 388 days or so of
either system or interrupt cpu time), this change may not help, and will
certainly not help after long enough, as shown in the analysis

for medium length programs (that is, programs that take more than 105 hours or
so of either system or interrupt cpu time, but less than 388 days or so each
of system and interrupt cpu time), this change adds both tests from the if
statement to every call to calcru(), and does in fact fix the problem

for short programs (that is, programs that take less than 105 hours or so each
of system and interrupt cpu time), it adds the first test in the if statement
to every call to calcru(), and does not otherwise affect the correctness of
the computation

<philosophy> instrumentation is costly, and correct instrumentation is even
more costly, but it's worth it, every time, to get the right answer
</philosophy> (unless, of course, you don't want or need it 8-))

both of these changes are being tested (05jan26), and both of them pass the
first error threshold - the experimental results and analysis can be found on
the web

	http://www.cs.umd.edu/~cal/math/overflow-analysis.txt

more later,
cal

Dr. Christopher Landauer
Aerospace Integration Science Center
The Aerospace Corporation, Mail Stop M6/214
P.O.Box 92957
Los Angeles, California 90009-2957, USA
cal at aero.org
+1 (310) 336-1361
>Release-Note:
>Audit-Trail:
>Unformatted: