bug in calcru() in kernel: integer overflow computing user time

Doug Ambrisko ambrisko at ambrisko.com
Wed Jan 26 11:11:10 PST 2005


Chris Landauer writes:
| 
| hihi, all -
| 
| well, i have an "almost" fix for the problem - read on, ...
| (this is for discussin before send-pr submission)
| 
| description
| 
| 	in FreeBSD 5.3-RELEASE (and 5.2.1R, and 5.1R),
| 	in file /usr/src/sys/kern/kern_resource.c,
| 	lines 657-750 define a function calcru(),
| 	and there is a bug in that function -
| 
| 	line 713 in that file is
| 
| 		 uu = (tu * ut) / tt,
| 
| 	which is trying to compute a proportion (since ut<=tt) - the problem
| 	is that for my application, the values of tt and ut are very large and
| 	the value of the intermediate product overflows, leading to incorrect
| 	values (i reported the symptoms to freebsd-hackers and a very helpful
| 	description and localization of the problem to calcru() was provided
| 	by peter jeremy, who also wrote that it is a very old, but only partly
| 	known, problem)
| 
| repetition
| 
| 	use time in csh or /usr/bin/time or getrusage() to time any program
| 	that takes more than a week to run
| 
| fix (almost)
| 
| 	it turns out that this problem can be (partly) fixed by replacing that
| 	one line above with the following lines:
| 
| 	/* we know 0 <= ut <= tt and 1 <= tt */
| 	if (tu >= tt)
| 		{
| 		**whatever type they need to be** q, r;
| 		q = tu / tt;
| 		r = tu % tt;
| 		uu = (q * ut) + (r * ut) / tt;
| 		}
| 	else	uu = (tu * ut) / tt
| 
| 	this change does not solve all the arithmetic problems (especially
| 	when ut is very large), and a similar change is needed for system
| 	time, computing su in line 714 of the file, but it suffices for me -
| 
| 	analysis (yup, proof that it should work 8-)) -
| 
| 	i expect that all of these counters are increasing throughout the life
| 	of the process -
| 		tu is total time in microseconds
| 		ut is user 128hz ticks
| 		tt is total 128hz ticks
| 	i assume therefore that
| 		tu ~ tt * 10^6/128
| 	strictly because of the clock rates
| 	(machines with other kinds of clock rate ratios will need a different
| 	analysis, but the same idea should work there, too) -
| 
| 	in my case ut ~ tt, so we see overflow in the old computation when
| 		tu * ut >= 2^64
| 		tt^2 * 10^6/128 >= 2^64
| 		tt * 10^3/8*sqrt(2) >= 2^32 ~ 4 * 10^9
| 		tt >= 32*sqrt(2)*10^6,
| 	which is about
| 		sqrt(2)*10^6 / 4 ~ 3.54*10^5 seconds,
| 	or
| 			~ 98 hours
| 	(this is the phenomenon i observed)
| 
| 	in the new computation offered above, since we know that
| 		ut <= tt,
| 	we also know that
| 		uu <= tu,
| 	and
| 		(q * ut) <= uu,
| 	so there can be no overflow in the (q * ut) term -
| 	therefore, this changed code will overflow only when r is large
| 		r ~ tt,
| 	and then only when
| 		r * ut >= 2^64
| 		tt^2 >= 2^64
| 		tt >= 2^32,
| 	which is about
| 			~ 2^25 seconds
| 			~ 9300 hours
| 			~ 388 days
| 	(i can live with that for now)
| 
| 	for everyone else, it adds the one test in the if statement to every
| 	call to calcru() (or two, if system time is similarly fixed) -
| 	<philosophy> instrumentation is costly, and correct instrumentation is
| 	even more costly, but it's worth it, every time, to get the right
| 	answer </philosophy>
| 
| 	i'm about to try it out this week - i will report the results when i
| 	get some data (which will be a few weeks)
| 
| 	i'm thinking about how to solve the problem correctly for really
| 	long-running programs, but it's all pretty special case ugly so far

This is my fix:

Index: kern_resource.c
===================================================================
RCS file: /usr/local/cvsroot/freebsd/src/sys/kern/kern_resource.c,v
retrieving revision 1.55.2.5
diff -u -p -r1.55.2.5 kern_resource.c
--- kern_resource.c	3 Nov 2001 01:41:08 -0000	1.55.2.5
+++ kern_resource.c	26 Jan 2005 19:03:27 -0000
@@ -552,10 +552,10 @@ calcru(p, up, sp, ip)
 		tu = ptu;
 	}
 
-	/* Subdivide tu. */
-	uu = (tu * ut) / tt;
-	su = (tu * st) / tt;
-	iu = tu - uu - su;
+	/* Subdivide tu. try to becareful of overflow */
+	su = tu * (st * 1024 / tt) / 1024;
+	iu = tu * (it * 1024 / tt) / 1024;
+	uu = tu - (su + iu);
 
 	/* Enforce monotonicity. */
 	if (uu < p->p_uu || su < p->p_su || iu < p->p_iu) {

If people like it I can commit it.  It works for us.  We had problems with
this as well.  It's pretty simple fix.  I used 1k since usage of this
tends to be % so rounding should effect that much.

Doug A.


More information about the freebsd-hackers mailing list