kern/76972: 64-bit integer overflow computing user cpu time in calcru() in kern_resource.c

Wed Feb 2 01:10:21 PST 2005

The following reply was made to PR kern/76972; it has been noted by GNATS.

From: Bruce Evans <bde at zeta.org.au>
To: Chris Landauer <cal at aero.org>
Cc: FreeBSD-gnats-submit at FreeBSD.org, freebsd-bugs at FreeBSD.org
Subject: Re: kern/76972: 64-bit integer overflow computing user cpu time in
 calcru() in kern_resource.c
Date: Wed, 2 Feb 2005 20:02:23 +1100 (EST)

 On Tue, 1 Feb 2005, Chris Landauer wrote:

 > >Description:
 > in FreeBSD 5.1-RELEASE (and 5.2.1R, and 5.3R, and 4.10R), in file
 > /usr/src/sys/kern/kern_resource.c, lines 657-750 define a function calcru()
 > ("calculate resource usage"), and there is a bug in that function - line 713
 > in that file is
 >
 > 	uu = (tu * ut) / tt;
 >
 > which is trying to compute user cpu time (uu) as a proportion (since ut<=tt)
 > of total cpu time (tu), based on the fraction of statclock ticks (ut / tt)
 > that occur in user time - the problem is that for my application, the values
 > of tt and ut (which are counts of statclock ticks) are very large and the
 > value of the intermediate product overflows, even with 64-bit arithmetic,
 > leading to incorrect values (i reported the symptoms to freebsd-hackers and a
 > very helpful description and localization of the problem to calcru() was
 > provided by peter jeremy, who also wrote that it is a very old problem, but
 > only partly known because it is so rarely encountered)

 PR 17842 is about this too.  It has the 105 hours magic number but otherwise
 less detail, so I superseded it by this PR.

 > >Fix:
 > <mostly>
 > ...
 > this problem can be fixed (more completely) by replacing those same lines with
 > the following lines:
 >
 > 	/* Subdivide tu. be careful of overflow */
 > 	/* we know 0 <= st, it, ut <= tt and 1 <= tt = st+it+ut */
 > #define TOOBIG 48592008 /* 2^35 * sqrt(2) / 10^3 */
 > 	if (tt >= TOOBIG && tu >= tt)
 > 		{
 > 		u_int64_t q, r;
 > 		q = tu / tt;
 > 		r = tu % tt;
 > 		su = (q * st) + (r * st) / tt;
 > 		iu = (q * it) + (r * it) / tt;
 > 		}
 > 	else	{
 > 		su = (tu * st) / tt;
 > 		iu = (tu * it) / tt;
 > 		}
 > 	uu = tu - (su + iu);

 Seems basically OK.

 > the constant TOOBIG is designed so the calcru() function does not use the
 > complicated form of the equations when the numbers are guaranteed to be small
 > enough not to overflow in the intermediate products

 I think you should use 2^32 for TOOBIG so that the only assumption on
 stathz is that it is <= 10^6.  2^32 microseconds is a bit over an hour
 and most programs won't run that long.  The tu >= tt check is automatically
 satisfied if stathz <= 10^6 (less epsilon?), but it is reasonable to check
 that both terms of the multiplication are < 2^32.  You can even make
 the fix have negative inefficiency for the usual case on 32-bit machines
 by casting the terms down to uint32_t when they are < 2^32.  For i386's
 in particular, 32x32->64 bit multiplication is a single instruction and
 gcc would use it if the terms were uint32_t.

 > ...
 > my assumptions for the analysis are as follows (if any of them is incorrect,
 > please tell me):
 > 	tu is total time in microseconds
 > 	tt is total 128hz ticks (tt = ut + st + it)
 > 	st is system 128hz ticks
 > 	it is interrupt 128hz ticks
 > 	ut is user 128hz ticks
 > i assume that all of these counters are non-negative and increasing throughout
 > the life of the process -
 > i assume therefore that (i use "~" to mean "very roughly near")
 > 	tu ~ tt * 10^6/128
 > strictly because of the clock rates (machines with other kinds of clock rate
 > ratios will need a different analysis, but the same kind of idea should work
 > there, too, possibly with different constant values for TOOBIG) -

 Seems OK.  stathz is machine-dependent -- better avoid tricky assumptions
 about it (see above).  It seems to be the same as hz on sparc64's, so it
 might be much larger than 128 (however, the scheduler depends on it being
 about 128).

 > for long programs (that is, programs that take more than 388 days or so of
 > either system or interrupt cpu time), this change may not help, and will
 > certainly not help after long enough, as shown in the analysis

 Hmm, 388 days is 2^32 statclock ticks at 128Hz and the limit of 2^32
 is sort of new.  BTW, I've always thought it stupid that it is stupid for
 the tick counters to be 64 bits.  Values larger than 2^32 in them never
 came close to actually working, since overflow occurs in the multiplication
 by a factor of (388 days) / (105 hours) sooner than 32-bit counters would
 overflow.  The counters are only (?) used by calcru(), so their 64-bitness
 is not useful anywhere.

 Bruce