bug in calcru() in kernel: integer overflow computing user time
Doug Ambrisko
ambrisko at ambrisko.com
Wed Jan 26 11:11:10 PST 2005
Chris Landauer writes:
|
| hihi, all -
|
| well, i have an "almost" fix for the problem - read on, ...
| (this is for discussin before send-pr submission)
|
| description
|
| in FreeBSD 5.3-RELEASE (and 5.2.1R, and 5.1R),
| in file /usr/src/sys/kern/kern_resource.c,
| lines 657-750 define a function calcru(),
| and there is a bug in that function -
|
| line 713 in that file is
|
| uu = (tu * ut) / tt,
|
| which is trying to compute a proportion (since ut<=tt) - the problem
| is that for my application, the values of tt and ut are very large and
| the value of the intermediate product overflows, leading to incorrect
| values (i reported the symptoms to freebsd-hackers and a very helpful
| description and localization of the problem to calcru() was provided
| by peter jeremy, who also wrote that it is a very old, but only partly
| known, problem)
|
| repetition
|
| use time in csh or /usr/bin/time or getrusage() to time any program
| that takes more than a week to run
|
| fix (almost)
|
| it turns out that this problem can be (partly) fixed by replacing that
| one line above with the following lines:
|
| /* we know 0 <= ut <= tt and 1 <= tt */
| if (tu >= tt)
| {
| **whatever type they need to be** q, r;
| q = tu / tt;
| r = tu % tt;
| uu = (q * ut) + (r * ut) / tt;
| }
| else uu = (tu * ut) / tt
|
| this change does not solve all the arithmetic problems (especially
| when ut is very large), and a similar change is needed for system
| time, computing su in line 714 of the file, but it suffices for me -
|
| analysis (yup, proof that it should work 8-)) -
|
| i expect that all of these counters are increasing throughout the life
| of the process -
| tu is total time in microseconds
| ut is user 128hz ticks
| tt is total 128hz ticks
| i assume therefore that
| tu ~ tt * 10^6/128
| strictly because of the clock rates
| (machines with other kinds of clock rate ratios will need a different
| analysis, but the same idea should work there, too) -
|
| in my case ut ~ tt, so we see overflow in the old computation when
| tu * ut >= 2^64
| tt^2 * 10^6/128 >= 2^64
| tt * 10^3/8*sqrt(2) >= 2^32 ~ 4 * 10^9
| tt >= 32*sqrt(2)*10^6,
| which is about
| sqrt(2)*10^6 / 4 ~ 3.54*10^5 seconds,
| or
| ~ 98 hours
| (this is the phenomenon i observed)
|
| in the new computation offered above, since we know that
| ut <= tt,
| we also know that
| uu <= tu,
| and
| (q * ut) <= uu,
| so there can be no overflow in the (q * ut) term -
| therefore, this changed code will overflow only when r is large
| r ~ tt,
| and then only when
| r * ut >= 2^64
| tt^2 >= 2^64
| tt >= 2^32,
| which is about
| ~ 2^25 seconds
| ~ 9300 hours
| ~ 388 days
| (i can live with that for now)
|
| for everyone else, it adds the one test in the if statement to every
| call to calcru() (or two, if system time is similarly fixed) -
| <philosophy> instrumentation is costly, and correct instrumentation is
| even more costly, but it's worth it, every time, to get the right
| answer </philosophy>
|
| i'm about to try it out this week - i will report the results when i
| get some data (which will be a few weeks)
|
| i'm thinking about how to solve the problem correctly for really
| long-running programs, but it's all pretty special case ugly so far
This is my fix:
Index: kern_resource.c
===================================================================
RCS file: /usr/local/cvsroot/freebsd/src/sys/kern/kern_resource.c,v
retrieving revision 1.55.2.5
diff -u -p -r1.55.2.5 kern_resource.c
--- kern_resource.c 3 Nov 2001 01:41:08 -0000 1.55.2.5
+++ kern_resource.c 26 Jan 2005 19:03:27 -0000
@@ -552,10 +552,10 @@ calcru(p, up, sp, ip)
tu = ptu;
}
- /* Subdivide tu. */
- uu = (tu * ut) / tt;
- su = (tu * st) / tt;
- iu = tu - uu - su;
+ /* Subdivide tu. try to becareful of overflow */
+ su = tu * (st * 1024 / tt) / 1024;
+ iu = tu * (it * 1024 / tt) / 1024;
+ uu = tu - (su + iu);
/* Enforce monotonicity. */
if (uu < p->p_uu || su < p->p_su || iu < p->p_iu) {
If people like it I can commit it. It works for us. We had problems with
this as well. It's pretty simple fix. I used 1k since usage of this
tends to be % so rounding should effect that much.
Doug A.
More information about the freebsd-hackers
mailing list