git: 3d9d64aa1846 - main - kern_tc: unify timecounter to bintime delta conversion

From: Andriy Gapon <avg_at_FreeBSD.org>
Date: Tue, 30 Nov 2021 13:34:19 UTC
The branch main has been updated by avg:

URL: https://cgit.FreeBSD.org/src/commit/?id=3d9d64aa1846217eac9229f8cba5cb6646a688b7

commit 3d9d64aa1846217eac9229f8cba5cb6646a688b7
Author:     Andriy Gapon <avg@FreeBSD.org>
AuthorDate: 2021-11-30 13:23:23 +0000
Commit:     Andriy Gapon <avg@FreeBSD.org>
CommitDate: 2021-11-30 13:23:23 +0000

    kern_tc: unify timecounter to bintime delta conversion
    
    There are two places where we convert from a timecounter delta to
    a bintime delta: tc_windup and bintime_off.
    Both functions use the same calculations when the timecounter delta is
    small.  But for a large delta (greater than approximately an equivalent
    of 1 second) the calculations were different.  Both functions use
    approximate calculations based on th_scale that avoid division.  Both
    produce values slightly greater than a true value, calculated with
    division by tc_frequency, would be.  tc_windup is slightly more
    accurate, so its result is closer to the true value and, thus, smaller
    than bintime_off result.
    
    As a consequence there can be a jump back in time when time hands are
    switched after a long period of time (a large delta).  Just before the
    switch the time would be calculated with a large delta from
    th_offset_count in bintime_off.  tc_windup does the switch using its own
    calculations of a new th_offset using the large delta.  As explained
    earlier, the new th_offset may end up being less than the previously
    produced binuptime.  So, for a period of time new binuptime values may
    be "back in time" comparing to values just before the switch.
    
    Such a jump must never happen.  All the code assumes that the uptime is
    monotonically nondecreasing and some code works incorrectly when that
    assumption is broken.  For example, we have observed sleepq_timeout()
    ignoring a timeout when the sbinuptime value obtained by the callout
    code was greater than the expiration value, but the sbinuptime obtained
    in sleepq_timeout() was less than it.  In that case the target thread
    would never get woken up.
    
    The unified calculations should ensure the monotonic property of the
    uptime.
    
    The problem is quite rare as normally tc_windup should be called HZ
    times per second (typically 1000 or 100).  But it may happen in VMs on
    very busy hypervisors where a VM's virtual CPU may not get an execution
    time slot for a second or more.
    
    Reviewed by:    kib
    MFC after:      2 weeks
    Sponsored by:   Panzura LLC
---
 sys/kern/kern_tc.c | 42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index f041c5f547e0..01b4e8656090 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -213,6 +213,23 @@ tc_delta(struct timehands *th)
 	    tc->tc_counter_mask);
 }
 
+static __inline void
+bintime_add_tc_delta(struct bintime *bt, uint64_t scale,
+    uint64_t large_delta, uint64_t delta)
+{
+	uint64_t x;
+
+	if (__predict_false(delta >= large_delta)) {
+		/* Avoid overflow for scale * delta. */
+		x = (scale >> 32) * delta;
+		bt->sec += x >> 32;
+		bintime_addx(bt, x << 32);
+		bintime_addx(bt, (scale & 0xffffffff) * delta);
+	} else {
+		bintime_addx(bt, scale * delta);
+	}
+}
+
 /*
  * Functions for reading the time.  We have to loop until we are sure that
  * the timehands that we operated on was not updated under our feet.  See
@@ -224,7 +241,7 @@ bintime_off(struct bintime *bt, u_int off)
 {
 	struct timehands *th;
 	struct bintime *btp;
-	uint64_t scale, x;
+	uint64_t scale;
 	u_int delta, gen, large_delta;
 
 	do {
@@ -238,15 +255,7 @@ bintime_off(struct bintime *bt, u_int off)
 		atomic_thread_fence_acq();
 	} while (gen == 0 || gen != th->th_generation);
 
-	if (__predict_false(delta >= large_delta)) {
-		/* Avoid overflow for scale * delta. */
-		x = (scale >> 32) * delta;
-		bt->sec += x >> 32;
-		bintime_addx(bt, x << 32);
-		bintime_addx(bt, (scale & 0xffffffff) * delta);
-	} else {
-		bintime_addx(bt, scale * delta);
-	}
+	bintime_add_tc_delta(bt, scale, large_delta, delta);
 }
 #define	GETTHBINTIME(dst, member)					\
 do {									\
@@ -1392,17 +1401,8 @@ tc_windup(struct bintime *new_boottimebin)
 #endif
 	th->th_offset_count += delta;
 	th->th_offset_count &= th->th_counter->tc_counter_mask;
-	while (delta > th->th_counter->tc_frequency) {
-		/* Eat complete unadjusted seconds. */
-		delta -= th->th_counter->tc_frequency;
-		th->th_offset.sec++;
-	}
-	if ((delta > th->th_counter->tc_frequency / 2) &&
-	    (th->th_scale * delta < ((uint64_t)1 << 63))) {
-		/* The product th_scale * delta just barely overflows. */
-		th->th_offset.sec++;
-	}
-	bintime_addx(&th->th_offset, th->th_scale * delta);
+	bintime_add_tc_delta(&th->th_offset, th->th_scale,
+	    th->th_large_delta, delta);
 
 	/*
 	 * Hardware latching timecounters may not generate interrupts on