powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patch failed]

Konstantin Belousov kostikbel at gmail.com
Sun Mar 3 11:19:45 UTC 2019


On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> 
> > On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
> >> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>> ...
> >>> So I am able to reproduce it with some surprising ease on HPET running
> >>> on Haswell.
> >>
> >> So what is the cause of it?  Maybe the tickless code doesn't generate
> >> fake clock ticks right.  Or it is just a library bug.  The kernel has
> >> to be slightly real-time to satisfy the requirement of 1 update per second.
> >> Applications are further from being real-time.  But isn't it enough
> >> for the kernel to ensure that the timehands cycle more than once per
> >> second?
> > No, I entered ddb as you suggested.
> 
> But using ddb is not normal.  It is convenient that this fixes HPET and
> ACPI timecounters after using ddb, but this method doesn't help for
> timecounters that wrap fast.  TSC-low at 2GHz wraps in 2 seconds, and
> i8254 wraps in a few milliseconds.
> 
> >> I don't like changing this at all.  binuptime() was carefully written
> >> to not need so much 64-bit arithmetic.
> >>
> >> If this pessimization is allowed, then it can also handle 64-bit
> >> deltas.  Using the better kernel method:
> >>
> >>  		if (__predict_false(delta >= th->th_large_delta)) {
> >>  			bt->sec += (scale >> 32) * (delta >> 32);
> >>  			x = (scale >> 32) * (delta & 0xffffffff);
> >>  			bt->sec += x >> 32;
> >>  			bintime_addx(bt, x << 32);
> >>  			x = (scale & 0xffffffff) * (delta >> 32);
> >>  			bt->sec += x >> 32;
> >>  			bintime_addx(bt, x << 32);
> >>  			bintime_addx(bt, (scale & 0xffffffff) *
> >>  			    (delta & 0xffffffff));
> >>  		} else
> >>  			bintime_addx(bt, scale * (delta & 0xffffffff));
> > This only makes sense if delta is extended to uint64_t, which requires
> > the pass over timecounters.
> 
> Yes, that was its point.  It is a bit annoying to have a hardware
> timecounter like the TSC that doesn't wrap naturally, but then make it
> wrap by masking high bits.
> 
> The masking step is also a bit wasteful.  For the TSC, it is 1 step to
> discard the high bits at the register level, then another step to apply
> the mask to discard the high bits again.
rdtsc-low is implemented in the natural way: after RDTSC, no combining
of registers into a 64-bit value is done; instead, shrd operates on
%edx:%eax to put the final result into %eax.  I am not sure what you
refer to.
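
For reference, the x86 userland fast path looks roughly like the sketch
below (modeled on lib/libc/x86/sys/__vdso_gettc.c; the shift count in
th_x86_shift is assumed to come from the shared page):

static u_int
__vdso_rdtsc_low(const struct vdso_timehands *th)
{
	u_int rv;

	/*
	 * RDTSC leaves the counter in %edx:%eax.  SHRD shifts the
	 * low bits of %edx into %eax, so the shifted and truncated
	 * result lands directly in %eax; no 64-bit combining of the
	 * two registers is needed.
	 */
	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
	    : "=a" (rv) : "c" (th->th_x86_shift) : "edx");
	return (rv);
}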

> 
> >> I just noticed that there is a 64 x 32 -> 64 bit multiplication in the
> >> current method.  This can be changed to do explicit 32 x 32 -> 64 bit
> >> multiplications and fix the overflow problem at small extra cost on
> >> 32-bit arches:
> >>
> >>  		/* 32-bit arches did the next multiplication implicitly. */
> >>  		x = (scale >> 32) * delta;
> >>  		/*
> >>  		 * And they did the following shifts and most of the adds
> >>  		 * implicitly too.  Except shifting x left by 32 lost the
> >>  		 * seconds part that the next line handles.  The next line
> >>  		 * is the only extra cost for them.
> >>  		 */
> >>  		bt->sec += x >> 32;
> >>  		bintime_addx(bt, (x << 32) + (scale & 0xffffffff) * delta);
> >
> > Ok, what about the following.
> 
> I'm not sure that I really want this, even if the pessimization is done.
> But it avoids using fls*(), so is especially good for 32-bit systems and
> OK for 64-bit systems too, especially in userland where fls*() is in the
> fast path.
For userland I looked at the generated code, and the BSR usage seems to
be good enough with the default compilation settings for clang.
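
(For comparison, the fls*()-based test that the precomputed
th_large_delta avoids would look something like the sketch below; the
BSR mentioned above is what fls*() compiles to on x86.  A product of an
a-bit and a b-bit number needs at most a + b bits, so the test is
conservative by at most one bit.)

#include <stdint.h>
#include <strings.h>	/* flsll() on FreeBSD */

/*
 * Sketch: scale * delta cannot overflow a uint64_t when the bit
 * widths of the two operands sum to 64 or less.
 */
static int
mul_may_overflow(uint64_t scale, u_int delta)
{
	return (flsll(scale) + flsll(delta) > 64);
}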

> 
> >
> > diff --git a/lib/libc/sys/__vdso_gettimeofday.c b/lib/libc/sys/__vdso_gettimeofday.c
> > index 3749e0473af..cfe3d96d001 100644
> > --- a/lib/libc/sys/__vdso_gettimeofday.c
> > +++ b/lib/libc/sys/__vdso_gettimeofday.c
> > @@ -32,6 +32,8 @@ __FBSDID("$FreeBSD$");
> > #include <sys/time.h>
> > #include <sys/vdso.h>
> > #include <errno.h>
> > +#include <limits.h>
> 
> Not needed with 0xffffffff instead of UINT_MAX.
> 
> The userland part is otherwise little changed.
Yes, see above.  If the ABI for the shared page is changed at some
point in the future, I will export th_large_delta as well.
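
(Until then, a userland fallback could derive the threshold from the
exported th_scale itself, at the cost of one 64-bit division per call.
A hypothetical sketch, reusing the formula from the kernel patch
below:)

/*
 * th_large_delta is not in the shared-page ABI, but the bound can
 * be recomputed from the exported th_scale: th_scale * delta < 2^63
 * holds whenever delta < 2^63 / th_scale.  This is conservative by
 * a factor of two against the real 2^64 overflow limit, which is
 * safe; the slow path is merely taken a bit more often.
 */
static int
delta_is_large(const struct vdso_timehands *th, u_int delta)
{
	return (delta >= ((uint64_t)1 << 63) / th->th_scale);
}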

> 
> > diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> > index 2656fb4d22f..2e28f872229 100644
> > --- a/sys/kern/kern_tc.c
> > +++ b/sys/kern/kern_tc.c
> > ...
> > @@ -351,17 +352,44 @@ fbclock_getmicrotime(struct timeval *tvp)
> > 	} while (gen == 0 || gen != th->th_generation);
> > }
> > #else /* !FFCLOCK */
> > +
> > +static void
> > +bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
> > +{
> > +	uint64_t x;
> > +
> > +	x = (*scale >> 32) * delta;
> > +	*scale &= 0xffffffff;
> > +	bt->sec += x >> 32;
> > +	bintime_addx(bt, x << 32);
> > +}
> 
> It is probably best to not inline the slow path, but clang tends to
> inline everything anyway.
It does not matter whether it is inlined, as long as it is moved out of
the linear sequence of the fast path.

> 
> I prefer my way of writing this in 3 lines.  Modifying 'scale' for
> the next step is especially ugly and pessimal when the next step is
> in the caller and this function is not inlined.
Can you show exactly what you want?
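
(Presumably the three-line form posted earlier in the thread, which
leaves 'scale' untouched and folds the low partial product into the
final bintime_addx():)

	x = (scale >> 32) * delta;
	bt->sec += x >> 32;
	bintime_addx(bt, (x << 32) + (scale & 0xffffffff) * delta);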

> 
> > +
> > void
> > binuptime(struct bintime *bt)
> > {
> > 	struct timehands *th;
> > -	u_int gen;
> > +	uint64_t scale;
> > +	u_int delta, gen;
> >
> > 	do {
> > 		th = timehands;
> > 		gen = atomic_load_acq_int(&th->th_generation);
> > 		*bt = th->th_offset;
> > -		bintime_addx(bt, th->th_scale * tc_delta(th));
> > +		scale = th->th_scale;
> > +		delta = tc_delta(th);
> > +#ifdef _LP64
> > +		/* Avoid overflow for scale * delta. */
> > +		if (__predict_false(th->th_large_delta <= delta))
> > +			bintime_helper(bt, &scale, delta);
> > +		bintime_addx(bt, scale * delta);
> > +#else
> > +		/*
> > +		 * Also avoid (uint64_t, uint32_t) -> uint64_t
> > +		 * multiplication on 32bit arches.
> > +		 */
> 
> "Also avoid overflow for ..."
> 
> > +		bintime_helper(bt, &scale, delta);
> > +		bintime_addx(bt, (u_int)scale * delta);
> 
> The cast should be to uint32_t, but better write it as & 0xffffffff as
> elsewhere.
> 
> bintime_helper() already reduced 'scale' to 32 bits.  The cast might be
> needed to tell the compiler this, especially when the function is not
> inlined.  Better not do it in the function.  The function doesn't even
> use the reduced value.
I used the cast to get a 32x32 multiplication.  I am not sure that all
(or any) compilers are smart enough to deduce that they can use a
32-bit mul.
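
(One caveat with the cast: in C, (uint32_t)scale * delta is itself a
32x32->32 multiply, so the high word of the product is dropped before
the conversion back to uint64_t.  Widening one operand again, as in the
patch below, keeps the full product while still letting the compiler
emit a single MULL on i386.  A standalone demonstration:)

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t scale = 0xffffffff;	/* only the low 32 bits set */
	uint32_t delta = 0x80000000;

	/* 32x32->32: the high word of the product is silently lost. */
	uint64_t bad = (uint32_t)scale * delta;
	/* 32x32->64: the full 64-bit product survives. */
	uint64_t good = (uint64_t)(uint32_t)scale * delta;

	/* Prints bad=0x80000000 good=0x7fffffff80000000. */
	printf("bad=%#jx good=%#jx\n", (uintmax_t)bad, (uintmax_t)good);
	return (0);
}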

> 
> bintime_helper() is in the fast path in this case, so should be inlined.
> 
> > +#endif
> > 		atomic_thread_fence_acq();
> > 	} while (gen == 0 || gen != th->th_generation);
> > }
> 
> This needs lots of testing of course.

The current kernel-only part of the change is below; see the question
above about your preference for bintime_helper().

diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 2656fb4d22f..6c41ab22288 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -72,6 +71,7 @@ struct timehands {
 	struct timecounter	*th_counter;
 	int64_t			th_adjustment;
 	uint64_t		th_scale;
+	uint64_t		th_large_delta;
 	u_int	 		th_offset_count;
 	struct bintime		th_offset;
 	struct bintime		th_bintime;
@@ -351,17 +351,45 @@ fbclock_getmicrotime(struct timeval *tvp)
 	} while (gen == 0 || gen != th->th_generation);
 }
 #else /* !FFCLOCK */
+
+static void
+bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
+{
+	uint64_t x;
+
+	x = (*scale >> 32) * delta;
+	*scale &= 0xffffffff;
+	bt->sec += x >> 32;
+	bintime_addx(bt, x << 32);
+}
+
 void
 binuptime(struct bintime *bt)
 {
 	struct timehands *th;
-	u_int gen;
+	uint64_t scale;
+	u_int delta, gen;
 
 	do {
 		th = timehands;
 		gen = atomic_load_acq_int(&th->th_generation);
 		*bt = th->th_offset;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
+		scale = th->th_scale;
+		delta = tc_delta(th);
+#ifdef _LP64
+		/* Avoid overflow for scale * delta. */
+		if (__predict_false(th->th_large_delta <= delta))
+			bintime_helper(bt, &scale, delta);
+		bintime_addx(bt, scale * delta);
+#else
+		/*
+		 * Avoid overflow as above, and also avoid the
+		 * full 64 x 32 -> 64 bit multiplication on
+		 * 32-bit arches.
+		 */
+		bintime_helper(bt, &scale, delta);
+		bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
+#endif
 		atomic_thread_fence_acq();
 	} while (gen == 0 || gen != th->th_generation);
 }
@@ -388,13 +416,29 @@ void
 bintime(struct bintime *bt)
 {
 	struct timehands *th;
-	u_int gen;
+	uint64_t scale;
+	u_int delta, gen;
 
 	do {
 		th = timehands;
 		gen = atomic_load_acq_int(&th->th_generation);
 		*bt = th->th_bintime;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
+		scale = th->th_scale;
+		delta = tc_delta(th);
+#ifdef _LP64
+		/* Avoid overflow for scale * delta. */
+		if (__predict_false(th->th_large_delta <= delta))
+			bintime_helper(bt, &scale, delta);
+		bintime_addx(bt, scale * delta);
+#else
+		/*
+		 * Avoid overflow as above, and also avoid the
+		 * full 64 x 32 -> 64 bit multiplication on
+		 * 32-bit arches.
+		 */
+		bintime_helper(bt, &scale, delta);
+		bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
+#endif
 		atomic_thread_fence_acq();
 	} while (gen == 0 || gen != th->th_generation);
 }
@@ -1464,6 +1508,7 @@ tc_windup(struct bintime *new_boottimebin)
 	scale += (th->th_adjustment / 1024) * 2199;
 	scale /= th->th_counter->tc_frequency;
 	th->th_scale = scale * 2;
+	th->th_large_delta = ((uint64_t)1 << 63) / scale;
 
 	/*
 	 * Now that the struct timehands is again consistent, set the new
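
As a quick sanity check of the decomposition (not part of the patch),
the sec/frac result of the split can be compared against a 128-bit
product on compilers that provide __int128.  Note that since th_scale
is scale * 2, the threshold delta < 2^63 / scale is, up to integer
rounding, the condition th_scale * delta < 2^64, the point at which
the single 64-bit multiply would overflow.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* Values chosen so that delta exceeds 2^63 / scale. */
	uint64_t scale = ((uint64_t)1 << 34) | 0x12345678;
	uint32_t delta = 0xf0000000;
	unsigned __int128 prod = (unsigned __int128)scale * delta;
	uint64_t sec = 0, frac, x, lo;

	/* bintime_helper(): the high word of scale times delta. */
	x = (scale >> 32) * delta;
	sec += x >> 32;
	frac = x << 32;

	/* The low word of scale times delta, as a full 64-bit product. */
	lo = (scale & 0xffffffff) * (uint64_t)delta;
	if (frac + lo < frac)	/* carry out of the fraction */
		sec++;
	frac += lo;

	/* The high 64 bits of the product are whole seconds. */
	assert(sec == (uint64_t)(prod >> 64) && frac == (uint64_t)prod);
	printf("sec=%ju frac=%#jx\n", (uintmax_t)sec, (uintmax_t)frac);
	return (0);
}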

