powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]

Wed Mar 6 17:20:15 UTC 2019

On Tue, Mar 05, 2019 at 05:17:14AM +1100, Bruce Evans wrote:
> On Mon, 4 Mar 2019, Konstantin Belousov wrote:
> 
> > On Mon, Mar 04, 2019 at 05:29:48AM +1100, Bruce Evans wrote:
> >> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
> >>
> >>> On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
> >>>> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
> >>>>
> >>>>> On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
> >>>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>>>>>
> >>>>>>> On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
> >>>>>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>> * ...
> >>>> However, shrd in rdtsc-low (tsc_get_timecount_low()) does a slow combining
> >>>> step.  i386 used to be faster here -- the first masking step of discarding
> >>>> %edx doesn't take any code.  amd64 has to mask out the top bits in %rax.
> >>>> Now for the tsc-low pessimization, i386 has to do a slow shrd, and amd64
> >>>> has to do a not so slow shr.
> >>> i386 cannot discard %edx after RDTSC since some bits from %edx come into
> >>> the timecounter value.
> >>
> >> These bits are part of the tsc-low pessimization.  The shift count should
> >> always be 1, giving a TSC frequency of > INT32_MAX (usually) and > UINT32_MAX
> >> sometimes.
> >>
> >> When tsc-low was new, the shift count was often larger (as much as 8),
> >> and it is still changeable by a read-only tunable, but now it is 1 in
> >> almost all cases.  The code only limits the timecounter frequency
> >> to UINT_MAX, except the tunable defaults to 1 so average CPUs running
> >> at nearly 4 GHz are usually limited to about 2 GHz.  The comment about
> >> this UINT_MAX doesn't match the code.  The comment says int, but the
> >> code says UINT.
> >>
> >> All that a shoft count of 1 does is waste time to lose 1 bit of accuracy.
> >> This much accuracy is noise for most purposes.
> >>
> >> The tunable is fairly undocumented.  Its description is "Shift to apply
> >> for the maximum TSC frequency".  Of course, it has no effect on the TSC
> >> frequency.  It only affects the TSC timecounter frequency.
> > I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
> > Otherwise, I think, some multi-socket machines would start showing the
> > detectable backward-counting bintime().  At the frequencies at 4GHz and
> > above (Intel has 5Ghz part numbers) I do not think that stability of
> > 100MHz crystall and on-board traces is enough to avoid that.
> 
> I think it is just a kludge that reduced the problem before it was fixed
> properly using fences.
> 
> Cross-socket latency is over 100 cycles according to jhb's tscskew
> benchmark: on Haswell 4x2:
> 
> CPU | TSC skew (min/avg/max/stddev)
> ----+------------------------------
>    0 |     0     0     0    0.000
>    1 |    24    49    84   14.353
>    2 |   164   243   308   47.811
>    3 |   164   238   312   47.242
>    4 |   168   242   332   49.593
>    5 |   168   243   324   48.722
>    6 |   172   242   320   52.596
>    7 |   172   240   316   53.014
> 
> freefall is similar.  Latency is apparently measured relative to CPU 0.
> It is much lower to CPU 1 since that is on the same core.
> 
> I played with this program a lot 3 and a half years ago, but forgot
> mist of what I learned :-(.  I tried different fencing in it.  This
> seems to make little difference when the program is rerun.  With the
> default TESTS = 1024, the min skew sometimes goes negative on freefall,
> but with TESTS = 1024000 that doesn't happen.  This is the opposite
> of what I would expect.  freefall has load average about 1.
> 
> Removing the only fencing in it reduces average latency by 10-20 cycles
> and minimum latency by over 100 cycles, except on freefall it is
> reduced from 33 to 6.  On Haswell it is 24 with fencing and I didn't
> test it with no fencing.
> 
> I think tscskew doesn't really measure tsc skew.  What it measures is
> the time taken for a locking protocol, using the TSCs on different
> CPUs to make the start and end timestamps.  If the TSCs have a lot of
> skew or jitter, then this will show up indirectly as inconsistent and
> possibly negative differences.
> 
> A shift of just 1 can't hide latencies of hundreds of cycles on single-
> socket machines.  Even a shift of 8 only works sometimes, by reducing
> the chance of observing the TSC going backwards by a factor of 256.
> E.g., assume for simplicity that all instructions and IPCs take 0-1
> cycles, and that unfenced rdtsc's differ by at most +-5 cycles (with
> the 11 values between -5 and 5 uniformly distributed.  Then with a
> shift of 0 and no fences, a CPU that updates the timehands is ahead of
> another CPU that spins reading the timehands about 5/11 of the time.
> With a shift of 8, the CPUs are close enough when the first one reads
> at least 5 above and at least 5 below a 256-boundary.  The chance of
> seeing a negative difference is reduced by at least a factor of 10/256.
> 
> > I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
> > Otherwise, I think, some multi-socket machines would start showing the
> > detectable backward-counting bintime().  At the frequencies at 4GHz and
> > above (Intel has 5Ghz part numbers) I do not think that stability of
> > 100MHz crystall and on-board traces is enough to avoid that.
> 
> Why would losing just 1 bit fix that?
> 
> Fences for rdtsc of course only serialize it for the CPU that runs it.
> The locking (ordering) protocol (for the generation count) orders the
> CPUs too.  It takes longer than we would like, much more than the 1-
> cycle error that might be hidden by ignoring the low bit.  Surely the
> ordering protocol must work across sockets?  It then gives ordering of
> rdtsc's.
> 
> TSC-low was added in 2011.  That was long before the ordering was fixed.
> You added fences in 2012 and memory ordering for the generation count in
> 2016.  Fences slowed everything down by 10-20+ cycles and probably hide
> bugs in the memory ordering better than TSC-low.  Memory ordering plus
> fences slow down the cross-core case by more than 100 cycles according
> to tscskew.  That is enough to hide large hardware bugs.
> 
> > We can try to set the tsc-low shift count to 0 (but keep lfence) and see
> > what is going on in HEAD, but I am afraid that the HEAD users population
> > is not representative enough to catch the issue with the certainity.
> > More, it is unclear to me how to diagnose the cause, e.g. I would expect
> > the sleeps to hang on timeouts, as was reported from the very beginning
> > of this thread. How would we root-cause it ?
> 
> Negative time differences cause lots of overflows so break the timecounter.
> The fix under discussion actually gives larger overflows in the positive
> direction.  E.g., a delta of -1 first overflows to 0xffffffff.  The fix
> prevents overflow on multiplication by that.  When the timecounter
> frequency is small, say 1 MHz, 0xffffffff means 4294 seconds, so the
> timecounter advances by that.
> 
> >>> amd64 cannot either, but amd64 does not need to mask out top bits in %rax,
> >>> since the whole shrdl calculation occurs in 32bit registers, and the result
> >>> is in %rax where top word is cleared by shrdl instruction automatically.
> >>> But the clearing is not required since result is unsigned int anyway.
> >>>
> >>> Dissassemble of tsc_get_timecount_low() is very clear:
> >>>   0xffffffff806767e4 <+4>:     mov    0x30(%rdi),%ecx
> >>>   0xffffffff806767e7 <+7>:     rdtsc
> >>>   0xffffffff806767e9 <+9>:     shrd   %cl,%edx,%eax
> >>> ...
> >>>   0xffffffff806767ed <+13>:    retq
> >>> (I removed frame manipulations).
> 
> I checked that all compilers still produce horrible code for the better
> source code 'return (rdtsc() << (intptr_t)tc->tc_priv);'.  64-bit shifts
> are apparently pessimal for compatibility.  The above is written mostly
> in asm to avoid 2-5 extra instructions.
> 
> >>>> ...
> >>>> Similarly in bintime().
> >>> I merged two functions, finally.  Having to copy the same code is too
> >>> annoying for this change.
> 
> I strongly disklike the merge.
> 
> >>> So I verified that:
> >>> - there is no 64bit multiplication in the generated code, for i386 both
> >>>  for clang 7.0 and gcc 8.3;
> >>> - that everything is inlined, the only call from bintime/binuptime is
> >>>  the indirect call to get the timecounter value.
> >>
> >> I will have to fix it for compilers that I use.
> > Ok, I will add __inline.
> 
> That will make it fast enough, but still hard to read.
> 
> >>> +		*bt = *bts;
> >>> +		scale = th->th_scale;
> >>> +		delta = tc_delta(th);
> >>> +#ifdef _LP64
> >>> +		if (__predict_false(th->th_large_delta <= delta)) {
> >>> +			/* Avoid overflow for scale * delta. */
> >>> +			bintime_helper(bt, scale, delta);
> >>> +			bintime_addx(bt, (scale & 0xffffffff) * delta);
> >>> +		} else {
> >>> +			bintime_addx(bt, scale * delta);
> >>> +		}
> >>> +#else
> >>> +		/*
> >>> +		 * Use bintime_helper() unconditionally, since the fast
> >>> +		 * path in the above method is not so fast here, since
> >>> +		 * the 64 x 32 -> 64 bit multiplication is usually not
> >>> +		 * available in hardware and emulating it using 2
> >>> +		 * 32 x 32 -> 64 bit multiplications uses code much
> >>> +		 * like that in bintime_helper().
> >>> +		 */
> >>> +		bintime_helper(bt, scale, delta);
> >>> +		bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
> >>> +#endif
> >>
> >> Check that this method is really better.  Without this, the complicated
> >> part is about half as large and duplicating it is smaller than this
> >> version.
> > Better in what sence ?  I am fine with the C code, and asm code looks
> > good.
> 
> Better in terms of actually running significantly faster.  I fear the
> 32-bit method is actually slightly slower for the fast path.
> 
> >>> -	do {
> >>> -		th = timehands;
> >>> -		gen = atomic_load_acq_int(&th->th_generation);
> >>> -		*bt = th->th_bintime;
> >>> -		bintime_addx(bt, th->th_scale * tc_delta(th));
> >>> -		atomic_thread_fence_acq();
> >>> -	} while (gen == 0 || gen != th->th_generation);
> >>
> >> Duplicating this loop is much better than obfuscating it using inline
> >> functions.  This loop was almost duplicated (except for the delta
> >> calculation) in no less than 17 functions in kern_tc.c (9 tc ones and
> >> 8 fflock ones).  Now it is only duplicated 16 times.
> > How did you counted the 16 ?  I can see only 4 instances in the unpatched
> > kern_tc.c, and 3 in patched, but it is 3 and not 1 only because I do not
> > touch ffclock until the patch is finalized.  After that, it would be
> > 1 instance for kernel and 1 for userspace.
> 
> Grep for the end condition in this loop.  There are actually 20 of these.
> I'm counting the loops and not the previously-simple scaling operation in
> it.  The scaling is indeed only done for 4 cases.  I prefer the 20 
> duplications (except I only want about 6 of the functions).  Duplication
> works even better for only 4 cases.
Ok, I merged these as well.  Now there are only four loops left in kernel.
I do not think that merging them is beneficial, since they have sufficiently
different bodies.

I disagree with you characterization of it as obfuscation, IMO it improves
the maintainability of the code by reducing number of places which need
careful inspection of the lock-less algorithm.

> 
> This should be written as a function call to 1 new function to replace
> the line with the overflowing multiplication.  The line is always the
> same, so the new function call can look like bintime_xxx(bt, th).
Again, please provide at least of a pseudocode of your preference.

The current patch becomes to large already, I want to test/commit what
I already have, and I will need to split it for the commit.

diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 2656fb4d22f..7114a0e5219 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -72,6 +72,7 @@ struct timehands {
 	struct timecounter	*th_counter;
 	int64_t			th_adjustment;
 	uint64_t		th_scale;
+	uint64_t		th_large_delta;
 	u_int	 		th_offset_count;
 	struct bintime		th_offset;
 	struct bintime		th_bintime;
@@ -200,22 +201,77 @@ tc_delta(struct timehands *th)
  * the comment in <sys/time.h> for a description of these 12 functions.
  */
 
-#ifdef FFCLOCK
-void
-fbclock_binuptime(struct bintime *bt)
+static __inline void
+bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)
+{
+	uint64_t x;
+
+	x = (scale >> 32) * delta;
+	bt->sec += x >> 32;
+	bintime_addx(bt, x << 32);
+}
+
+static __inline void
+binnouptime(struct bintime *bt, u_int off)
 {
 	struct timehands *th;
-	unsigned int gen;
+	struct bintime *bts;
+	uint64_t scale;
+	u_int delta, gen;
 
 	do {
 		th = timehands;
 		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_offset;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
+		bts = (struct bintime *)(vm_offset_t)th + off;
+		*bt = *bts;
+		scale = th->th_scale;
+		delta = tc_delta(th);
+#ifdef _LP64
+		if (__predict_false(th->th_large_delta <= delta)) {
+			/* Avoid overflow for scale * delta. */
+			bintime_helper(bt, scale, delta);
+			bintime_addx(bt, (scale & 0xffffffff) * delta);
+		} else {
+			bintime_addx(bt, scale * delta);
+		}
+#else
+		/*
+		 * Use bintime_helper() unconditionally, since the fast
+		 * path in the above method is not so fast here, since
+		 * the 64 x 32 -> 64 bit multiplication is usually not
+		 * available in hardware and emulating it using 2
+		 * 32 x 32 -> 64 bit multiplications uses code much
+		 * like that in bintime_helper().
+		 */
+		bintime_helper(bt, scale, delta);
+		bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
+#endif
 		atomic_thread_fence_acq();
 	} while (gen == 0 || gen != th->th_generation);
 }
 
+static __inline void
+getbinnouptime(void *out, size_t out_size, u_int off)
+{
+	struct timehands *th;
+	u_int gen;
+
+	do {
+		th = timehands;
+		gen = atomic_load_acq_int(&th->th_generation);
+		memcpy(out, (char *)th + off, out_size);
+		atomic_thread_fence_acq();
+	} while (gen == 0 || gen != th->th_generation);
+}
+
+#ifdef FFCLOCK
+void
+fbclock_binuptime(struct bintime *bt)
+{
+
+	binnouptime(bt, __offsetof(struct timehands, th_offset));
+}
+
 void
 fbclock_nanouptime(struct timespec *tsp)
 {
@@ -237,16 +293,8 @@ fbclock_microuptime(struct timeval *tvp)
 void
 fbclock_bintime(struct bintime *bt)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_bintime;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	binnouptime(bt, __offsetof(struct timehands, th_bintime));
 }
 
 void
@@ -270,100 +318,61 @@ fbclock_microtime(struct timeval *tvp)
 void
 fbclock_getbinuptime(struct bintime *bt)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_offset;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(bt, sizeof(*bt), __offsetof(struct timehands,
+	    th_offset));
 }
 
 void
 fbclock_getnanouptime(struct timespec *tsp)
 {
-	struct timehands *th;
-	unsigned int gen;
+	struct bintime bt;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		bintime2timespec(&th->th_offset, tsp);
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(&bt, sizeof(bt), __offsetof(struct timehands,
+	    th_offset));
+	bintime2timespec(&bt, tsp);
 }
 
 void
 fbclock_getmicrouptime(struct timeval *tvp)
 {
-	struct timehands *th;
-	unsigned int gen;
+	struct bintime bt;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		bintime2timeval(&th->th_offset, tvp);
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(&bt, sizeof(bt), __offsetof(struct timehands,
+	    th_offset));
+	bintime2timeval(&bt, tvp);
 }
 
 void
 fbclock_getbintime(struct bintime *bt)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_bintime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(bt, sizeof(*bt), __offsetof(struct timehands,
+	    th_bintime));
 }
 
 void
 fbclock_getnanotime(struct timespec *tsp)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tsp = th->th_nanotime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(tsp, sizeof(*tsp), __offsetof(struct timehands,
+	    th_nanotime));
 }
 
 void
 fbclock_getmicrotime(struct timeval *tvp)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tvp = th->th_microtime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(tvp, sizeof(*tvp), __offsetof(struct timehands,
+	    th_microtime));
 }
 #else /* !FFCLOCK */
+
 void
 binuptime(struct bintime *bt)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_offset;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	binnouptime(bt, __offsetof(struct timehands, th_offset));
 }
 
 void
@@ -387,16 +396,8 @@ microuptime(struct timeval *tvp)
 void
 bintime(struct bintime *bt)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_bintime;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	binnouptime(bt, __offsetof(struct timehands, th_bintime));
 }
 
 void
@@ -420,85 +421,53 @@ microtime(struct timeval *tvp)
 void
 getbinuptime(struct bintime *bt)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_offset;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(bt, sizeof(*bt), __offsetof(struct timehands,
+	    th_offset));
 }
 
 void
 getnanouptime(struct timespec *tsp)
 {
-	struct timehands *th;
-	u_int gen;
+	struct bintime bt;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		bintime2timespec(&th->th_offset, tsp);
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(&bt, sizeof(bt), __offsetof(struct timehands,
+	    th_offset));
+	bintime2timespec(&bt, tsp);
 }
 
 void
 getmicrouptime(struct timeval *tvp)
 {
-	struct timehands *th;
-	u_int gen;
+	struct bintime bt;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		bintime2timeval(&th->th_offset, tvp);
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(&bt, sizeof(bt), __offsetof(struct timehands,
+	    th_offset));
+	bintime2timeval(&bt, tvp);
 }
 
 void
 getbintime(struct bintime *bt)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_bintime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(bt, sizeof(*bt), __offsetof(struct timehands,
+	    th_bintime));
 }
 
 void
 getnanotime(struct timespec *tsp)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tsp = th->th_nanotime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(tsp, sizeof(*tsp), __offsetof(struct timehands,
+	    th_nanotime));
 }
 
 void
 getmicrotime(struct timeval *tvp)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tvp = th->th_microtime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(tvp, sizeof(*tvp), __offsetof(struct timehands,
+	    th_microtime));
 }
 #endif /* FFCLOCK */
 
@@ -514,15 +483,9 @@ getboottime(struct timeval *boottime)
 void
 getboottimebin(struct bintime *boottimebin)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*boottimebin = th->th_boottime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(boottimebin, sizeof(*boottimebin),
+	    __offsetof(struct timehands, th_boottime));
 }
 
 #ifdef FFCLOCK
@@ -1038,15 +1001,9 @@ getmicrotime(struct timeval *tvp)
 void
 dtrace_getnanotime(struct timespec *tsp)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tsp = th->th_nanotime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(tsp, sizeof(*tsp), __offsetof(struct timehands,
+	    th_nanotime));
 }
 
 /*
@@ -1464,6 +1421,7 @@ tc_windup(struct bintime *new_boottimebin)
 	scale += (th->th_adjustment / 1024) * 2199;
 	scale /= th->th_counter->tc_frequency;
 	th->th_scale = scale * 2;
+	th->th_large_delta = ((uint64_t)1 << 63) / scale;
 
 	/*
 	 * Now that the struct timehands is again consistent, set the new