powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Konstantin Belousov
kostikbel at gmail.com
Wed Mar 6 17:20:15 UTC 2019
On Tue, Mar 05, 2019 at 05:17:14AM +1100, Bruce Evans wrote:
> On Mon, 4 Mar 2019, Konstantin Belousov wrote:
>
> > On Mon, Mar 04, 2019 at 05:29:48AM +1100, Bruce Evans wrote:
> >> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
> >>
> >>> On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
> >>>> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
> >>>>
> >>>>> On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
> >>>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>>>>>
> >>>>>>> On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
> >>>>>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>> * ...
> >>>> However, shrd in rdtsc-low (tsc_get_timecount_low()) does a slow combining
> >>>> step. i386 used to be faster here -- the first masking step of discarding
> >>>> %edx doesn't take any code. amd64 has to mask out the top bits in %rax.
> >>>> Now for the tsc-low pessimization, i386 has to do a slow shrd, and amd64
> >>>> has to do a not so slow shr.
> >>> i386 cannot discard %edx after RDTSC since some bits from %edx come into
> >>> the timecounter value.
> >>
> >> These bits are part of the tsc-low pessimization. The shift count should
> >> always be 1, giving a TSC frequency of > INT32_MAX (usually) and > UINT32_MAX
> >> sometimes.
> >>
> >> When tsc-low was new, the shift count was often larger (as much as 8),
> >> and it is still changeable by a read-only tunable, but now it is 1 in
> >> almost all cases. The code only limits the timecounter frequency
> >> to UINT_MAX, except the tunable defaults to 1 so average CPUs running
> >> at nearly 4 GHz are usually limited to about 2 GHz. The comment about
> >> this UINT_MAX doesn't match the code. The comment says int, but the
> >> code says UINT.
> >>
> >> All that a shift count of 1 does is waste time to lose 1 bit of accuracy.
> >> This much accuracy is noise for most purposes.
> >>
> >> The tunable is fairly undocumented. Its description is "Shift to apply
> >> for the maximum TSC frequency". Of course, it has no effect on the TSC
> >> frequency. It only affects the TSC timecounter frequency.
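> >>
> >> (For the record, the tunable is machdep.tsc_shift; assuming it still
> >> behaves as described here, setting machdep.tsc_shift="0" in
> >> loader.conf should give the full timecounter frequency.)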
> > I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
> > Otherwise, I think, some multi-socket machines would start showing
> > detectable backward counting in bintime(). At frequencies of 4GHz and
> > above (Intel has 5GHz part numbers) I do not think that the stability
> > of a 100MHz crystal and on-board traces is enough to avoid that.
>
> I think it is just a kludge that reduced the problem before it was fixed
> properly using fences.
>
> Cross-socket latency is over 100 cycles according to jhb's tscskew
> benchmark: on Haswell 4x2:
>
> CPU | TSC skew (min/avg/max/stddev)
> ----+------------------------------
> 0 | 0 0 0 0.000
> 1 | 24 49 84 14.353
> 2 | 164 243 308 47.811
> 3 | 164 238 312 47.242
> 4 | 168 242 332 49.593
> 5 | 168 243 324 48.722
> 6 | 172 242 320 52.596
> 7 | 172 240 316 53.014
>
> freefall is similar. Latency is apparently measured relative to CPU 0.
> It is much lower to CPU 1 since that is on the same core.
>
> I played with this program a lot 3 and a half years ago, but forgot
> most of what I learned :-(.  I tried different fencing in it.  This
> seems to make little difference when the program is rerun. With the
> default TESTS = 1024, the min skew sometimes goes negative on freefall,
> but with TESTS = 1024000 that doesn't happen. This is the opposite
> of what I would expect. freefall has load average about 1.
>
> Removing the only fencing in it reduces average latency by 10-20 cycles
> and minimum latency by over 100 cycles, except on freefall it is
> reduced from 33 to 6. On Haswell it is 24 with fencing and I didn't
> test it with no fencing.
>
> I think tscskew doesn't really measure tsc skew. What it measures is
> the time taken for a locking protocol, using the TSCs on different
> CPUs to make the start and end timestamps. If the TSCs have a lot of
> skew or jitter, then this will show up indirectly as inconsistent and
> possibly negative differences.
>
> A shift of just 1 can't hide latencies of hundreds of cycles on single-
> socket machines. Even a shift of 8 only works sometimes, by reducing
> the chance of observing the TSC going backwards by a factor of 256.
> E.g., assume for simplicity that all instructions and IPCs take 0-1
> cycles, and that unfenced rdtsc's differ by at most +-5 cycles (with
> the 11 values between -5 and 5 uniformly distributed).  Then with a
> shift of 0 and no fences, a CPU that updates the timehands is ahead of
> another CPU that spins reading the timehands about 5/11 of the time.
> With a shift of 8, the CPUs are close enough when the first one reads
> a value at least 5 away from the nearest 256-boundary.  The chance of
> seeing a negative difference is reduced by at least a factor of 10/256.
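>
> To make the model concrete, here is a toy simulation of exactly this
> setup (untested); the 11 and +-5 are the same assumptions as above:
>
>	#include <stdint.h>
>	#include <stdio.h>
>	#include <stdlib.h>
>
>	/*
>	 * Count how often a reader whose TSC differs by a uniform
>	 * error in [-5, 5] cycles appears to see the (shifted)
>	 * counter go backwards.
>	 */
>	static unsigned long
>	backwards(unsigned int shift, unsigned long trials)
>	{
>		uint64_t a, b;
>		unsigned long i, n;
>
>		n = 0;
>		for (i = 0; i < trials; i++) {
>			a = (uint64_t)random() + 5; /* + 5 avoids wrap at 0 */
>			b = a + random() % 11 - 5;  /* error in [-5, 5] */
>			if ((b >> shift) < (a >> shift))
>				n++;
>		}
>		return (n);
>	}
>
>	int
>	main(void)
>	{
>
>		printf("shift 0: %lu\n", backwards(0, 1000000));
>		printf("shift 8: %lu\n", backwards(8, 1000000));
>		return (0);
>	}
>
> With shift 0 this should report about 5/11 of the trials going
> backwards, and with shift 8 roughly 1/100 as many.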
>
> > I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
> > Otherwise, I think, some multi-socket machines would start showing
> > detectable backward counting in bintime(). At frequencies of 4GHz and
> > above (Intel has 5GHz part numbers) I do not think that the stability
> > of a 100MHz crystal and on-board traces is enough to avoid that.
>
> Why would losing just 1 bit fix that?
>
> Fences for rdtsc of course only serialize it for the CPU that runs it.
> The locking (ordering) protocol (for the generation count) orders the
> CPUs too. It takes longer than we would like, much more than the 1-
> cycle error that might be hidden by ignoring the low bit. Surely the
> ordering protocol must work across sockets? It then gives ordering of
> rdtsc's.
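>
> (The fenced readers are just the plain ones behind a fence, along the
> lines of this sketch -- not the exact source:
>
>	static u_int
>	tsc_get_timecount_lfence(struct timecounter *tc)
>	{
>
>		lfence();		/* order rdtsc on this CPU */
>		return (tsc_get_timecount(tc));
>	}
>
> so all of their extra cost is the serialization itself.)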
>
> TSC-low was added in 2011. That was long before the ordering was fixed.
> You added fences in 2012 and memory ordering for the generation count in
> 2016. Fences slowed everything down by 10-20+ cycles and probably hide
> bugs in the memory ordering better than TSC-low. Memory ordering plus
> fences slow down the cross-core case by more than 100 cycles according
> to tscskew. That is enough to hide large hardware bugs.
>
> > We can try to set the tsc-low shift count to 0 (but keep lfence) and see
> > what is going on in HEAD, but I am afraid that the HEAD users population
> > is not representative enough to catch the issue with certainty.
> > Moreover, it is unclear to me how to diagnose the cause, e.g. I would expect
> > the sleeps to hang on timeouts, as was reported from the very beginning
> > of this thread. How would we root-cause it ?
>
> Negative time differences cause lots of overflows and so break the
> timecounter.
> The fix under discussion actually gives larger overflows in the positive
> direction. E.g., a delta of -1 first overflows to 0xffffffff. The fix
> prevents overflow on multiplication by that. When the timecounter
> frequency is small, say 1 MHz, 0xffffffff means 4294 seconds, so the
> timecounter advances by that.
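>
> The arithmetic, as a standalone illustration (a toy, not kernel code):
>
>	#include <stdint.h>
>	#include <stdio.h>
>
>	int
>	main(void)
>	{
>		uint32_t new = 1000, old = 1001; /* counter read "backwards" */
>		uint32_t delta = new - old;	 /* wraps to 0xffffffff */
>
>		/* At a 1 MHz timecounter frequency, 1 tick is 1 us. */
>		printf("delta = %#x = %.0f seconds\n", delta, delta / 1e6);
>		return (0);
>	}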
>
> >>> amd64 cannot either, but amd64 does not need to mask out the top bits
> >>> in %rax, since the whole shrdl calculation occurs in 32-bit registers,
> >>> and the result is in %rax, where the top word is cleared by the shrdl
> >>> instruction automatically. But the clearing is not required since the
> >>> result is unsigned int anyway.
> >>>
> >>> Disassembly of tsc_get_timecount_low() is very clear:
> >>> 0xffffffff806767e4 <+4>: mov 0x30(%rdi),%ecx
> >>> 0xffffffff806767e7 <+7>: rdtsc
> >>> 0xffffffff806767e9 <+9>: shrd %cl,%edx,%eax
> >>> ...
> >>> 0xffffffff806767ed <+13>: retq
> >>> (I removed frame manipulations).
>
> I checked that all compilers still produce horrible code for the better
> source code 'return (rdtsc() >> (intptr_t)tc->tc_priv);'.  64-bit shifts
> are apparently pessimal for compatibility. The above is written mostly
> in asm to avoid 2-5 extra instructions.
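>
> For reference, the committed function is written approximately like
> this (quoted from memory, so treat it as a sketch):
>
>	static u_int
>	tsc_get_timecount_low(struct timecounter *tc)
>	{
>		uint32_t rv;
>
>		__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
>		    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>		return (rv);
>	}
>
> which gives exactly the mov/rdtsc/shrd sequence disassembled above.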
>
> >>>> ...
> >>>> Similarly in bintime().
> >>> I merged two functions, finally. Having to copy the same code is too
> >>> annoying for this change.
>
> I strongly dislike the merge.
>
> >>> So I verified that:
> >>> - there is no 64bit multiplication in the generated code, for i386 both
> >>> for clang 7.0 and gcc 8.3;
> >>> - that everything is inlined, the only call from bintime/binuptime is
> >>> the indirect call to get the timecounter value.
> >>
> >> I will have to fix it for compilers that I use.
> > Ok, I will add __inline.
>
> That will make it fast enough, but still hard to read.
>
> >>> + *bt = *bts;
> >>> + scale = th->th_scale;
> >>> + delta = tc_delta(th);
> >>> +#ifdef _LP64
> >>> + if (__predict_false(th->th_large_delta <= delta)) {
> >>> + /* Avoid overflow for scale * delta. */
> >>> + bintime_helper(bt, scale, delta);
> >>> + bintime_addx(bt, (scale & 0xffffffff) * delta);
> >>> + } else {
> >>> + bintime_addx(bt, scale * delta);
> >>> + }
> >>> +#else
> >>> + /*
> >>> + * Use bintime_helper() unconditionally, since the fast
> >>> + * path in the above method is not so fast here, since
> >>> + * the 64 x 32 -> 64 bit multiplication is usually not
> >>> + * available in hardware and emulating it using 2
> >>> + * 32 x 32 -> 64 bit multiplications uses code much
> >>> + * like that in bintime_helper().
> >>> + */
> >>> + bintime_helper(bt, scale, delta);
> >>> + bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
> >>> +#endif
> >>
> >> Check that this method is really better. Without this, the complicated
> >> part is about half as large and duplicating it is smaller than this
> >> version.
> > Better in what sense?  I am fine with the C code, and the asm code looks
> > good.
>
> Better in terms of actually running significantly faster. I fear the
> 32-bit method is actually slightly slower for the fast path.
>
> >>> - do {
> >>> - th = timehands;
> >>> - gen = atomic_load_acq_int(&th->th_generation);
> >>> - *bt = th->th_bintime;
> >>> - bintime_addx(bt, th->th_scale * tc_delta(th));
> >>> - atomic_thread_fence_acq();
> >>> - } while (gen == 0 || gen != th->th_generation);
> >>
> >> Duplicating this loop is much better than obfuscating it using inline
> >> functions. This loop was almost duplicated (except for the delta
> >> calculation) in no less than 17 functions in kern_tc.c (9 tc ones and
> >> 8 fflock ones). Now it is only duplicated 16 times.
> > How did you count 16?  I can see only 4 instances in the unpatched
> > kern_tc.c, and 3 in the patched one, but it is 3 and not 1 only because
> > I do not touch ffclock until the patch is finalized.  After that, it
> > would be 1 instance for the kernel and 1 for userspace.
>
> Grep for the end condition in this loop. There are actually 20 of these.
> I'm counting the loops and not the previously-simple scaling operation in
> it. The scaling is indeed only done for 4 cases. I prefer the 20
> duplications (except I only want about 6 of the functions). Duplication
> works even better for only 4 cases.
Ok, I merged these as well.  Now there are only four loops left in the
kernel.  I do not think that merging them is beneficial, since they have
sufficiently different bodies.
I disagree with your characterization of it as obfuscation; IMO it
improves the maintainability of the code by reducing the number of
places which need careful inspection of the lock-less algorithm.
>
> This should be written as a function call to 1 new function to replace
> the line with the overflowing multiplication. The line is always the
> same, so the new function call can look like bintime_xxx(bt, th).
Again, please provide at least pseudocode of your preference.
The current patch is becoming too large already; I want to test/commit
what I already have, and I will need to split it for the commit.
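
My reading of your suggestion, as an untested sketch (the name
bintime_xxx is yours, the body is my guess, and I show only the _LP64
variant), is:

static __inline void
bintime_xxx(struct bintime *bt, struct timehands *th)
{
	uint64_t scale;
	u_int delta;

	scale = th->th_scale;
	delta = tc_delta(th);
	if (__predict_false(th->th_large_delta <= delta)) {
		/* Avoid overflow for scale * delta. */
		bintime_helper(bt, scale, delta);
		bintime_addx(bt, (scale & 0xffffffff) * delta);
	} else {
		bintime_addx(bt, scale * delta);
	}
}

If this is what you mean, it still leaves the generation loop
duplicated in each caller, which is exactly what I want to avoid.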
diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 2656fb4d22f..7114a0e5219 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -72,6 +72,7 @@ struct timehands {
struct timecounter *th_counter;
int64_t th_adjustment;
uint64_t th_scale;
+ uint64_t th_large_delta;
u_int th_offset_count;
struct bintime th_offset;
struct bintime th_bintime;
@@ -200,22 +201,77 @@ tc_delta(struct timehands *th)
* the comment in <sys/time.h> for a description of these 12 functions.
*/
-#ifdef FFCLOCK
-void
-fbclock_binuptime(struct bintime *bt)
+static __inline void
+bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)
+{
+ uint64_t x;
+
+ x = (scale >> 32) * delta;
+ bt->sec += x >> 32;
+ bintime_addx(bt, x << 32);
+}
+
+static __inline void
+binnouptime(struct bintime *bt, u_int off)
{
struct timehands *th;
- unsigned int gen;
+ struct bintime *bts;
+ uint64_t scale;
+ u_int delta, gen;
do {
th = timehands;
gen = atomic_load_acq_int(&th->th_generation);
- *bt = th->th_offset;
- bintime_addx(bt, th->th_scale * tc_delta(th));
+ bts = (struct bintime *)(vm_offset_t)th + off;
+ *bt = *bts;
+ scale = th->th_scale;
+ delta = tc_delta(th);
+#ifdef _LP64
+ if (__predict_false(th->th_large_delta <= delta)) {
+ /* Avoid overflow for scale * delta. */
+ bintime_helper(bt, scale, delta);
+ bintime_addx(bt, (scale & 0xffffffff) * delta);
+ } else {
+ bintime_addx(bt, scale * delta);
+ }
+#else
+ /*
+ * Use bintime_helper() unconditionally, since the fast
+ * path in the above method is not so fast here, since
+ * the 64 x 32 -> 64 bit multiplication is usually not
+ * available in hardware and emulating it using 2
+ * 32 x 32 -> 64 bit multiplications uses code much
+ * like that in bintime_helper().
+ */
+ bintime_helper(bt, scale, delta);
+ bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
+#endif
atomic_thread_fence_acq();
} while (gen == 0 || gen != th->th_generation);
}
+static __inline void
+getbinnouptime(void *out, size_t out_size, u_int off)
+{
+ struct timehands *th;
+ u_int gen;
+
+ do {
+ th = timehands;
+ gen = atomic_load_acq_int(&th->th_generation);
+ memcpy(out, (char *)th + off, out_size);
+ atomic_thread_fence_acq();
+ } while (gen == 0 || gen != th->th_generation);
+}
+
+#ifdef FFCLOCK
+void
+fbclock_binuptime(struct bintime *bt)
+{
+
+ binnouptime(bt, __offsetof(struct timehands, th_offset));
+}
+
void
fbclock_nanouptime(struct timespec *tsp)
{
@@ -237,16 +293,8 @@ fbclock_microuptime(struct timeval *tvp)
void
fbclock_bintime(struct bintime *bt)
{
- struct timehands *th;
- unsigned int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *bt = th->th_bintime;
- bintime_addx(bt, th->th_scale * tc_delta(th));
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ binnouptime(bt, __offsetof(struct timehands, th_bintime));
}
void
@@ -270,100 +318,61 @@ fbclock_microtime(struct timeval *tvp)
void
fbclock_getbinuptime(struct bintime *bt)
{
- struct timehands *th;
- unsigned int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *bt = th->th_offset;
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(bt, sizeof(*bt), __offsetof(struct timehands,
+ th_offset));
}
void
fbclock_getnanouptime(struct timespec *tsp)
{
- struct timehands *th;
- unsigned int gen;
+ struct bintime bt;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- bintime2timespec(&th->th_offset, tsp);
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(&bt, sizeof(bt), __offsetof(struct timehands,
+ th_offset));
+ bintime2timespec(&bt, tsp);
}
void
fbclock_getmicrouptime(struct timeval *tvp)
{
- struct timehands *th;
- unsigned int gen;
+ struct bintime bt;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- bintime2timeval(&th->th_offset, tvp);
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(&bt, sizeof(bt), __offsetof(struct timehands,
+ th_offset));
+ bintime2timeval(&bt, tvp);
}
void
fbclock_getbintime(struct bintime *bt)
{
- struct timehands *th;
- unsigned int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *bt = th->th_bintime;
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(bt, sizeof(*bt), __offsetof(struct timehands,
+ th_bintime));
}
void
fbclock_getnanotime(struct timespec *tsp)
{
- struct timehands *th;
- unsigned int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *tsp = th->th_nanotime;
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(tsp, sizeof(*tsp), __offsetof(struct timehands,
+ th_nanotime));
}
void
fbclock_getmicrotime(struct timeval *tvp)
{
- struct timehands *th;
- unsigned int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *tvp = th->th_microtime;
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(tvp, sizeof(*tvp), __offsetof(struct timehands,
+ th_microtime));
}
#else /* !FFCLOCK */
+
void
binuptime(struct bintime *bt)
{
- struct timehands *th;
- u_int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *bt = th->th_offset;
- bintime_addx(bt, th->th_scale * tc_delta(th));
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ binnouptime(bt, __offsetof(struct timehands, th_offset));
}
void
@@ -387,16 +396,8 @@ microuptime(struct timeval *tvp)
void
bintime(struct bintime *bt)
{
- struct timehands *th;
- u_int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *bt = th->th_bintime;
- bintime_addx(bt, th->th_scale * tc_delta(th));
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ binnouptime(bt, __offsetof(struct timehands, th_bintime));
}
void
@@ -420,85 +421,53 @@ microtime(struct timeval *tvp)
void
getbinuptime(struct bintime *bt)
{
- struct timehands *th;
- u_int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *bt = th->th_offset;
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(bt, sizeof(*bt), __offsetof(struct timehands,
+ th_offset));
}
void
getnanouptime(struct timespec *tsp)
{
- struct timehands *th;
- u_int gen;
+ struct bintime bt;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- bintime2timespec(&th->th_offset, tsp);
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(&bt, sizeof(bt), __offsetof(struct timehands,
+ th_offset));
+ bintime2timespec(&bt, tsp);
}
void
getmicrouptime(struct timeval *tvp)
{
- struct timehands *th;
- u_int gen;
+ struct bintime bt;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- bintime2timeval(&th->th_offset, tvp);
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(&bt, sizeof(bt), __offsetof(struct timehands,
+ th_offset));
+ bintime2timeval(&bt, tvp);
}
void
getbintime(struct bintime *bt)
{
- struct timehands *th;
- u_int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *bt = th->th_bintime;
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(bt, sizeof(*bt), __offsetof(struct timehands,
+ th_bintime));
}
void
getnanotime(struct timespec *tsp)
{
- struct timehands *th;
- u_int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *tsp = th->th_nanotime;
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(tsp, sizeof(*tsp), __offsetof(struct timehands,
+ th_nanotime));
}
void
getmicrotime(struct timeval *tvp)
{
- struct timehands *th;
- u_int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *tvp = th->th_microtime;
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(tvp, sizeof(*tvp), __offsetof(struct timehands,
+ th_microtime));
}
#endif /* FFCLOCK */
@@ -514,15 +483,9 @@ getboottime(struct timeval *boottime)
void
getboottimebin(struct bintime *boottimebin)
{
- struct timehands *th;
- u_int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *boottimebin = th->th_boottime;
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(boottimebin, sizeof(*boottimebin),
+ __offsetof(struct timehands, th_boottime));
}
#ifdef FFCLOCK
@@ -1038,15 +1001,9 @@ getmicrotime(struct timeval *tvp)
void
dtrace_getnanotime(struct timespec *tsp)
{
- struct timehands *th;
- u_int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *tsp = th->th_nanotime;
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ getbinnouptime(tsp, sizeof(*tsp), __offsetof(struct timehands,
+ th_nanotime));
}
/*
@@ -1464,6 +1421,7 @@ tc_windup(struct bintime *new_boottimebin)
scale += (th->th_adjustment / 1024) * 2199;
scale /= th->th_counter->tc_frequency;
th->th_scale = scale * 2;
+ th->th_large_delta = ((uint64_t)1 << 63) / scale;
/*
* Now that the struct timehands is again consistent, set the new
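
P.S.  For the record, the th_large_delta bound works out like this:
th_scale is approximately 2^64 / tc_frequency, and since th_scale is
the doubled local scale, 2^63 / scale == 2^64 / th_scale, which is
approximately tc_frequency, i.e. about one second worth of counter
ticks.  On the fast path delta < th_large_delta, hence th_scale *
delta < 2^64 and the plain 64-bit multiplication cannot overflow.
Larger deltas go through bintime_helper(), which evaluates

	scale * delta = (scale >> 32) * delta * 2^32 +
	    (scale & 0xffffffff) * delta

where each 32 x 32 -> 64 bit partial product fits in 64 bits for any
delta, and the * 2^32 is folded into bt->sec and the fraction.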