powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Konstantin Belousov
kostikbel at gmail.com
Sun Mar 3 11:19:45 UTC 2019
On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
>
> > On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
> >> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>> ...
> >>> So I am able to reproduce it with some surprising ease on HPET running
> >>> on Haswell.
> >>
> >> So what is the cause of it? Maybe the tickless code doesn't generate
> >> fake clock ticks right. Or it is just a library bug. The kernel has
> >> to be slightly real-time to satisfy the requirement of 1 update per second.
> >> Applications are further from being real-time. But isn't it enough
> >> for the kernel to ensure that the timehands cycle more than once per
> >> second?
> > No, I entered ddb as you suggested.
>
> But using ddb is not normal. It is convenient that this fixes HPET and
> ACPI timecounters after using ddb, but this method doesn't help for
> timecounters that wrap fast. TSC-low at 2GHz wraps in 2 seconds, and
> i8254 wraps in a few milliseconds.
>
> >> I don't like changing this at all. binuptime() was carefully written
> >> to not need so much 64-bit arithmetic.
> >>
> >> If this pessimization is allowed, then it can also handle 64-bit
> >> deltas. Using the better kernel method:
> >>
> >> if (__predict_false(delta >= th->th_large_delta)) {
> >> bt->sec += (scale >> 32) * (delta >> 32);
> >> x = (scale >> 32) * (delta & 0xffffffff);
> >> bt->sec += x >> 32;
> >> bintime_addx(bt, x << 32);
> >> x = (scale & 0xffffffff) * (delta >> 32);
> >> bt->sec += x >> 32;
> >> bintime_addx(bt, x << 32);
> >> bintime_addx(bt, (scale & 0xffffffff) *
> >> (delta & 0xffffffff));
> >> } else
> >> bintime_addx(bt, scale * (delta & 0xffffffff));
> > This only makes sense if delta is extended to uint64_t, which requires
> > the pass over timecounters.
>
> Yes, that was its point. It is a bit annoying to have a hardware
> timecounter like the TSC that doesn't wrap naturally, but then make it
> wrap by masking high bits.
>
> The masking step is also a bit wasteful. For the TSC, it is 1 step to
> discard the high bits at the register level, then another step to apply the
> mask to discard the high bits again.
rdtsc-low is implemented in the natural way: after RDTSC, no combining of
registers into a 64-bit value is done; instead, shrd operates on %edx:%eax
to get the final result into %eax. I am not sure what you refer to.
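For reference, the four-partial-product split quoted earlier can be written
out as a standalone sketch (the names here are mine, not the committed code)
and checked against a 128-bit reference product:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for the kernel's struct bintime. */
struct xbintime {
	uint64_t sec;	/* whole seconds */
	uint64_t frac;	/* 64-bit binary fraction of a second */
};

/* Add to the fraction, carrying into the seconds on overflow. */
static void
xbintime_addx(struct xbintime *bt, uint64_t x)
{
	uint64_t ofrac = bt->frac;

	bt->frac += x;
	if (bt->frac < ofrac)
		bt->sec++;
}

/*
 * Accumulate scale * delta (a full 64x64 -> 128-bit product) into
 * (sec, frac) using only 64-bit arithmetic, mirroring the
 * large-delta branch quoted above.
 */
static void
xbintime_mul(struct xbintime *bt, uint64_t scale, uint64_t delta)
{
	uint64_t x;

	bt->sec += (scale >> 32) * (delta >> 32);
	x = (scale >> 32) * (delta & 0xffffffff);
	bt->sec += x >> 32;
	xbintime_addx(bt, x << 32);
	x = (scale & 0xffffffff) * (delta >> 32);
	bt->sec += x >> 32;
	xbintime_addx(bt, x << 32);
	xbintime_addx(bt, (scale & 0xffffffff) * (delta & 0xffffffff));
}
```

On compilers with __uint128_t the result can be compared directly against
the full product, which is how the sketch above was sanity-checked.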
>
> >> I just noticed that there is a 64 x 32 -> 64 bit multiplication in the
> >> current method. This can be changed to do explicit 32 x 32 -> 64 bit
> >> multiplications and fix the overflow problem at small extra cost on
> >> 32-bit arches:
> >>
> >> /* 32-bit arches did the next multiplication implicitly. */
> >> x = (scale >> 32) * delta;
> >> /*
> >> * And they did the following shifts and most of the adds
> >> * implicitly too. Except shifting x left by 32 lost the
> >> * seconds part that the next line handles. The next line
> >> * is the only extra cost for them.
> >> */
> >> bt->sec += x >> 32;
> >> bintime_addx(bt, (x << 32) + (scale & 0xffffffff) * delta);
> >
> > Ok, what about the following.
>
> I'm not sure that I really want this, even if the pessimization is done.
> But it avoids using fls*(), so is especially good for 32-bit systems and
> OK for 64-bit systems too, especially in userland where fls*() is in the
> fast path.
For userland I looked at the generated code, and BSR usage seems to be
good enough for the default compilation settings with clang.
>
> >
> > diff --git a/lib/libc/sys/__vdso_gettimeofday.c b/lib/libc/sys/__vdso_gettimeofday.c
> > index 3749e0473af..cfe3d96d001 100644
> > --- a/lib/libc/sys/__vdso_gettimeofday.c
> > +++ b/lib/libc/sys/__vdso_gettimeofday.c
> > @@ -32,6 +32,8 @@ __FBSDID("$FreeBSD$");
> > #include <sys/time.h>
> > #include <sys/vdso.h>
> > #include <errno.h>
> > +#include <limits.h>
>
> Not needed with 0xffffffff instead of UINT_MAX.
>
> The userland part is otherwise little changed.
Yes, see above. If the ABI for the shared page is changed at some point in
the future, I will export th_large_delta as well.
>
> > diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> > index 2656fb4d22f..2e28f872229 100644
> > --- a/sys/kern/kern_tc.c
> > +++ b/sys/kern/kern_tc.c
> > ...
> > @@ -351,17 +352,44 @@ fbclock_getmicrotime(struct timeval *tvp)
> > } while (gen == 0 || gen != th->th_generation);
> > }
> > #else /* !FFCLOCK */
> > +
> > +static void
> > +bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
> > +{
> > + uint64_t x;
> > +
> > + x = (*scale >> 32) * delta;
> > + *scale &= 0xffffffff;
> > + bt->sec += x >> 32;
> > + bintime_addx(bt, x << 32);
> > +}
>
> It is probably best to not inline the slow path, but clang tends to
> inline everything anyway.
It does not matter if it is inlined, as long as it is moved out of the
linear sequence for the fast path.
>
> I prefer my way of writing this in 3 lines. Modifying 'scale' for
> the next step is especially ugly and pessimal when the next step is
> in the caller and this function is not inlined.
Can you show exactly what you want?
>
> > +
> > void
> > binuptime(struct bintime *bt)
> > {
> > struct timehands *th;
> > - u_int gen;
> > + uint64_t scale;
> > + u_int delta, gen;
> >
> > do {
> > th = timehands;
> > gen = atomic_load_acq_int(&th->th_generation);
> > *bt = th->th_offset;
> > - bintime_addx(bt, th->th_scale * tc_delta(th));
> > + scale = th->th_scale;
> > + delta = tc_delta(th);
> > +#ifdef _LP64
> > + /* Avoid overflow for scale * delta. */
> > + if (__predict_false(th->th_large_delta <= delta))
> > + bintime_helper(bt, &scale, delta);
> > + bintime_addx(bt, scale * delta);
> > +#else
> > + /*
> > + * Also avoid (uint64_t, uint32_t) -> uint64_t
> > + * multiplication on 32bit arches.
> > + */
>
> "Also avoid overflow for ..."
>
> > + bintime_helper(bt, &scale, delta);
> > + bintime_addx(bt, (u_int)scale * delta);
>
> The cast should be to uint32_t, but better write it as & 0xffffffff as
> elsewhere.
>
> bintime_helper() already reduced 'scale' to 32 bits. The cast might be
> needed to tell the compiler this, especially when the function is not
> inlined. Better not do it in the function. The function doesn't even
> use the reduced value.
I used the cast to get a 32x32 multiplication. I am not sure that all (or
any) compilers are smart enough to deduce that they can use a 32-bit mul.
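For what it's worth, the split that the helper performs is exact, and both
partial products stay within 64 bits; here is a standalone sketch (with
hypothetical names) that makes the decomposition explicit:

```c
#include <assert.h>
#include <stdint.h>

/*
 * With a 32-bit delta, scale * delta splits exactly into
 *   ((scale >> 32) * delta) << 32  +  (scale & 0xffffffff) * delta.
 * Both partial products are 32x32 -> 64 and cannot overflow; only
 * the final addition can carry into the seconds.
 */
static void
split_mul(uint64_t scale, uint32_t delta, uint64_t *sec, uint64_t *frac)
{
	uint64_t hi, lo;

	hi = (scale >> 32) * delta;	/* the helper's partial product */
	*sec = hi >> 32;		/* whole seconds from the high part */
	/* The cast makes this a true 32x32 -> 64 multiplication. */
	lo = (uint64_t)(uint32_t)scale * delta;
	*frac = (hi << 32) + lo;
	if (*frac < lo)			/* carry out of the fraction */
		(*sec)++;
}
```

Comparing against a __uint128_t reference product confirms the split is
lossless even at the extreme values of both operands.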
>
> bintime_helper() is in the fast path in this case, so should be inlined.
>
> > +#endif
> > atomic_thread_fence_acq();
> > } while (gen == 0 || gen != th->th_generation);
> > }
>
> This needs lots of testing of course.
The current kernel-only part of the change is below; see the question above
about your preference for bintime_helper().
diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 2656fb4d22f..6c41ab22288 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -72,6 +71,7 @@ struct timehands {
struct timecounter *th_counter;
int64_t th_adjustment;
uint64_t th_scale;
+ uint64_t th_large_delta;
u_int th_offset_count;
struct bintime th_offset;
struct bintime th_bintime;
@@ -351,17 +351,45 @@ fbclock_getmicrotime(struct timeval *tvp)
} while (gen == 0 || gen != th->th_generation);
}
#else /* !FFCLOCK */
+
+static void
+bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
+{
+ uint64_t x;
+
+ x = (*scale >> 32) * delta;
+ *scale &= 0xffffffff;
+ bt->sec += x >> 32;
+ bintime_addx(bt, x << 32);
+}
+
void
binuptime(struct bintime *bt)
{
struct timehands *th;
- u_int gen;
+ uint64_t scale;
+ u_int delta, gen;
do {
th = timehands;
gen = atomic_load_acq_int(&th->th_generation);
*bt = th->th_offset;
- bintime_addx(bt, th->th_scale * tc_delta(th));
+ scale = th->th_scale;
+ delta = tc_delta(th);
+#ifdef _LP64
+ /* Avoid overflow for scale * delta. */
+ if (__predict_false(th->th_large_delta <= delta))
+ bintime_helper(bt, &scale, delta);
+ bintime_addx(bt, scale * delta);
+#else
+ /*
+ * Avoid both overflow as above and
+ * (uint64_t, uint32_t) -> uint64_t
+ * multiplication on 32bit arches.
+ */
+ bintime_helper(bt, &scale, delta);
+ bintime_addx(bt, (uint32_t)scale * delta);
+#endif
atomic_thread_fence_acq();
} while (gen == 0 || gen != th->th_generation);
}
@@ -388,13 +416,29 @@ void
bintime(struct bintime *bt)
{
struct timehands *th;
- u_int gen;
+ uint64_t scale;
+ u_int delta, gen;
do {
th = timehands;
gen = atomic_load_acq_int(&th->th_generation);
*bt = th->th_bintime;
- bintime_addx(bt, th->th_scale * tc_delta(th));
+ scale = th->th_scale;
+ delta = tc_delta(th);
+#ifdef _LP64
+ /* Avoid overflow for scale * delta. */
+ if (__predict_false(th->th_large_delta <= delta))
+ bintime_helper(bt, &scale, delta);
+ bintime_addx(bt, scale * delta);
+#else
+ /*
+ * Avoid both overflow as above and
+ * (uint64_t, uint32_t) -> uint64_t
+ * multiplication on 32bit arches.
+ */
+ bintime_helper(bt, &scale, delta);
+ bintime_addx(bt, (uint32_t)scale * delta);
+#endif
atomic_thread_fence_acq();
} while (gen == 0 || gen != th->th_generation);
}
@@ -1464,6 +1508,7 @@ tc_windup(struct bintime *new_boottimebin)
scale += (th->th_adjustment / 1024) * 2199;
scale /= th->th_counter->tc_frequency;
th->th_scale = scale * 2;
+ th->th_large_delta = ((uint64_t)1 << 63) / scale;
/*
* Now that the struct timehands is again consistent, set the new
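The guard value set in tc_windup() above can also be sanity-checked in
isolation. This sketch (names are mine) verifies that any delta below the
threshold keeps scale * delta under 2^63, so the fast-path 64-bit multiply
cannot overflow:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same computation as in tc_windup(): the smallest delta for which
 * scale * delta might reach 2^63.  Deltas below this value are safe
 * for the plain 64-bit multiply in the fast path.
 */
static uint64_t
large_delta(uint64_t scale)
{
	return (((uint64_t)1 << 63) / scale);
}
```

The threshold is conservative: a product at or slightly above 2^63 still
fits in 64 bits, so the boundary does not need to be exact, only safe.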