powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Konstantin Belousov
kostikbel at gmail.com
Mon Mar 4 11:42:00 UTC 2019
On Mon, Mar 04, 2019 at 05:29:48AM +1100, Bruce Evans wrote:
> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
>
> > On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
> >> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
> >>
> >>> On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
> >>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>>>
> >>>>> On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
> >>>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> > * ...
> >>>> Yes, that was its point. It is a bit annoying to have a hardware
> >>>> timecounter like the TSC that doesn't wrap naturally, but then make it
> >>>> wrap by masking high bits.
> >>>>
> >>>> The masking step is also a bit wasteful.  For the TSC, it is 1 step to
> >>>> discard high bits at the register level, then another step to apply the
> >>>> mask to discard the high bits again.
> >>> rdtsc-low is implemented in the natural way: after RDTSC, no register
> >>> combining into a 64bit value is done; instead, shrd operates on %edx:%eax
> >>> to get the final result into %eax.  I am not sure what you refer to.
> >>
> >> I was referring mostly to the masking step '& tc->tc_counter_mask' and
> >> the lack of register combining in rdtsc().
> >>
> >> However, shrd in rdtsc-low (tsc_get_timecount_low()) does a slow combining
> >> step. i386 used to be faster here -- the first masking step of discarding
> >> %edx doesn't take any code. amd64 has to mask out the top bits in %rax.
> >> Now for the tsc-low pessimization, i386 has to do a slow shrd, and amd64
> >> has to do a not so slow shr.
> > i386 cannot discard %edx after RDTSC since some bits from %edx come into
> > the timecounter value.
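For reference, tsc_get_timecount_low() is approximately the following; I am
paraphrasing sys/x86/x86/tsc.c from memory, so the exact constraints may
differ:

static u_int
tsc_get_timecount_low(struct timecounter *tc)
{
	uint32_t rv;

	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
	return (rv);
}

The shift count is carried in tc_priv, and shrd writes the combined, shifted
result into %eax.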
>
> These bits are part of the tsc-low pessimization. The shift count should
> always be 1, giving a TSC frequency of > INT32_MAX (usually) and > UINT32_MAX
> sometimes.
>
> When tsc-low was new, the shift count was often larger (as much as 8),
> and it is still changeable by a read-only tunable, but now it is 1 in
> almost all cases. The code only limits the timecounter frequency
> to UINT_MAX, except the tunable defaults to 1, so average CPUs running
> at nearly 4 GHz are usually limited to about 2 GHz.  The comment about
> this UINT_MAX doesn't match the code. The comment says int, but the
> code says UINT.
>
> All that a shift count of 1 does is waste time to lose 1 bit of accuracy.
> This much accuracy is noise for most purposes.
>
> The tunable is fairly undocumented. Its description is "Shift to apply
> for the maximum TSC frequency". Of course, it has no effect on the TSC
> frequency. It only affects the TSC timecounter frequency.
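For reference, the tunable only rescales what is registered with the
timecounter framework; paraphrasing init_TSC_tc() (the lfence/mfence variants
are selected similarly, and details may differ):

	if (shift > 0)
		tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
	tsc_timecounter.tc_frequency = tsc_freq >> shift;
	tsc_timecounter.tc_priv = (void *)(intptr_t)shift;

The TSC itself keeps counting at full rate.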
I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
Otherwise, I think, some multi-socket machines would start showing
detectable backward-counting bintime().  At frequencies of 4GHz and
above (Intel has 5GHz part numbers) I do not think that the stability of
the 100MHz crystal and the on-board traces is enough to avoid that.
We can try to set the tsc-low shift count to 0 (but keep lfence) and see
what is going on in HEAD, but I am afraid that the HEAD user population
is not representative enough to catch the issue with certainty.
Moreover, it is unclear to me how to diagnose the cause; e.g., I would expect
the sleeps to hang on timeouts, as was reported from the very beginning
of this thread.  How would we root-cause it?
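One crude detector, only a sketch (it can show that backward steps exist, not
why; to exercise the cross-socket case one would also need instances pinned
to different CPUs with cpuset(1), or a thread migrating between sockets):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec prev, cur;

	clock_gettime(CLOCK_MONOTONIC_FAST, &prev);
	for (;;) {
		clock_gettime(CLOCK_MONOTONIC_FAST, &cur);
		if (cur.tv_sec < prev.tv_sec ||
		    (cur.tv_sec == prev.tv_sec && cur.tv_nsec < prev.tv_nsec))
			printf("backstep %jd.%09ld -> %jd.%09ld\n",
			    (intmax_t)prev.tv_sec, prev.tv_nsec,
			    (intmax_t)cur.tv_sec, cur.tv_nsec);
		prev = cur;
	}
}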
>
> The cputicker normally uses the TSC without even an lfence. This use
> only has to be monotonic per-CPU, so this is OK. Also, any bugs hidden
> by discarding low bits shouldn't show up per-CPU. However, keeping
> the cputicker below 4G actually has some efficiency advantages. For
> timecounters, there are no multiplications or divisions by the frequency
> in the fast path, but cputicker use isn't so optimized and it does a
> slow 64-bit division in cputick2usec().  Keeping cpu_tick_frequency
> below UINT_MAX allows dividing by it in integer arithmetic in some cases.
> This optimization is not done.
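For context, cputick2usec() is currently roughly the following (I am quoting
from memory, the real code may differ):

uint64_t
cputick2usec(uint64_t tick)
{
	return ((tick / cpu_tick_frequency) * 1000000 +
	    (tick % cpu_tick_frequency) * 1000000 / cpu_tick_frequency);
}

so there are two full 64-bit divisions on that path.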
>
> > amd64 cannot either, but amd64 does not need to mask out top bits in %rax,
> > since the whole shrdl calculation occurs in 32bit registers, and the result
> > is in %rax, where the top word is cleared by the shrdl instruction
> > automatically.  But the clearing is not required since the result is
> > unsigned int anyway.
> >
> > Disassembly of tsc_get_timecount_low() is very clear:
> > 0xffffffff806767e4 <+4>: mov 0x30(%rdi),%ecx
> > 0xffffffff806767e7 <+7>: rdtsc
> > 0xffffffff806767e9 <+9>: shrd %cl,%edx,%eax
> > ...
> > 0xffffffff806767ed <+13>: retq
> > (I removed frame manipulations).
>
> It would be required without the shift pessimization, since the function
> returns uint32_t but rdtsc() gives uint64_t.  Removing the top bits is not
> needed since tc_delta() removes them again, but the API doesn't allow
> expressing this.
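For reference, tc_delta() in kern_tc.c:

static __inline u_int
tc_delta(struct timehands *th)
{
	struct timecounter *tc;

	tc = th->th_counter;
	return ((tc->tc_get_timecount(tc) - th->th_offset_count) &
	    tc->tc_counter_mask);
}

so any bits above tc_counter_mask are masked off again here.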
>
> Without the shift pessimization, we just do rdtsc() in all cases and don't
> need this function call. I think this is about 5-10 cycles faster after
> some parallelism.
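The unshifted path is just, again paraphrasing tsc.c:

static u_int
tsc_get_timecount(struct timecounter *tc __unused)
{
	return (rdtsc32());
}

where rdtsc32() executes rdtsc and keeps only %eax, so no combining or
masking code is emitted at all.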
>
> >>>> I prefer my way of writing this in 3 lines. Modifying 'scale' for
> >>>> the next step is especially ugly and pessimal when the next step is
> >>>> in the caller and this function is not inlined.
> >>> Can you show exactly what do you want ?
> >>
> >> Just write 'scale & 0xffffffff' for the low bits of 'scale' in callers,
> >> and don't pass 'scale' indirectly to bintime_helper() and don't modify
> >> it there.
> >>
> >> Oops, there is a problem. 'scale' must be reduced iff bintime_helper()
> >> was used. Duplicate some source code so as to not need a fall-through
> >> to the fast path. See below.
> > Yes, this is the reason why it is passed by pointer (C has no references).
>
> The indirection is slow no matter how it is spelled, unless it is inlined
> away.
>
> >>> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> >>> index 2656fb4d22f..6c41ab22288 100644
> >>> --- a/sys/kern/kern_tc.c
> >>> +++ b/sys/kern/kern_tc.c
> >>> @@ -72,6 +71,7 @@ struct timehands {
> >>> struct timecounter *th_counter;
> >>> int64_t th_adjustment;
> >>> uint64_t th_scale;
> >>> + uint64_t th_large_delta;
> >>> u_int th_offset_count;
> >>> struct bintime th_offset;
> >>> struct bintime th_bintime;
> >>> @@ -351,17 +351,45 @@ fbclock_getmicrotime(struct timeval *tvp)
> >>> } while (gen == 0 || gen != th->th_generation);
> >>> }
> >>> #else /* !FFCLOCK */
> >>> +
> >>> +static void
> >>
> >> Add __inline. This is in the fast path for 32-bit systems.
> > Compilers do not need this hand-holding, and I prefer to avoid __inline
> > unless really necessary.  I checked with both clang 7.0 and gcc 8.3
> > that autoinlining did occur.
>
> But they do.  I don't use either of these compilers, and turn off inlining
> as much as possible anyway using -fno-inline -fno-inline-functions-called-once
> (this is very broken in clang -- -fno-inline turns off inlining of
> even functions declared as __inline (like curthread), and clang doesn't
> support -fno-inline-functions-called-once).
>
> >> ...
> >> Similarly in bintime().
> > I merged two functions, finally. Having to copy the same code is too
> > annoying for this change.
> >
> > So I verified that:
> > - there is no 64bit multiplication in the generated code, for i386 both
> > for clang 7.0 and gcc 8.3;
> > - that everything is inlined, the only call from bintime/binuptime is
> > the indirect call to get the timecounter value.
>
> I will have to fix it for compilers that I use.
Ok, I will add __inline.
>
> > diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> > index 2656fb4d22f..0fd39e25058 100644
> > --- a/sys/kern/kern_tc.c
> > +++ b/sys/kern/kern_tc.c
> + ...
> > +static void
> > +binnouptime(struct bintime *bt, u_int off)
> > {
> > struct timehands *th;
> > - u_int gen;
> > + struct bintime *bts;
> > + uint64_t scale;
> > + u_int delta, gen;
> >
> > do {
> > th = timehands;
> > gen = atomic_load_acq_int(&th->th_generation);
> > - *bt = th->th_offset;
> > - bintime_addx(bt, th->th_scale * tc_delta(th));
> > + bts = (struct bintime *)(vm_offset_t)th + off;
>
> I don't like the merging. It obscures the code with conversions like this.
>
> > + *bt = *bts;
> > + scale = th->th_scale;
> > + delta = tc_delta(th);
> > +#ifdef _LP64
> > + if (__predict_false(th->th_large_delta <= delta)) {
> > + /* Avoid overflow for scale * delta. */
> > + bintime_helper(bt, scale, delta);
> > + bintime_addx(bt, (scale & 0xffffffff) * delta);
> > + } else {
> > + bintime_addx(bt, scale * delta);
> > + }
> > +#else
> > + /*
> > + * Use bintime_helper() unconditionally, since the fast
> > + * path in the above method is not so fast here, since
> > + * the 64 x 32 -> 64 bit multiplication is usually not
> > + * available in hardware and emulating it using 2
> > + * 32 x 32 -> 64 bit multiplications uses code much
> > + * like that in bintime_helper().
> > + */
> > + bintime_helper(bt, scale, delta);
> > + bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
> > +#endif
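bintime_helper() is elided from the hunk above; it splits off the high word
of the scale, approximately:

static void
bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)
{
	uint64_t x;

	x = (scale >> 32) * delta;
	bt->sec += x >> 32;
	bintime_addx(bt, x << 32);
}

so scale * delta is computed as (scale >> 32) * delta * 2^32 plus the
(scale & 0xffffffff) * delta that the callers add afterwards.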
>
> Check that this method is really better. Without this, the complicated
> part is about half as large and duplicating it is smaller than this
> version.
Better in what sense?  I am fine with the C code, and the asm code looks
good.
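For completeness, the threshold itself would be initialized in tc_windup()
where th_scale is recomputed; with 'scale' being th_scale / 2 as in the
existing computation there, something like this sketch (not part of the
quoted diff):

	th->th_scale = scale * 2;
	th->th_large_delta = MIN(((uint64_t)1 << 63) / scale, UINT_MAX);

which keeps th_scale * delta below 2^64 on the fast path.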
>
> > @@ -387,16 +430,8 @@ microuptime(struct timeval *tvp)
> > void
> > bintime(struct bintime *bt)
> > {
> > - struct timehands *th;
> > - u_int gen;
> >
> > - do {
> > - th = timehands;
> > - gen = atomic_load_acq_int(&th->th_generation);
> > - *bt = th->th_bintime;
> > - bintime_addx(bt, th->th_scale * tc_delta(th));
> > - atomic_thread_fence_acq();
> > - } while (gen == 0 || gen != th->th_generation);
>
> Duplicating this loop is much better than obfuscating it using inline
> functions. This loop was almost duplicated (except for the delta
> calculation) in no less than 17 functions in kern_tc.c (9 tc ones and
> 8 fflock ones). Now it is only duplicated 16 times.
How did you count the 16?  I can see only 4 instances in the unpatched
kern_tc.c, and 3 in the patched one, but it is 3 and not 1 only because I do
not touch ffclock until the patch is finalized.  After that, it would be
1 instance for the kernel and 1 for userspace.
>
> > + binnouptime(bt, __offsetof(struct timehands, th_bintime));
> > }
> >
> > void
>
> Bruce