powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Konstantin Belousov
kostikbel at gmail.com
Mon Mar 4 11:42:00 UTC 2019
On Mon, Mar 04, 2019 at 05:29:48AM +1100, Bruce Evans wrote:
> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
>
> > On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
> >> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
> >>
> >>> On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
> >>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>>>
> >>>>> On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
> >>>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> > * ...
> >>>> Yes, that was its point. It is a bit annoying to have a hardware
> >>>> timecounter like the TSC that doesn't wrap naturally, but then make it
> >>>> wrap by masking high bits.
> >>>>
> >>>> The masking step is also a bit wasteful.  For the TSC, it is 1 step to
> >>>> discard high bits at the register level, then another step to apply the
> >>>> mask to discard the high bits again.
> >>> rdtsc-low is implemented in the natural way: after RDTSC, no register
> >>> combining into a 64bit value is done; instead, shrd operates on %edx:%eax
> >>> to get the final result into %eax.  I am not sure what you refer to.
> >>
> >> I was referring mostly to the masking step '& tc->tc_counter_mask' and
> >> the lack of register combining in rdtsc().
> >>
> >> However, shrd in rdtsc-low (tsc_get_timecount_low()) does a slow combining
> >> step. i386 used to be faster here -- the first masking step of discarding
> >> %edx doesn't take any code. amd64 has to mask out the top bits in %rax.
> >> Now for the tsc-low pessimization, i386 has to do a slow shrd, and amd64
> >> has to do a not so slow shr.
> > i386 cannot discard %edx after RDTSC since some bits from %edx come into
> > the timecounter value.
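For reference, tsc_get_timecount_low() is approximately the following; I am
paraphrasing sys/x86/x86/tsc.c from memory, so the exact constraints may
differ:

static u_int
tsc_get_timecount_low(struct timecounter *tc)
{
	uint32_t rv;

	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
	return (rv);
}

The shift count is carried in tc_priv, and shrd writes the combined, shifted
result into %eax.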
>
> These bits are part of the tsc-low pessimization. The shift count should
> always be 1, giving a TSC frequency of > INT32_MAX (usually) and > UINT32_MAX
> sometimes.
>
> When tsc-low was new, the shift count was often larger (as much as 8),
> and it is still changeable by a read-only tunable, but now it is 1 in
> almost all cases. The code only limits the timecounter frequency
> to UINT_MAX, except the tunable defaults to 1, so average CPUs running
> at nearly 4 GHz are usually limited to about 2 GHz.  The comment about
> this UINT_MAX doesn't match the code. The comment says int, but the
> code says UINT.
>
> All that a shift count of 1 does is waste time to lose 1 bit of accuracy.
> This much accuracy is noise for most purposes.
>
> The tunable is fairly undocumented. Its description is "Shift to apply
> for the maximum TSC frequency". Of course, it has no effect on the TSC
> frequency. It only affects the TSC timecounter frequency.
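For reference, the tunable only rescales what is registered with the
timecounter framework; paraphrasing init_TSC_tc() (the lfence/mfence variants
are selected similarly, and details may differ):

	if (shift > 0)
		tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
	tsc_timecounter.tc_frequency = tsc_freq >> shift;
	tsc_timecounter.tc_priv = (void *)(intptr_t)shift;

The TSC itself keeps counting at full rate.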
I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
Otherwise, I think, some multi-socket machines would start showing
detectable backward-counting bintime().  At frequencies of 4GHz and
above (Intel has 5GHz part numbers) I do not think that the stability of
the 100MHz crystal and the on-board traces is enough to avoid that.
We can try to set the tsc-low shift count to 0 (but keep lfence) and see
what is going on in HEAD, but I am afraid that the HEAD user population
is not representative enough to catch the issue with certainty.
Moreover, it is unclear to me how to diagnose the cause; e.g., I would expect
the sleeps to hang on timeouts, as was reported from the very beginning
of this thread.  How would we root-cause it?
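One crude detector, only a sketch (it can show that backward steps exist, not
why; to exercise the cross-socket case one would also need instances pinned
to different CPUs with cpuset(1), or a thread migrating between sockets):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec prev, cur;

	clock_gettime(CLOCK_MONOTONIC_FAST, &prev);
	for (;;) {
		clock_gettime(CLOCK_MONOTONIC_FAST, &cur);
		if (cur.tv_sec < prev.tv_sec ||
		    (cur.tv_sec == prev.tv_sec && cur.tv_nsec < prev.tv_nsec))
			printf("backstep %jd.%09ld -> %jd.%09ld\n",
			    (intmax_t)prev.tv_sec, prev.tv_nsec,
			    (intmax_t)cur.tv_sec, cur.tv_nsec);
		prev = cur;
	}
}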
>
> The cputicker normally uses the TSC without even an lfence. This use
> only has to be monotonic per-CPU, so this is OK. Also, any bugs hidden
> by discarding low bits shouldn't show up per-CPU. However, keeping
> the cputicker below 4G actually has some efficiency advantages. For
> timecounters, there are no multiplications or divisions by the frequency
> in the fast path, but cputicker use isn't so optimized and it does a
> slow 64-bit division in cputick2usec().  Keeping cpu_tick_frequency
> below UINT_MAX allows dividing by it in integer arithmetic in some cases.
> This optimization is not done.
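For context, cputick2usec() is currently roughly the following (I am quoting
from memory, the real code may differ):

uint64_t
cputick2usec(uint64_t tick)
{
	return ((tick / cpu_tick_frequency) * 1000000 +
	    (tick % cpu_tick_frequency) * 1000000 / cpu_tick_frequency);
}

so there are two full 64-bit divisions on that path.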
>
> > amd64 cannot either, but amd64 does not need to mask out top bits in %rax,
> > since the whole shrdl calculation occurs in 32bit registers, and the result
> > is in %rax, where the top word is cleared by the shrdl instruction
> > automatically.  But the clearing is not required since the result is
> > unsigned int anyway.
> >
> > Disassembly of tsc_get_timecount_low() is very clear:
> > 0xffffffff806767e4 <+4>: mov 0x30(%rdi),%ecx
> > 0xffffffff806767e7 <+7>: rdtsc
> > 0xffffffff806767e9 <+9>: shrd %cl,%edx,%eax
> > ...
> > 0xffffffff806767ed <+13>: retq
> > (I removed frame manipulations).
>
> It would be required without the shift pessimization, since the function
> returns uint32_t but rdtsc() gives uint64_t.  Removing the top bits is not
> needed since tc_delta() removes them again, but the API doesn't allow
> expressing this.
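For reference, tc_delta() in kern_tc.c:

static __inline u_int
tc_delta(struct timehands *th)
{
	struct timecounter *tc;

	tc = th->th_counter;
	return ((tc->tc_get_timecount(tc) - th->th_offset_count) &
	    tc->tc_counter_mask);
}

so any bits above tc_counter_mask are masked off again here.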
>
> Without the shift pessimization, we just do rdtsc() in all cases and don't
> need this function call. I think this is about 5-10 cycles faster after
> some parallelism.
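The unshifted path is just, again paraphrasing tsc.c:

static u_int
tsc_get_timecount(struct timecounter *tc __unused)
{
	return (rdtsc32());
}

where rdtsc32() executes rdtsc and keeps only %eax, so no combining or
masking code is emitted at all.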
>
> >>>> I prefer my way of writing this in 3 lines. Modifying 'scale' for
> >>>> the next step is especially ugly and pessimal when the next step is
> >>>> in the caller and this function is not inlined.
> >>> Can you show exactly what do you want ?
> >>
> >> Just write 'scale & 0xffffffff' for the low bits of 'scale' in callers,
> >> and don't pass 'scale' indirectly to bintime_helper() and don't modify
> >> it there.
> >>
> >> Oops, there is a problem. 'scale' must be reduced iff bintime_helper()
> >> was used. Duplicate some source code so as to not need a fall-through
> >> to the fast path. See below.
> > Yes, this is the reason why it is passed by pointer (C has no references).
>
> The indirection is slow no matter how it is spelled, unless it is inlined
> away.
>
> >>> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> >>> index 2656fb4d22f..6c41ab22288 100644
> >>> --- a/sys/kern/kern_tc.c
> >>> +++ b/sys/kern/kern_tc.c
> >>> @@ -72,6 +71,7 @@ struct timehands {
> >>> struct timecounter *th_counter;
> >>> int64_t th_adjustment;
> >>> uint64_t th_scale;
> >>> + uint64_t th_large_delta;
> >>> u_int th_offset_count;
> >>> struct bintime th_offset;
> >>> struct bintime th_bintime;
> >>> @@ -351,17 +351,45 @@ fbclock_getmicrotime(struct timeval *tvp)
> >>> } while (gen == 0 || gen != th->th_generation);
> >>> }
> >>> #else /* !FFCLOCK */
> >>> +
> >>> +static void
> >>
> >> Add __inline. This is in the fast path for 32-bit systems.
> > Compilers do not need this hand-holding, and I prefer to avoid __inline
> > unless really necessary.  I checked with both clang 7.0 and gcc 8.3
> > that autoinlining did occur.
>
> But they do.  I don't use either of these compilers, and turn off inlining
> as much as possible anyway using -fno-inline -fno-inline-functions-called-once
> (this is very broken in clang -- -fno-inline turns off inlining of
> even functions declared as __inline (like curthread), and clang doesn't
> support -fno-inline-functions-called-once).
>
> >> ...
> >> Similarly in bintime().
> > I merged two functions, finally. Having to copy the same code is too
> > annoying for this change.
> >
> > So I verified that:
> > - there is no 64bit multiplication in the generated code, for i386 both
> > for clang 7.0 and gcc 8.3;
> > - that everything is inlined, the only call from bintime/binuptime is
> > the indirect call to get the timecounter value.
>
> I will have to fix it for compilers that I use.
Ok, I will add __inline.
>
> > diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> > index 2656fb4d22f..0fd39e25058 100644
> > --- a/sys/kern/kern_tc.c
> > +++ b/sys/kern/kern_tc.c
> + ...
> > +static void
> > +binnouptime(struct bintime *bt, u_int off)
> > {
> > struct timehands *th;
> > - u_int gen;
> > + struct bintime *bts;
> > + uint64_t scale;
> > + u_int delta, gen;
> >
> > do {
> > th = timehands;
> > gen = atomic_load_acq_int(&th->th_generation);
> > - *bt = th->th_offset;
> > - bintime_addx(bt, th->th_scale * tc_delta(th));
> > + bts = (struct bintime *)(vm_offset_t)th + off;
>
> I don't like the merging. It obscures the code with conversions like this.
>
> > + *bt = *bts;
> > + scale = th->th_scale;
> > + delta = tc_delta(th);
> > +#ifdef _LP64
> > + if (__predict_false(th->th_large_delta <= delta)) {
> > + /* Avoid overflow for scale * delta. */
> > + bintime_helper(bt, scale, delta);
> > + bintime_addx(bt, (scale & 0xffffffff) * delta);
> > + } else {
> > + bintime_addx(bt, scale * delta);
> > + }
> > +#else
> > + /*
> > + * Use bintime_helper() unconditionally, since the fast
> > + * path in the above method is not so fast here, since
> > + * the 64 x 32 -> 64 bit multiplication is usually not
> > + * available in hardware and emulating it using 2
> > + * 32 x 32 -> 64 bit multiplications uses code much
> > + * like that in bintime_helper().
> > + */
> > + bintime_helper(bt, scale, delta);
> > + bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
> > +#endif
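bintime_helper() is elided from the hunk above; it splits off the high word
of the scale, approximately:

static void
bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)
{
	uint64_t x;

	x = (scale >> 32) * delta;
	bt->sec += x >> 32;
	bintime_addx(bt, x << 32);
}

so scale * delta is computed as (scale >> 32) * delta * 2^32 plus the
(scale & 0xffffffff) * delta that the callers add afterwards.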
>
> Check that this method is really better. Without this, the complicated
> part is about half as large and duplicating it is smaller than this
> version.
Better in what sense?  I am fine with the C code, and the asm code looks
good.
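For completeness, the threshold itself would be initialized in tc_windup()
where th_scale is recomputed; with 'scale' being th_scale / 2 as in the
existing computation there, something like this sketch (not part of the
quoted diff):

	th->th_scale = scale * 2;
	th->th_large_delta = MIN(((uint64_t)1 << 63) / scale, UINT_MAX);

which keeps th_scale * delta below 2^64 on the fast path.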
>
> > @@ -387,16 +430,8 @@ microuptime(struct timeval *tvp)
> > void
> > bintime(struct bintime *bt)
> > {
> > - struct timehands *th;
> > - u_int gen;
> >
> > - do {
> > - th = timehands;
> > - gen = atomic_load_acq_int(&th->th_generation);
> > - *bt = th->th_bintime;
> > - bintime_addx(bt, th->th_scale * tc_delta(th));
> > - atomic_thread_fence_acq();
> > - } while (gen == 0 || gen != th->th_generation);
>
> Duplicating this loop is much better than obfuscating it using inline
> functions. This loop was almost duplicated (except for the delta
> calculation) in no less than 17 functions in kern_tc.c (9 tc ones and
> 8 fflock ones). Now it is only duplicated 16 times.
How did you count the 16?  I can see only 4 instances in the unpatched
kern_tc.c, and 3 in the patched one, but it is 3 and not 1 only because I do
not touch ffclock until the patch is finalized.  After that, it would be
1 instance for the kernel and 1 for userspace.
>
> > + binnouptime(bt, __offsetof(struct timehands, th_bintime));
> > }
> >
> > void
>
> Bruce