powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]

Bruce Evans brde at optusnet.com.au
Thu Mar 7 14:31:41 UTC 2019


On Wed, 6 Mar 2019, Konstantin Belousov wrote:

> On Tue, Mar 05, 2019 at 05:17:14AM +1100, Bruce Evans wrote:
>> On Mon, 4 Mar 2019, Konstantin Belousov wrote:
>>
>>> On Mon, Mar 04, 2019 at 05:29:48AM +1100, Bruce Evans wrote:
>>>> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
>>>>
>>>>> On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
>* ...
>> I strongly dislike the merge.
>>
>>>>> So I verified that:
>>>>> - there is no 64bit multiplication in the generated code, for i386 both
>>>>>  for clang 7.0 and gcc 8.3;
>>>>> - that everything is inlined, the only call from bintime/binuptime is
>>>>>  the indirect call to get the timecounter value.
>>>>
>>>> I will have to fix it for compilers that I use.
>>> Ok, I will add __inline.
>>
>> That will make it fast enough, but still hard to read.
>>
>>>>> +		*bt = *bts;
>>>>> +		scale = th->th_scale;
>>>>> +		delta = tc_delta(th);
>>>>> +#ifdef _LP64
>>>>> +		if (__predict_false(th->th_large_delta <= delta)) {
>>>>> +			/* Avoid overflow for scale * delta. */
>>>>> +			bintime_helper(bt, scale, delta);
>>>>> +			bintime_addx(bt, (scale & 0xffffffff) * delta);
>>>>> +		} else {
>>>>> +			bintime_addx(bt, scale * delta);
>>>>> +		}
>>>>> +#else
>>>>> +		/*
>>>>> +		 * Use bintime_helper() unconditionally, since the fast
>>>>> +		 * path in the above method is not so fast here, since
>>>>> +		 * the 64 x 32 -> 64 bit multiplication is usually not
>>>>> +		 * available in hardware and emulating it using 2
>>>>> +		 * 32 x 32 -> 64 bit multiplications uses code much
>>>>> +		 * like that in bintime_helper().
>>>>> +		 */
>>>>> +		bintime_helper(bt, scale, delta);
>>>>> +		bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
>>>>> +#endif
>>>>
>>>> Check that this method is really better.  Without this, the complicated
>>>> part is about half as large and duplicating it is smaller than this
>>>> version.
>>> Better in what sense?  I am fine with the C code, and asm code looks
>>> good.
>>
>> Better in terms of actually running significantly faster.  I fear the
>> 32-bit method is actually slightly slower for the fast path.

I checked that it is just worse.  Significantly slower and more complicated.

I wrote and ran a lot of timing benchmarks of various versions.  All
times are in cycles on a Haswell @4.08 GHz.  On i386 except where noted:

- the fastest case is when compiled by clang with the default of -O2.
   binuptime() in a loop then takes 34 cycles.  This is faster than possible
   for latency, since rdtsc alone has a latency of 24 cycles.  There must be
   several iterations of the loop running in parallel.

- the slowest case is when compiled by gcc-4.2.1 with my config of -Os.
   binuptime() in a loop then takes 116 cycles.  -Os does at least the
   following pessimization: use memcpy() for copying the 12-byte struct
   bintime.

- gcc-4.2.1 -O2 takes 74 cycles.  -O2 still does the following pessimization:
   do a 64 x 32 -> 64 bit multiplication after not noticing that the first
   operand has been reduced to 32 bits by a shift or mask.
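
   A sketch of a workaround (not part of the final patch): the reduction
   can be made explicit with a cast, as in the i386 branch of the earlier
   patch quoted above, so that even old compilers emit a single
   32 x 32 -> 64 bit multiplication:

	/* Implicit: the compiler must notice the mask gives 32 bits. */
	bintime_addx(bt, (scale & 0xffffffff) * delta);
	/* Explicit: force a 32-bit first operand. */
	bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);

   Whether this actually helps depends on the compiler.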

The above tests were done with the final version.  The version that
tested alternatives used a switch on the method and took about 20 cycles
longer for the fastest version, presumably by defeating parallelism.
Times for various methods:

- with clang -Os, about 54 cycles for the old method that allowed overflow,
   and the same for the version with the check of the overflow threshold
   (but with the threshold never reached), and 59 cycles for the branch-
   free method.  100-116 cycles with gcc-4.2.1 -Os, with the branch-free
   method taking 5-10 cycles longer.

- on amd64, only a couple of cycles faster (49-50 cycles in best cases),
   and gcc-4.2.1 only taking a couple of cycles longer.  The branch-free
   method still takes about 59 cycles, so it is relatively worse.

In userland, using the syscall for clock_gettime(), the extra 5-10
cycles for the branch-free method is relatively insignificant.  It is
about 2 nanoseconds.  Other pessimizations are more significant.
Times for this syscall:
- amd64 now: 224 nsec (with gcc-4.2.1 -Os)
- i386 4+4 nopae: 500-580 nsec (depending on clang/gcc and -O2/-Os)
   even getpid(2) takes 280 nsec.  Add at least 140 more nsec for pae.
- i386 3+1: 224 nsec (with gcc 4.2.1 -Os)
- i386 FreeBSD-5 UP: 193 nsec (with gcc-3.3.3 -O).
- i386 4+4 nopae, old library version of clock_gettime() compiled by
   clang: 29 nsec.

In some tests, the version with the branch was even a cycle or two faster.
In the tests, the branch was always perfectly predicted, so it costs
nothing except possibly by changing scheduling in an accidentally good
way.  The
tests were too small to measure the cost of using branch prediction
resources.  I've never noticed a case where 1 more branch causes thrashing.
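
For reference, a loop of roughly the following form reproduces this kind
of measurement (a minimal sketch: the loop count, and binuptime() being
available as a userland copy of the code under test, are assumptions;
struct bintime is from <sys/time.h>):

	#include <sys/time.h>
	#include <stdint.h>
	#include <stdio.h>

	void binuptime(struct bintime *);	/* userland copy under test */

	static __inline uint64_t
	rdtsc(void)
	{
		uint32_t hi, lo;

		__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
		return ((uint64_t)hi << 32 | lo);
	}

	int
	main(void)
	{
		struct bintime bt;
		uint64_t t0, t1;
		int i;

		t0 = rdtsc();
		for (i = 0; i < 1000000; i++)
			binuptime(&bt);
		t1 = rdtsc();
		/* Averaging exposes parallelism across iterations. */
		printf("%ju cycles/call\n", (uintmax_t)((t1 - t0) / 1000000));
		return (0);
	}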

>>>>> -	do {
>>>>> -		th = timehands;
>>>>> -		gen = atomic_load_acq_int(&th->th_generation);
>>>>> -		*bt = th->th_bintime;
>>>>> -		bintime_addx(bt, th->th_scale * tc_delta(th));
>>>>> -		atomic_thread_fence_acq();
>>>>> -	} while (gen == 0 || gen != th->th_generation);
>>>>
>>>> Duplicating this loop is much better than obfuscating it using inline
>>>> functions.  This loop was almost duplicated (except for the delta
>>>> calculation) in no less than 17 functions in kern_tc.c (9 tc ones and
>>>> 8 fflock ones).  Now it is only duplicated 16 times.
>>> How did you count the 16?  I can see only 4 instances in the unpatched
>>> kern_tc.c, and 3 in patched, but it is 3 and not 1 only because I do not
>>> touch ffclock until the patch is finalized.  After that, it would be
>>> 1 instance for kernel and 1 for userspace.
>>
>> Grep for the end condition in this loop.  There are actually 20 of these.
>> I'm counting the loops and not the previously-simple scaling operation in
>> it.  The scaling is indeed only done for 4 cases.  I prefer the 20
>> duplications (except I only want about 6 of the functions).  Duplication
>> works even better for only 4 cases.
> Ok, I merged these as well.  Now there are only four loops left in kernel.
> I do not think that merging them is beneficial, since they have sufficiently
> different bodies.

This is exactly what I don't want.
>
> I disagree with your characterization of it as obfuscation, IMO it improves
> the maintainability of the code by reducing number of places which need
> careful inspection of the lock-less algorithm.

It makes the inspection and changes more difficult for each instance.
General functions are more difficult to work with since they need more
args to control them and can't be changed without affecting all callers.

In another thread, you didn't like similar churn for removing td args.
Here there isn't even a bug, since overflow only occurs when an
invariant is violated (the timehands are normally wound up often enough
that delta stays far below the overflow threshold).

>> This should be written as a function call to 1 new function to replace
>> the line with the overflowing multiplication.  The line is always the
>> same, so the new function call can look like bintime_xxx(bt, th).
> Again, please provide at least pseudocode of your preference.

The following is a complete, tested, and benchmarked implementation, with
a couple more minor fixes:

XX Index: kern_tc.c
XX ===================================================================
XX --- kern_tc.c	(revision 344852)
XX +++ kern_tc.c	(working copy)
XX @@ -72,6 +72,7 @@
XX  	struct timecounter	*th_counter;
XX  	int64_t			th_adjustment;
XX  	uint64_t		th_scale;
XX +	u_int			th_large_delta;
XX  	u_int	 		th_offset_count;
XX  	struct bintime		th_offset;
XX  	struct bintime		th_bintime;

Improvement not already discussed: use a u_int limit for the u_int variable.

XX @@ -90,6 +91,7 @@
XX  static struct timehands th0 = {
XX  	.th_counter = &dummy_timecounter,
XX  	.th_scale = (uint64_t)-1 / 1000000,
XX +	.th_large_delta = 1000000,
XX  	.th_offset = { .sec = 1 },
XX  	.th_generation = 1,
XX  	.th_next = &th1

Fix not already discussed: th_large_delta was used in the dummy timehands
before it was initialized.  Static initialization to 0 gives fail-safe
behaviour, but with unintended exercising of the slow path.

The dummy timecounter has a low frequency, so its overflow threshold is
quite low.  I think it is not used even 1000000 times unless there is a
bug in the boot code, so it doesn't overflow in practice.  I did see
some strange crashes at boot time while testing this.
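
(Worked out: the dummy timecounter's scale is about 2^64 / 10^6, so the
plain multiplication scale * delta first overflows 64 bits when delta
reaches about 10^6 ticks; the initializer of 1000000 routes deltas of
that size through the overflow-safe path instead.)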

XX @@ -351,6 +353,26 @@
XX  	} while (gen == 0 || gen != th->th_generation);
XX  }
XX  #else /* !FFCLOCK */
XX +
XX +static __inline void
XX +bintime_adddelta(struct bintime *bt, struct timehands *th)

Only 1 utility function now.

XX +{
XX +	uint64_t scale, x;
XX +	u_int delta;
XX +
XX +	scale = th->th_scale;
XX +	delta = tc_delta(th);
XX +	if (__predict_false(delta >= th->th_large_delta)) {
XX +		/* Avoid overflow for scale * delta. */
XX +		x = (scale >> 32) * delta;
XX +		bt->sec += x >> 32;
XX +		bintime_addx(bt, x << 32);
XX +		bintime_addx(bt, (scale & 0xffffffff) * delta);

This is clearer with all the scaling code together.

I thought of renaming x to x95_32 to sort of document that it holds bits
95..32 in a component of the product.
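
Spelled out, the decomposition is (a restatement, not part of the patch):

	scale * delta = ((scale >> 32) * delta) * 2^32 +
			(scale & 0xffffffff) * delta

x = (scale >> 32) * delta holds bits 95..32 of the full 96-bit product,
so x >> 32 (bits 95..64) is whole seconds and goes into bt->sec, while
x << 32 (bits 63..32) and the low term are fractions for bintime_addx().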

XX +	} else {
XX +		bintime_addx(bt, scale * delta);
XX +	}
XX +}
XX +
XX  void
XX  binuptime(struct bintime *bt)
XX  {
XX @@ -361,7 +383,7 @@
XX  		th = timehands;
XX  		gen = atomic_load_acq_int(&th->th_generation);
XX  		*bt = th->th_offset;
XX -		bintime_addx(bt, th->th_scale * tc_delta(th));
XX +		bintime_adddelta(bt, th);
XX  		atomic_thread_fence_acq();
XX  	} while (gen == 0 || gen != th->th_generation);
XX  }

This is the kind of non-churning change that I like.

The function name bintime_adddelta() isn't so good, but it is in the same
style as bintime_addx(), where the names are worse.  bintime_addx() is
global, so it needs a descriptive name more than a local helper does.
'delta' is more descriptive than 'x' (x means a scalar and not a bintime).
The 'bintime' prefix is verbose.  It should be bt, especially in
non-global APIs.

XX @@ -394,7 +416,7 @@
XX  		th = timehands;
XX  		gen = atomic_load_acq_int(&th->th_generation);
XX  		*bt = th->th_bintime;
XX -		bintime_addx(bt, th->th_scale * tc_delta(th));
XX +		bintime_adddelta(bt, th);
XX  		atomic_thread_fence_acq();
XX  	} while (gen == 0 || gen != th->th_generation);
XX  }
XX @@ -1464,6 +1486,7 @@
XX  	scale += (th->th_adjustment / 1024) * 2199;
XX  	scale /= th->th_counter->tc_frequency;
XX  	th->th_scale = scale * 2;
XX +	th->th_large_delta = MIN(((uint64_t)1 << 63) / scale, UINT_MAX);
XX 
XX  	/*
XX  	 * Now that the struct timehands is again consistent, set the new

Clamp this to UINT_MAX now that it is stored in a u_int.
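
(Worked out: the local scale here is th_scale / 2, i.e., about
2^63 / tc_frequency, so 2^63 / scale is about tc_frequency:
th_large_delta corresponds to roughly one second of ticks on any
timecounter.  The clamp only matters for counters faster than about
4.3 GHz, where a second of ticks no longer fits in a u_int; the 4.08 GHz
TSC used in the tests above comes close.)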

> The current patch becomes too large already, I want to test/commit what
> I already have, and I will need to split it for the commit.

It was already too large.
>
> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> index 2656fb4d22f..7114a0e5219 100644
> --- a/sys/kern/kern_tc.c
> +++ b/sys/kern/kern_tc.c
> ...
> @@ -200,22 +201,77 @@ tc_delta(struct timehands *th)
>  * the comment in <sys/time.h> for a description of these 12 functions.
>  */
>
> -#ifdef FFCLOCK
> -void
> -fbclock_binuptime(struct bintime *bt)
> +static __inline void
> +bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)

This name is not descriptive.

> +static __inline void
> +binnouptime(struct bintime *bt, u_int off)

This name is an example of further problems with the naming scheme.
The bintime_ prefix used above is verbose, but it is at least a prefix
and is in the normal bintime_ namespace.  Here the prefix is 'bin',
which is neither of these.  It means bintime_ again, but this duplicates
'time'.

If I liked churn, then I would have changed all names here long ago.
E.g.:
- bintime_ -> bt_, and use it consistently
- timecounter -> tc except for the timecounter public variable
- fb_ -> facebook_ -> /dev/null.  Er, fb_ -> fbt_ or -> ft_.
- bt -> btp when bt is a pointer.  You used bts for a struct in this patch
- unsigned int -> u_int.  I policed this in early timecounter code.
   You fixed some instances of this too.
- th_generation -> th_gen.

Bruce

