powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]

Michael Tuexen tuexen at fh-muenster.de
Tue Apr 2 22:27:45 UTC 2019


> On 24. Mar 2019, at 12:01, Konstantin Belousov <kostikbel at gmail.com> wrote:
> 
> On Sat, Mar 09, 2019 at 06:00:14PM +1100, Bruce Evans wrote:
>> I more strongly disclike (sic) the more complete merge.  The central APIs
>> have even more parameters and reduced type safety to describe objects as
>> (offset, size) pairs.
> I changed the patch to be type-safe.  Now I like it even more.  It provides
> 1. internal
> 2. concise
> 3. type-safe
> API to fetch data from timehands.  The implementation needs to be read
> only once.
Hi,

I'm a bit lost... I think this started to fix a problem on G5 PowerMacs.
Do you think this patch solves the problem? Should it be tested?
Or is this still work in progress, or a general improvement not necessarily
fixing the problem on G5 PowerMacs?

Best regards
Michael
> 
>> Small delicate loops are ideal for duplicating.  They are easier to
>> understand individually and short enough to compare without using diff
>> to see gratuitous and substantive differences.  Multiple instances are
>> only hard to write and maintain.  Since these multiple instances are
>> already written, they are only harder to maintain.
> This is a good argument to have bintime_off and getthmember unmerged
> (there are two small but delicate loops). The API is internal, so it
> only matters for the maintainer, which means that the absence of
> duplication is important. Moreover, all these arguments clearly explain
> why there should not be twenty similar loops scattered over the source.
> 
>> 
>>>> XX  void
>>>> XX  binuptime(struct bintime *bt)
>>>> XX  {
>>>> XX @@ -361,7 +383,7 @@
>>>> XX  		th = timehands;
>>>> XX  		gen = atomic_load_acq_int(&th->th_generation);
>>>> XX  		*bt = th->th_offset;
>>>> XX -		bintime_addx(bt, th->th_scale * tc_delta(th));
>>>> XX +		bintime_adddelta(bt, th);
>>>> XX  		atomic_thread_fence_acq();
>>>> XX  	} while (gen == 0 || gen != th->th_generation);
>>>> XX  }
>>>> 
>>>> This is the kind of non-churning change that I like.
>>> Ok.  I made all cases where timehands are read, more uniform by
>>> moving calculations after the generation loop.  This makes the
>>> atomic part of the functions easier to see, and the loop body has a lower
>>> chance to hit a generation reset.
>> 
>> I think this change is slightly worse:
>> - it increases register pressure.  'scale' and 'delta' must be read in
>>   almost program order before the loop exit test.  The above order uses
>>   them and stores the results to memory, so more registers are free for
>>   the exit test.  i386 certainly runs out of registers.  IIRC, i386 now
>>   spills 'gen'.  It would have to spill something to load 'gen' or 'th'
>>   for the test.
> Which does not matter on any modern architecture anyway.
> 
>> - it enlarges the window between reading 'scale' and 'delta' and the
>>   caller seeing the results.  Preemption in this window gives results
>>   that may be far in the past.
> My opinion is that quickly exiting the loop and avoiding a retry is more
> important (as in performance) than making an impression that we protect
> against preemption. If preemption is important for the caller, then
> the calling place must use some measures like interrupt disabling and
> re-checking the time after the operation. Preemption can occur after the
> loop exit with the same consequences.
> 
>> The 'get' name is another problem.  I would like to remove all the get*time
>> functions and not add new names starting with 'get'.  The library
>> implementation already doesn't bother optimizing the get*time functions,
>> but always uses the hardware timecounter.
>> 
>> getfoo() is a more natural name than foo_get() for the action of getting
>> foo, but the latter is better for consistency, especially in code that
>> puts the subsystem name first in nearby code.
>> 
>> The get*time functions would be better if they were more like
>> time_second.  Note that time_second is racy if time_t is too large
>> for the arch so that accesses to it are not atomic, as happens on
>> 32-bit arches with premature 64-bit time_t.  However, in this 32/64
>> case, the race is only run every 136 years, with the next event
>> scheduled in 2038, so this race is even less important now than other
>> events scheduled in 2038.  Bintimes are 96 or 128 bits, so directly
>> copying a global like time_second for them would race every 1/2**32
>> second on 32-bit arches or every 1 second on 64-bit arches.  Most of
>> the loops on the generation count are for fixing these races, but
>> perhaps a simpler method would work.  On 64-bit arches with atomic
>> 64 accesses on 32-bit boundaries, the following would work:
>> - set the lower 32 bits of the fraction to 0, or ignore them
>> - load the higher 32 bits of the fraction and the lower 32 bits of the
>>   seconds
>> - race once every 136 years starting in 2038 reading the higher 32 bits
>>   of the seconds non-atomically.
>> - alternatively, break instead of racing in 2038 by setting the higher
>>   32 bits to 0.  This is the same as using sbintimes instead of bintimes.
>> - drop a few more lower bits by storing a right-shifted value.  Right
>>   shifting by just 1 gives a race frequency of once per 272 years, with
>>   the next one in 2006.
> It would make sense if the functions were written from scratch, but since
> we already have the generation counts, it is not obvious that such a change
> is useful.
> 
> But if we decide to go that route (later), my current patch only
> requires exactly one location, getthmember(), to experiment with and to
> change afterwards. So you effectively made yet another, and perhaps the
> most convincing, argument for me.
> 
>> 
>>> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
>>> index 2656fb4d22f..8d12847f2cd 100644
>>> --- a/sys/kern/kern_tc.c
>>> +++ b/sys/kern/kern_tc.c
>>> @@ -200,20 +202,56 @@ tc_delta(struct timehands *th)
>>> * the comment in <sys/time.h> for a description of these 12 functions.
>>> */
>>> 
>>> -#ifdef FFCLOCK
>>> -void
>>> -fbclock_binuptime(struct bintime *bt)
>>> +static __inline void
>>> +bintime_off(struct bintime *bt, u_int off)
>>> {
>>> 	struct timehands *th;
>>> -	unsigned int gen;
>>> +	struct bintime *btp;
>>> +	uint64_t scale, x;
>>> +	u_int delta, gen, large_delta;
>>> 
>>> 	do {
>>> 		th = timehands;
>>> 		gen = atomic_load_acq_int(&th->th_generation);
>>> -		*bt = th->th_offset;
>>> -		bintime_addx(bt, th->th_scale * tc_delta(th));
>> 
>> You didn't fully obfuscate this by combining this function with
>> getthmember() so as to deduplicate the loop.
>> 
>>> +		btp = (struct bintime *)((vm_offset_t)th + off);
>> 
>> Ugly conversion to share code.  This is technically incorrect.  Improving
>> the casts gives:
>> 
>> 	btp = (void *)(uintptr_t)((uintptr_t)(void *)th + off);
>> 
>> but this assumes that arithmetic on the intermediate integer does what
>> is expected.  uintptr_t is only guaranteed to work when the intermediate
>> representation held in it is not adjusted.
> vm_offset_t has the semantics needed for the arithmetic.  It is better
> than uintptr_t for the kernel, where we know that all object pointers
> are compatible with vm_offset_t (AKA flat tag-less memory model).
> 
>> 
>> Fixing the API gives
>> 
>>     static __inline void
>>     bintime_off(struct bintime *btp, struct bintime *base_btp)
>> 
>> where base_btp is &th->th_bintime or &th->th_offset.
>> 
>> (th_offset and th_bintime are badly named.  th_offset is really a base
>> time and the offset is tc_delta().  th_bintime is also a base time.
>> It is the same as th_offset with another actual offset (the difference
>> between UTC and local time) already added to it as an optimization.  In
>> old versions, th_bintime didn't exist, but the related struct members
>> th_nanotime and th_microtime existed, since these benefit more from
>> not converting on every call.)
> How could it be &th->th_offset, when th is calculated inside the call?
> But I did modify the API in this spirit, indeed.  It takes the member
> name directly as an argument.
> 
>> 
>> My old version even documents the struct members, while -current still
>> has no comments.  The comments were lost to staticization.  My version
>> mostly adds "duh" to the banal comments after recovering them:
>> 
>> XX /*
>> XX  * XXX rotted comment cloned from <sys/timetc.h>.
>> XX  *
>> XX  * th_counter is undocumented (duh).
>> XX  *
>> XX  * th_adjustment [PPM << 16] which means that the smallest unit of correction
>> XX  *     you can apply amounts to 481.5 usec/year.
>> XX  *
>> XX  * th_scale is undocumented (duh).
>> XX  *
>> XX  * th_offset_count is the contents of the counter which corresponds to the
>> XX  *     rest of the offset_* values.
>> XX  *
>> XX  * th_offset is undocumented (duh).
>> XX  *
>> XX  * th_microtime is undocumented (duh).
>> XX  *
>> XX  * th_nanotime is undocumented (duh).
>> XX  *
>> XX  * XXX especially massive bitrot here.  "three" is now "many"...
>> XX  * Each timecounter must supply an array of three timecounters.  This is needed
>> XX  * to guarantee atomicity in the code.  Index zero is used to transport
>> XX  * modifications, for instance done with sysctl, into the timecounter being
>> XX  * used in a safe way.  Such changes may be adopted with a delay of up to 1/HZ.
>> XX  * Index one and two are used alternately for the actual timekeeping.
>> XX  *
>> XX  * th_generation is undocumented (duh).
>> XX  *
>> XX  * th_next is undocumented (duh).
>> XX  */
>> 
>>> +		*bt = *btp;
>>> +		scale = th->th_scale;
>>> +		delta = tc_delta(th);
>>> +		large_delta = th->th_large_delta;
>> 
>> I had forgotten that th_scale is so volatile (it may be adjusted on
>> every windup).  th_large_delta is equally volatile.  So moving the
>> calculation outside of the loop gives even more register pressure
>> than I noticed above.
>> 
>>> 		atomic_thread_fence_acq();
>>> 	} while (gen == 0 || gen != th->th_generation);
>>> +
>>> +	if (__predict_false(delta >= large_delta)) {
>>> +		/* Avoid overflow for scale * delta. */
>>> +		x = (scale >> 32) * delta;
>>> +		bt->sec += x >> 32;
>>> +		bintime_addx(bt, x << 32);
>>> +		bintime_addx(bt, (scale & 0xffffffff) * delta);
>>> +	} else {
>>> +		bintime_addx(bt, scale * delta);
>>> +	}
>>> +}
>>> +
>>> +static __inline void
>>> +getthmember(void *out, size_t out_size, u_int off)
>>> +{
>>> +	struct timehands *th;
>>> +	u_int gen;
>>> +
>>> +	do {
>>> +		th = timehands;
>>> +		gen = atomic_load_acq_int(&th->th_generation);
>>> +		memcpy(out, (char *)th + off, out_size);
>> 
>> This isn't so ugly or technically incorrect.  Now the object is generic,
>> so the reference to it should be passed as (void *objp, size_t objsize)
>> instead of the type-safe (struct bintime *base_btp).
> _Generic is what gave me a hint how to make the implementation type-safe.
> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> index 2656fb4d22f..4e94f762026 100644
> --- a/sys/kern/kern_tc.c
> +++ b/sys/kern/kern_tc.c
> @@ -72,6 +72,7 @@ struct timehands {
> 	struct timecounter	*th_counter;
> 	int64_t			th_adjustment;
> 	uint64_t		th_scale;
> +	u_int			th_large_delta;
> 	u_int	 		th_offset_count;
> 	struct bintime		th_offset;
> 	struct bintime		th_bintime;
> @@ -90,6 +91,7 @@ static struct timehands th1 = {
> static struct timehands th0 = {
> 	.th_counter = &dummy_timecounter,
> 	.th_scale = (uint64_t)-1 / 1000000,
> +	.th_large_delta = 1000000,
> 	.th_offset = { .sec = 1 },
> 	.th_generation = 1,
> 	.th_next = &th1
> @@ -200,20 +202,72 @@ tc_delta(struct timehands *th)
>  * the comment in <sys/time.h> for a description of these 12 functions.
>  */
> 
> -#ifdef FFCLOCK
> -void
> -fbclock_binuptime(struct bintime *bt)
> +static __inline void
> +bintime_off(struct bintime *bt, u_int off)
> {
> 	struct timehands *th;
> -	unsigned int gen;
> +	struct bintime *btp;
> +	uint64_t scale, x;
> +	u_int delta, gen, large_delta;
> 
> 	do {
> 		th = timehands;
> 		gen = atomic_load_acq_int(&th->th_generation);
> -		*bt = th->th_offset;
> -		bintime_addx(bt, th->th_scale * tc_delta(th));
> +		btp = (struct bintime *)((vm_offset_t)th + off);
> +		*bt = *btp;
> +		scale = th->th_scale;
> +		delta = tc_delta(th);
> +		large_delta = th->th_large_delta;
> 		atomic_thread_fence_acq();
> 	} while (gen == 0 || gen != th->th_generation);
> +
> +	if (__predict_false(delta >= large_delta)) {
> +		/* Avoid overflow for scale * delta. */
> +		x = (scale >> 32) * delta;
> +		bt->sec += x >> 32;
> +		bintime_addx(bt, x << 32);
> +		bintime_addx(bt, (scale & 0xffffffff) * delta);
> +	} else {
> +		bintime_addx(bt, scale * delta);
> +	}
> +}
> +#define	GETTHBINTIME(dst, member)						\
> +do {									\
> +	_Static_assert(_Generic(((struct timehands *)NULL)->member,	\
> +	    struct bintime: 1, default: 0) == 1,			\
> +	    "struct timehands member is not of struct bintime type");	\
> +	bintime_off(dst, __offsetof(struct timehands, member));		\
> +} while (0)
> +
> +static __inline void
> +getthmember(void *out, size_t out_size, u_int off)
> +{
> +	struct timehands *th;
> +	u_int gen;
> +
> +	do {
> +		th = timehands;
> +		gen = atomic_load_acq_int(&th->th_generation);
> +		memcpy(out, (char *)th + off, out_size);
> +		atomic_thread_fence_acq();
> +	} while (gen == 0 || gen != th->th_generation);
> +}
> +#define	GETTHMEMBER(dst, member)					\
> +do {									\
> +	_Static_assert(_Generic(*dst,					\
> +	    __typeof(((struct timehands *)NULL)->member): 1,		\
> +	    default: 0) == 1,						\
> +	    "*dst and struct timehands member have different types");	\
> +	getthmember(dst, sizeof(*dst), __offsetof(struct timehands,	\
> +	    member));							\
> +} while (0)
> +
> +#ifdef FFCLOCK
> +void
> +fbclock_binuptime(struct bintime *bt)
> +{
> +
> +	GETTHBINTIME(bt, th_offset);
> }
> 
> void
> @@ -237,16 +291,8 @@ fbclock_microuptime(struct timeval *tvp)
> void
> fbclock_bintime(struct bintime *bt)
> {
> -	struct timehands *th;
> -	unsigned int gen;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*bt = th->th_bintime;
> -		bintime_addx(bt, th->th_scale * tc_delta(th));
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHBINTIME(bt, th_bintime);
> }
> 
> void
> @@ -270,100 +316,55 @@ fbclock_microtime(struct timeval *tvp)
> void
> fbclock_getbinuptime(struct bintime *bt)
> {
> -	struct timehands *th;
> -	unsigned int gen;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*bt = th->th_offset;
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(bt, th_offset);
> }
> 
> void
> fbclock_getnanouptime(struct timespec *tsp)
> {
> -	struct timehands *th;
> -	unsigned int gen;
> +	struct bintime bt;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		bintime2timespec(&th->th_offset, tsp);
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(&bt, th_offset);
> +	bintime2timespec(&bt, tsp);
> }
> 
> void
> fbclock_getmicrouptime(struct timeval *tvp)
> {
> -	struct timehands *th;
> -	unsigned int gen;
> +	struct bintime bt;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		bintime2timeval(&th->th_offset, tvp);
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(&bt, th_offset);
> +	bintime2timeval(&bt, tvp);
> }
> 
> void
> fbclock_getbintime(struct bintime *bt)
> {
> -	struct timehands *th;
> -	unsigned int gen;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*bt = th->th_bintime;
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(bt, th_bintime);
> }
> 
> void
> fbclock_getnanotime(struct timespec *tsp)
> {
> -	struct timehands *th;
> -	unsigned int gen;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*tsp = th->th_nanotime;
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(tsp, th_nanotime);
> }
> 
> void
> fbclock_getmicrotime(struct timeval *tvp)
> {
> -	struct timehands *th;
> -	unsigned int gen;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*tvp = th->th_microtime;
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(tvp, th_microtime);
> }
> #else /* !FFCLOCK */
> +
> void
> binuptime(struct bintime *bt)
> {
> -	struct timehands *th;
> -	u_int gen;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*bt = th->th_offset;
> -		bintime_addx(bt, th->th_scale * tc_delta(th));
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHBINTIME(bt, th_offset);
> }
> 
> void
> @@ -387,16 +388,8 @@ microuptime(struct timeval *tvp)
> void
> bintime(struct bintime *bt)
> {
> -	struct timehands *th;
> -	u_int gen;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*bt = th->th_bintime;
> -		bintime_addx(bt, th->th_scale * tc_delta(th));
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHBINTIME(bt, th_bintime);
> }
> 
> void
> @@ -420,85 +413,47 @@ microtime(struct timeval *tvp)
> void
> getbinuptime(struct bintime *bt)
> {
> -	struct timehands *th;
> -	u_int gen;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*bt = th->th_offset;
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(bt, th_offset);
> }
> 
> void
> getnanouptime(struct timespec *tsp)
> {
> -	struct timehands *th;
> -	u_int gen;
> +	struct bintime bt;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		bintime2timespec(&th->th_offset, tsp);
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(&bt, th_offset);
> +	bintime2timespec(&bt, tsp);
> }
> 
> void
> getmicrouptime(struct timeval *tvp)
> {
> -	struct timehands *th;
> -	u_int gen;
> +	struct bintime bt;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		bintime2timeval(&th->th_offset, tvp);
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(&bt, th_offset);
> +	bintime2timeval(&bt, tvp);
> }
> 
> void
> getbintime(struct bintime *bt)
> {
> -	struct timehands *th;
> -	u_int gen;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*bt = th->th_bintime;
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(bt, th_bintime);
> }
> 
> void
> getnanotime(struct timespec *tsp)
> {
> -	struct timehands *th;
> -	u_int gen;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*tsp = th->th_nanotime;
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(tsp, th_nanotime);
> }
> 
> void
> getmicrotime(struct timeval *tvp)
> {
> -	struct timehands *th;
> -	u_int gen;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*tvp = th->th_microtime;
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(tvp, th_microtime);
> }
> #endif /* FFCLOCK */
> 
> @@ -514,15 +469,8 @@ getboottime(struct timeval *boottime)
> void
> getboottimebin(struct bintime *boottimebin)
> {
> -	struct timehands *th;
> -	u_int gen;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*boottimebin = th->th_boottime;
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(boottimebin, th_boottime);
> }
> 
> #ifdef FFCLOCK
> @@ -1038,15 +986,8 @@ getmicrotime(struct timeval *tvp)
> void
> dtrace_getnanotime(struct timespec *tsp)
> {
> -	struct timehands *th;
> -	u_int gen;
> 
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*tsp = th->th_nanotime;
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
> +	GETTHMEMBER(tsp, th_nanotime);
> }
> 
> /*
> @@ -1464,6 +1405,7 @@ tc_windup(struct bintime *new_boottimebin)
> 	scale += (th->th_adjustment / 1024) * 2199;
> 	scale /= th->th_counter->tc_frequency;
> 	th->th_scale = scale * 2;
> +	th->th_large_delta = MIN(((uint64_t)1 << 63) / scale, UINT_MAX);
> 
> 	/*
> 	 * Now that the struct timehands is again consistent, set the new
> _______________________________________________
> freebsd-ppc at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-ppc
> To unsubscribe, send any mail to "freebsd-ppc-unsubscribe at freebsd.org"
