powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]

Konstantin Belousov kostikbel at gmail.com
Sun Mar 24 11:01:48 UTC 2019


On Sat, Mar 09, 2019 at 06:00:14PM +1100, Bruce Evans wrote:
> I more strongly disclike (sic) the more complete merge.  The central APIs
> have even more parameters and reduced type safety to describe objects as
> (offset, size) pairs.
I changed the patch to be type-safe.  Now I like it even more.  It provides
1. internal
2. concise
3. type-safe
API to fetch data from timehands.  The implementation needs to be read
only once.

> Small delicate loops are ideal for duplicating.  They are easier to
> understand individually and short enough to compare without using diff
> to see gratuitous and substantive differences.  Multiple instances are
> only hard to write and maintain.  Since these multiple instances are
> already written, they are only harder to maintain.
This is a good argument for keeping bintime_off and getthmember unmerged
(they are two small but delicate loops). The API is internal, so it only
matters to the maintainer, which means that the absence of duplication is
important. Moreover, all these arguments clearly explain why there should
not be twenty similar loops scattered over the source.

> 
> >> XX  void
> >> XX  binuptime(struct bintime *bt)
> >> XX  {
> >> XX @@ -361,7 +383,7 @@
> >> XX  		th = timehands;
> >> XX  		gen = atomic_load_acq_int(&th->th_generation);
> >> XX  		*bt = th->th_offset;
> >> XX -		bintime_addx(bt, th->th_scale * tc_delta(th));
> >> XX +		bintime_adddelta(bt, th);
> >> XX  		atomic_thread_fence_acq();
> >> XX  	} while (gen == 0 || gen != th->th_generation);
> >> XX  }
> >>
> >> This is the kind of non-churning change that I like.
> > Ok.  I made all cases where timehands are read, more uniform by
> > moving calculations after the generation loop.  This makes the
> > atomic part of the functions easier to see, and the loop body has a lower
> > chance to hit a generation reset.
> 
> I think this change is slightly worse:
> - it increases register pressure.  'scale' and 'delta' must be read in a
>    almost program order before the loop exit test.  The above order uses
>    them and stores the results to memory, so more registers are free for
>    the exit test.  i386 certainly runs out of registers.  IIRC, i386 now
>    spills 'gen'.  It would have to spill something to load 'gen' or 'th'
>    for the test.
Which does not matter on any modern architecture anyway.

> - it enlarges the window between reading 'scale' and 'delta' and the
>    caller seeing the results.  Preemption in this window gives results
>    that may be far in the past.
My opinion is that quickly exiting the loop and avoiding a retry is more
important (for performance) than making an impression that we protect
against preemption. If preemption is important for the caller, then the
calling site must use some measures like disabling interrupts and
re-checking the time after the operation. Preemption can occur right
after the loop exit, with the same consequences.

> The 'get' name is another problem.  I would like to kill all the get*time
> functions and not add new names starting with 'get'.  The library
> implementation already doesn't bother optimizing the get*time functions,
> but always uses the hardware timecounter.
> 
> getfoo() is a more natural name than foo_get() for the action of getting
> foo, but the latter is better for consistency, especially in code that
> puts the subsystem name first in nearby code.
> 
> The get*time functions would be better if they were more like
> time_second.  Note that time_second is racy if time_t is too large
> for the arch so that accesses to it are not atomic, as happens on
> 32-bit arches with premature 64-bit time_t.  However, in this 32/64
> case, the race is only run every 136 years, with the next event
> scheduled in 2038, so this race is even less important now than other
> events scheduled in 2038.  Bintimes are 96 or 128 bits, so directly
> copying a global like time_second for them would race every 1/2**32
> second on 32-bit arches or every 1 second on 64-bit arches.  Most of
> the loops on the generation count are for fixing these races, but
> perhaps a simpler method would work.  On 64-bit arches with atomic
> 64 accesses on 32-bit boundaries, the following would work:
> - set the lower 32 bits of the fraction to 0, or ignore them
> - load the higher 32 bits of the fraction and the lower 32 bits of the
>    seconds
> - race once every 136 years starting in 2038 reading the higher 32 bits
>    of the seconds non-atomically.
> - alternatively, break instead of racing in 2038 by setting the higher
>    32 bits to 0.  This is the same as using sbintimes instead of bintimes.
> - drop a few more lower bits by storing a right-shifted value.  Right
>    shifting by just 1 gives a race frequency of once per 272 years, with
>    the next one in 2106.
It would make sense if the functions were written from scratch, but since
we already have the generation counts, it is not obvious that such a
change is useful.
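
For concreteness, the scheme sketched above (packing the low 32 bits of
the seconds together with the high 32 bits of the fraction into one
64-bit word that can be loaded atomically) could look like the following;
the struct and names here are hypothetical stand-ins, not the kernel's:

```c
#include <stdint.h>

/* Stand-in for the kernel's struct bintime. */
struct bintime64 {
	int64_t	 sec;
	uint64_t frac;
};

/*
 * Pack the low 32 bits of .sec (high half) with the high 32 bits of
 * .frac (low half).  A single atomic 64-bit store/load of this word
 * would replace the generation loop, at the cost of 2**-32 s resolution
 * and a non-atomic wrap of the seconds once every 2**32 s (~136 years).
 */
static uint64_t
bintime_pack(const struct bintime64 *bt)
{
	return (((uint64_t)bt->sec << 32) | (bt->frac >> 32));
}

static void
bintime_unpack(uint64_t snap, struct bintime64 *bt)
{
	bt->sec = snap >> 32;	/* high 32 bits of the seconds are lost */
	bt->frac = snap << 32;	/* low 32 bits of the fraction are lost */
}
```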

But if we decide to go that route (later), my current patch leaves
exactly one location, getthmember(), to experiment with and to change
afterward. So you have effectively made yet another, and perhaps the most
convincing, argument for me.

> 
> > diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> > index 2656fb4d22f..8d12847f2cd 100644
> > --- a/sys/kern/kern_tc.c
> > +++ b/sys/kern/kern_tc.c
> > @@ -200,20 +202,56 @@ tc_delta(struct timehands *th)
> >  * the comment in <sys/time.h> for a description of these 12 functions.
> >  */
> >
> > -#ifdef FFCLOCK
> > -void
> > -fbclock_binuptime(struct bintime *bt)
> > +static __inline void
> > +bintime_off(struct bintime *bt, u_int off)
> > {
> > 	struct timehands *th;
> > -	unsigned int gen;
> > +	struct bintime *btp;
> > +	uint64_t scale, x;
> > +	u_int delta, gen, large_delta;
> >
> > 	do {
> > 		th = timehands;
> > 		gen = atomic_load_acq_int(&th->th_generation);
> > -		*bt = th->th_offset;
> > -		bintime_addx(bt, th->th_scale * tc_delta(th));
> 
> You didn't fully obfuscate this by combining this function with
> getthmember() so as to deduplicate the loop.
> 
> > +		btp = (struct bintime *)((vm_offset_t)th + off);
> 
> Ugly conversion to share code.  This is technically incorrect.  Improving
> the casts gives:
> 
>  	btp = (void *)(uintptr_t)((uintptr_t)(void *)th + off);
> 
> but this assumes that arithmetic on the intermediate integer does what
> is expected.  uintptr_t is only guaranteed to work when the intermediate
> representation held in it is not adjusted.
vm_offset_t has the semantics needed for the arithmetic.  It is better
than uintptr_t for the kernel, where we know that all object pointers are
compatible with vm_offset_t (AKA the flat tag-less memory model).

> 
> Fixing the API gives
> 
>      static __inline void
>      bintime_off(struct bintime *btp, struct bintime *base_btp)
> 
> where base_btp is &th->th_bintime or &th->th_offset.
> 
> (th_offset and th_bintime are badly named.  th_offset is really a base
> time and the offset is tc_delta().  th_bintime is also a base time.
> It is the same as th_offset with another actual offset (the difference
> between UTC and local time) already added to it as an optimization.  In
> old versions, th_bintime didn't exist, but the related struct members
> th_nanotime and th_microtime existed, since these benefit more from
> not converting on every call.)
How could it be &th->th_offset, when th is calculated inside the call?
But I did modify the API in this spirit, indeed.  It takes the member
name directly as an argument.

> 
> My old version even documents the struct members, while -current still
> has no comments.  The comments were lost to staticization.  My version
> mostly adds "duh" to the banal comments after recovering them:
> 
> XX /*
> XX  * XXX rotted comment cloned from <sys/timetc.h>.
> XX  *
> XX  * th_counter is undocumented (duh).
> XX  *
> XX  * th_adjustment [PPM << 16] which means that the smallest unit of correction
> XX  *     you can apply amounts to 481.5 usec/year.
> XX  *
> XX  * th_scale is undocumented (duh).
> XX  *
> XX  * th_offset_count is the contents of the counter which corresponds to the
> XX  *     rest of the offset_* values.
> XX  *
> XX  * th_offset is undocumented (duh).
> XX  *
> XX  * th_microtime is undocumented (duh).
> XX  *
> XX  * th_nanotime is undocumented (duh).
> XX  *
> XX  * XXX especially massive bitrot here.  "three" is now "many"...
> XX  * Each timecounter must supply an array of three timecounters.  This is needed
> XX  * to guarantee atomicity in the code.  Index zero is used to transport
> XX  * modifications, for instance done with sysctl, into the timecounter being
> XX  * used in a safe way.  Such changes may be adopted with a delay of up to 1/HZ.
> XX  * Index one and two are used alternately for the actual timekeeping.
> XX  *
> XX  * th_generation is undocumented (duh).
> XX  *
> XX  * th_next is undocumented (duh).
> XX  */
> 
> > +		*bt = *btp;
> > +		scale = th->th_scale;
> > +		delta = tc_delta(th);
> > +		large_delta = th->th_large_delta;
> 
> I had forgotten that th_scale is so volatile (it may be adjusted on
> every windup).  th_large_delta is equally volatile.  So moving the
> calculation outside of the loop gives even more register pressure
> than I noticed above.
> 
> > 		atomic_thread_fence_acq();
> > 	} while (gen == 0 || gen != th->th_generation);
> > +
> > +		if (__predict_false(delta >= large_delta)) {
> > +		/* Avoid overflow for scale * delta. */
> > +		x = (scale >> 32) * delta;
> > +		bt->sec += x >> 32;
> > +		bintime_addx(bt, x << 32);
> > +		bintime_addx(bt, (scale & 0xffffffff) * delta);
> > +	} else {
> > +		bintime_addx(bt, scale * delta);
> > +	}
> > +}
> > +
> > +static __inline void
> > +getthmember(void *out, size_t out_size, u_int off)
> > +{
> > +	struct timehands *th;
> > +	u_int gen;
> > +
> > +	do {
> > +		th = timehands;
> > +		gen = atomic_load_acq_int(&th->th_generation);
> > +		memcpy(out, (char *)th + off, out_size);
> 
> This isn't so ugly or technically incorrect.  Now the object is generic,
> so the reference to it should be passed as (void *objp, size_t objsize)
> instead of the type-safe (struct bintime *base_btp).
_Generic is what gave me the hint on how to make the implementation
type-safe.

diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 2656fb4d22f..4e94f762026 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -72,6 +72,7 @@ struct timehands {
 	struct timecounter	*th_counter;
 	int64_t			th_adjustment;
 	uint64_t		th_scale;
+	u_int			th_large_delta;
 	u_int	 		th_offset_count;
 	struct bintime		th_offset;
 	struct bintime		th_bintime;
@@ -90,6 +91,7 @@ static struct timehands th1 = {
 static struct timehands th0 = {
 	.th_counter = &dummy_timecounter,
 	.th_scale = (uint64_t)-1 / 1000000,
+	.th_large_delta = 1000000,
 	.th_offset = { .sec = 1 },
 	.th_generation = 1,
 	.th_next = &th1
@@ -200,20 +202,72 @@ tc_delta(struct timehands *th)
  * the comment in <sys/time.h> for a description of these 12 functions.
  */
 
-#ifdef FFCLOCK
-void
-fbclock_binuptime(struct bintime *bt)
+static __inline void
+bintime_off(struct bintime *bt, u_int off)
 {
 	struct timehands *th;
-	unsigned int gen;
+	struct bintime *btp;
+	uint64_t scale, x;
+	u_int delta, gen, large_delta;
 
 	do {
 		th = timehands;
 		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_offset;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
+		btp = (struct bintime *)((vm_offset_t)th + off);
+		*bt = *btp;
+		scale = th->th_scale;
+		delta = tc_delta(th);
+		large_delta = th->th_large_delta;
 		atomic_thread_fence_acq();
 	} while (gen == 0 || gen != th->th_generation);
+
+	if (__predict_false(delta >= large_delta)) {
+		/* Avoid overflow for scale * delta. */
+		x = (scale >> 32) * delta;
+		bt->sec += x >> 32;
+		bintime_addx(bt, x << 32);
+		bintime_addx(bt, (scale & 0xffffffff) * delta);
+	} else {
+		bintime_addx(bt, scale * delta);
+	}
+}
+#define	GETTHBINTIME(dst, member)						\
+do {									\
+	_Static_assert(_Generic(((struct timehands *)NULL)->member,	\
+	    struct bintime: 1, default: 0) == 1,			\
+	    "struct timehands member is not of struct bintime type");	\
+	bintime_off(dst, __offsetof(struct timehands, member));		\
+} while (0)
+
+static __inline void
+getthmember(void *out, size_t out_size, u_int off)
+{
+	struct timehands *th;
+	u_int gen;
+
+	do {
+		th = timehands;
+		gen = atomic_load_acq_int(&th->th_generation);
+		memcpy(out, (char *)th + off, out_size);
+		atomic_thread_fence_acq();
+	} while (gen == 0 || gen != th->th_generation);
+}
+#define	GETTHMEMBER(dst, member)					\
+do {									\
+	_Static_assert(_Generic(*dst,					\
+	    __typeof(((struct timehands *)NULL)->member): 1,		\
+	    default: 0) == 1,						\
+	    "*dst and struct timehands member have different types");	\
+	getthmember(dst, sizeof(*dst), __offsetof(struct timehands,	\
+	    member));							\
+} while (0)
+
+#ifdef FFCLOCK
+void
+fbclock_binuptime(struct bintime *bt)
+{
+
+	GETTHBINTIME(bt, th_offset);
 }
 
 void
@@ -237,16 +291,8 @@ fbclock_microuptime(struct timeval *tvp)
 void
 fbclock_bintime(struct bintime *bt)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_bintime;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHBINTIME(bt, th_bintime);
 }
 
 void
@@ -270,100 +316,55 @@ fbclock_microtime(struct timeval *tvp)
 void
 fbclock_getbinuptime(struct bintime *bt)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_offset;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(bt, th_offset);
 }
 
 void
 fbclock_getnanouptime(struct timespec *tsp)
 {
-	struct timehands *th;
-	unsigned int gen;
+	struct bintime bt;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		bintime2timespec(&th->th_offset, tsp);
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(&bt, th_offset);
+	bintime2timespec(&bt, tsp);
 }
 
 void
 fbclock_getmicrouptime(struct timeval *tvp)
 {
-	struct timehands *th;
-	unsigned int gen;
+	struct bintime bt;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		bintime2timeval(&th->th_offset, tvp);
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(&bt, th_offset);
+	bintime2timeval(&bt, tvp);
 }
 
 void
 fbclock_getbintime(struct bintime *bt)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_bintime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(bt, th_bintime);
 }
 
 void
 fbclock_getnanotime(struct timespec *tsp)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tsp = th->th_nanotime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(tsp, th_nanotime);
 }
 
 void
 fbclock_getmicrotime(struct timeval *tvp)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tvp = th->th_microtime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(tvp, th_microtime);
 }
 #else /* !FFCLOCK */
+
 void
 binuptime(struct bintime *bt)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_offset;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHBINTIME(bt, th_offset);
 }
 
 void
@@ -387,16 +388,8 @@ microuptime(struct timeval *tvp)
 void
 bintime(struct bintime *bt)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_bintime;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHBINTIME(bt, th_bintime);
 }
 
 void
@@ -420,85 +413,47 @@ microtime(struct timeval *tvp)
 void
 getbinuptime(struct bintime *bt)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_offset;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(bt, th_offset);
 }
 
 void
 getnanouptime(struct timespec *tsp)
 {
-	struct timehands *th;
-	u_int gen;
+	struct bintime bt;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		bintime2timespec(&th->th_offset, tsp);
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(&bt, th_offset);
+	bintime2timespec(&bt, tsp);
 }
 
 void
 getmicrouptime(struct timeval *tvp)
 {
-	struct timehands *th;
-	u_int gen;
+	struct bintime bt;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		bintime2timeval(&th->th_offset, tvp);
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(&bt, th_offset);
+	bintime2timeval(&bt, tvp);
 }
 
 void
 getbintime(struct bintime *bt)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_bintime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(bt, th_bintime);
 }
 
 void
 getnanotime(struct timespec *tsp)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tsp = th->th_nanotime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(tsp, th_nanotime);
 }
 
 void
 getmicrotime(struct timeval *tvp)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tvp = th->th_microtime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(tvp, th_microtime);
 }
 #endif /* FFCLOCK */
 
@@ -514,15 +469,8 @@ getboottime(struct timeval *boottime)
 void
 getboottimebin(struct bintime *boottimebin)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*boottimebin = th->th_boottime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(boottimebin, th_boottime);
 }
 
 #ifdef FFCLOCK
@@ -1038,15 +986,8 @@ getmicrotime(struct timeval *tvp)
 void
 dtrace_getnanotime(struct timespec *tsp)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tsp = th->th_nanotime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	GETTHMEMBER(tsp, th_nanotime);
 }
 
 /*
@@ -1464,6 +1405,7 @@ tc_windup(struct bintime *new_boottimebin)
 	scale += (th->th_adjustment / 1024) * 2199;
 	scale /= th->th_counter->tc_frequency;
 	th->th_scale = scale * 2;
+	th->th_large_delta = MIN(((uint64_t)1 << 63) / scale, UINT_MAX);
 
 	/*
 	 * Now that the struct timehands is again consistent, set the new


More information about the freebsd-hackers mailing list