powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patch failed]

Mark Millard marklmi at yahoo.com
Fri Mar 1 20:15:46 UTC 2019



On 2019-Mar-1, at 10:40, Bruce Evans <brde at optusnet.com.au> wrote:

> On Fri, 1 Mar 2019, Konstantin Belousov wrote:
> 
>> On Thu, Feb 28, 2019 at 11:39:06PM -0800, Mark Millard wrote:
>>> [The new, trial code also has truncation occurring.]
>> In fact no, I do not think it is.
>> 
>>> An example observation of diff_scaled having an overflowed
>>> value was:
>>> 
>>> scale_factor            == 0x80da2067ac
>>> scale_factor*freq overflows unsigned, 64 bit representation.
>>> tim_offset              ==   0x3da0eaeb
>>> tim_cnt                 ==   0x42dea3c4
>>> tim_diff                ==    0x53db8d9
>>> For reference:                0x1fc9d43 == 0xffffffffffffffffull/scale_factor
>>> scaled_diff       == 0xA353A5BF3FF780CC (truncated to 64 bits)
>>> 
>>> But for the new, trial code:
>>> 
>>> 0x80da2067ac is 40 bits
>>>   0x53db8d9 is 27 bits
>>> So 67 bits, more than 63. Then:
>>> 
>>>   x
>>> == (0x80da2067ac>>32) * 0x53db8d9
>>> == 0x80 * 0x53db8d9
>>> == 0x29EDC6C80
>>> 
>>>   x>>32
>>> == 0x2
>>> 
>>>   x<<32
>>> == 0x9EDC6C8000000000 (limited to 64 bits)
>>> Note the truncation of: 0x29EDC6C8000000000.
>> Right, this is how the patch is supposed to work.  Note that the overflow
>> bits 'lost' due to overflow of the left shift are the same bits that as
>> used to increment bt->sec:
>> 	bt->sec += x >> 32;
>> So the 2 seconds are accounted for.
>> 
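
To double-check that, I redid the arithmetic from my example as a small
standalone userland program. This is just a sketch of the split
multiplication, not the kernel code itself:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        uint64_t scale = 0x80da2067acULL;       /* observed th_scale */
        uint32_t delta = 0x53db8d9;             /* observed tc_delta() */
        uint64_t x, sec, frac_hi, frac_lo;

        x = (scale >> 32) * (uint64_t)delta;
        sec = x >> 32;                  /* the "lost" bits: 2 seconds */
        frac_hi = x << 32;              /* surviving high fraction bits */
        frac_lo = (scale & 0xffffffffULL) * delta; /* low partial product */

        printf("sec=%ju frac_hi=%#jx frac_lo=%#jx\n",
            (uintmax_t)sec, (uintmax_t)frac_hi, (uintmax_t)frac_lo);
        return (0);
}

It reports sec=2, so the seconds dropped by the left-shift truncation
do reappear via the x >> 32 term, as you describe. In the patch the two
fraction terms then go through bintime_addx(), which carries any
fraction overflow into bt->sec as well.
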
>>> 
>>> Thus the "bintime_addx(bt, x << 32)" is still
>>> based on a truncated value.
>> 
>> I must admit that 2 seconds of interval where the timehands were
>> not updated is too much.  This might be the real cause of all ppc
>> troubles.  I tried to see if the overflow case is possible on amd64,
>> and did not get a single case of the '> 63' branch executed during the
>> /usr/tests/lib/libc run.
> 
> The algorithm requires the update interval to be less than 1 second.
> th_scale is 2**64 / tc_frequency, so whatever tc_frequency is, after
> 1 second the value of the multiplication is approximately 2**64 so
> it overflows about then (depending on rounding).
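
For my own reference, and assuming th_scale really is about
2**64 / tc_frequency as described, the threshold works out as follows
in a standalone sketch:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        uint64_t freq = 33333333;                /* my timebase frequency */
        uint64_t scale = UINT64_MAX / freq;      /* approximates th_scale */
        uint64_t max_delta = UINT64_MAX / scale; /* largest safe tc_delta() */

        printf("scale        = %#jx\n", (uintmax_t)scale);
        printf("max delta    = %ju counts\n", (uintmax_t)max_delta);
        printf("max interval = about %ju second(s)\n",
            (uintmax_t)(max_delta / freq));
        return (0);
}

scale comes out near the 0x80da2067ac I observed, and max_delta comes
out near the frequency itself, i.e. roughly one second's worth of
counts before the 64-bit product can overflow.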

I've not tracked down evidence below binuptime on powerpc64 yet.
I'm not sure how much of the substructure I can investigate.

The context here is 2 sockets, each with 2 cores.

> The most useful timecounters are TSC's, and these give another overflow
> in tc_delta() after 1 second when their frequency is 4 GHz (except the
> bogus TSC-low timecounter reduces the frequency to below 2 binary GHz,
> so the usual case is overflow after 2 seconds).

The wording suggests an amd64/i386 context but my report was for
powerpc64, specifically for old PowerMac G5's. (I currently have access
to only one, but I've seen the behavior on others in the past.)
FreeBSD reports:

# sysctl kern.timecounter
kern.timecounter.tc.timebase.quality: 0
kern.timecounter.tc.timebase.frequency: 33333333
kern.timecounter.tc.timebase.counter: 1831468476
kern.timecounter.tc.timebase.mask: 4294967295
kern.timecounter.stepwarnings: 0
kern.timecounter.alloweddeviation: 5
kern.timecounter.hardware: timebase
kern.timecounter.choice: timebase(0) dummy(-1000000)
kern.timecounter.tick: 1
kern.timecounter.fast_gettime: 1

FreeBSD uses the lower 32 bits of the tbr (via
mftb). As I understand it, there are problems with
how closely the tbr's on the different cores can be kept to each other.
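
Plugging the 33333333 Hz timebase frequency into the numbers from my
earlier observation (plain arithmetic, done here as a throwaway
userland program):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        const double freq = 33333333.0;          /* timebase frequency, Hz */
        const uint32_t tim_diff = 0x53db8d9;     /* observed tc_delta() */
        const uint64_t mask = 0xffffffffULL;     /* 32-bit counter mask */

        printf("observed delta ~= %.2f seconds\n", tim_diff / freq);
        printf("mask rollover  ~= %.1f seconds\n", (mask + 1.0) / freq);
        return (0);
}

So the observed tim_diff is about 2.6 seconds of timebase counting
between updates, well past the roughly one second at which the 64-bit
product overflows, though still far from the roughly 129 seconds at
which the 32-bit counter itself would wrap.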


>> Actually, the same overflow-prone code exists in libc, so below is the
>> updated patch:
>> - I added __predict_false()
>> - libc multiplication is also done separately for high-order bits.
>> (ffclock counterpart is still pending).
>> 
>> diff --git a/lib/libc/sys/__vdso_gettimeofday.c b/lib/libc/sys/__vdso_gettimeofday.c
>> index 3749e0473af..a14576988ff 100644
>> --- a/lib/libc/sys/__vdso_gettimeofday.c
>> +++ b/lib/libc/sys/__vdso_gettimeofday.c
>> @@ -32,6 +32,8 @@ __FBSDID("$FreeBSD$");
>> #include <sys/time.h>
>> #include <sys/vdso.h>
>> #include <errno.h>
>> +#include <limits.h>
>> +#include <strings.h>
>> #include <time.h>
>> #include <machine/atomic.h>
>> #include "libc_private.h"
>> @@ -62,6 +64,7 @@ binuptime(struct bintime *bt, struct vdso_timekeep *tk, int abs)
>> {
>> 	struct vdso_timehands *th;
>> 	uint32_t curr, gen;
>> +	uint64_t scale, x;
>> 	u_int delta;
>> 	int error;
>> 
>> @@ -78,7 +81,14 @@ binuptime(struct bintime *bt, struct vdso_timekeep *tk, int abs)
>> 			continue;
>> 		if (error != 0)
>> 			return (error);
>> -		bintime_addx(bt, th->th_scale * delta);
>> +		scale = th->th_scale;
>> +		if (__predict_false(fls(scale) + fls(delta) > 63)) {
> 
> This is unnecessarily pessimal.  Updates must be frequent enough to prevent
> tc_delta() overflowing, and it is even easier to arrange that this
> multiplication doesn't overflow (since the necessary update interval for
> the latter is constant).
> 
> `scale' is 64 bits, so fls(scale) is broken on 32-bit arches, and
> flsll(scale) is an especially large pessimization.  I saw this on my
> calcru1() fixes -- the flsll()s take almost as long as long long
> divisions when they use the pessimal C versions.

Unlike i386's and amd64's:

static __inline __pure2 int
fls(int mask)
{
        return (mask == 0 ? mask : (int)bsrl((u_int)mask) + 1);
}

powerpc64 apparently uses /usr/src/sys/libkern/fls.c's:

int
fls(int mask)
{
        int bit;

        if (mask == 0)
                return (0);
        for (bit = 1; mask != 1; bit++)
                mask = (unsigned int)mask >> 1;
        return (bit);
}

(At least I did not find an alternate for the
powerpc64 context.)

So, if I got that right, fls does a lot of looping
on powerpc64.
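
Just to see what a constant-time version would look like, here is a
sketch using the compiler builtin. It is not something I found in the
tree and not a proposal for the patch:

#include <stdio.h>

static int
flsll_builtin(unsigned long long mask)
{
        return (mask == 0 ? 0 : 64 - __builtin_clzll(mask));
}

int
main(void)
{
        printf("%d %d %d\n", flsll_builtin(0), flsll_builtin(1),
            flsll_builtin(0x80da2067acULL));    /* prints: 0 1 40 */
        return (0);
}

The 40 matches the bit count I used for scale_factor above, and on
powerpc64 the builtin should turn into a cntlzd instruction rather
than a loop.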



> The algorithm requires tc_delta() to be only 32 bits, since otherwise the
> multiplication would be 64 x 64 bits so would be much slower and harder
> to write.
> 
> If tc_frequency is far above 4GHz, then th_scale is far below 4G, so the
> scaling is not so accurate.  But 0.25 parts per billion is much more than
> enough.  Even 1 part per million is enough for a TSC, since TSC instability
> is more than 1ppm.  The overflows could be pushed off to 1024 seconds by
> dividing by 1024 at suitable places.  A 32-bit scale times a 64-bit delta
> would be simple compared with both 64 bits (much like the code here).
> 
> The code here can be optimized using values calculated at initialization
> time instead of fls*().  The overflow threshold for delta is approximately
> 2**64 / tc_frequency.
> 
>> +			x = (scale >> 32) * delta;
>> +			scale &= UINT_MAX;
>> +			bt->sec += x >> 32;
>> +			bintime_addx(bt, x << 32);
>> +		}
>> +		bintime_addx(bt, scale * delta);
>> 		if (abs)
>> 			bintime_add(bt, &th->th_boottime);
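
To make sure I follow the "calculated at initialization time" idea
above, here is a standalone userland sketch of it. The struct and field
names are invented for illustration; the real timehands code is more
involved, and bintime_addx() would handle the fraction carries:

#include <stdint.h>
#include <stdio.h>

struct th_mimic {
        uint64_t scale;         /* like th_scale: about 2**64 / frequency */
        uint64_t large_delta;   /* invented: threshold computed once */
};

static void
mimic_windup(struct th_mimic *th, uint64_t freq)
{
        th->scale = UINT64_MAX / freq;
        /* Largest delta whose product with scale cannot overflow 64 bits. */
        th->large_delta = UINT64_MAX / th->scale;
}

static void
mimic_binuptime(const struct th_mimic *th, uint64_t *sec, uint64_t *frac,
    uint32_t delta)
{
        uint64_t x;

        if (delta >= th->large_delta) {
                /* Split multiplication, as in the patch above. */
                x = (th->scale >> 32) * delta;
                *sec += x >> 32;
                *frac += x << 32;       /* real code: bintime_addx() */
                *frac += (th->scale & 0xffffffffULL) * delta;
        } else
                *frac += th->scale * delta;     /* real code: bintime_addx() */
}

int
main(void)
{
        struct th_mimic th;
        uint64_t sec = 0, frac = 0;

        mimic_windup(&th, 33333333);
        mimic_binuptime(&th, &sec, &frac, 0x53db8d9);   /* observed delta */
        printf("sec=%ju frac=%#jx\n", (uintmax_t)sec, (uintmax_t)frac);
        return (0);
}

That way the per-call test is a single compare against a value computed
wherever th_scale is set, instead of fls()/flsll() on every read.
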
> 
> When the timecounter is the i8254, as it often was when timecounters
> were new, tc_windup() had to be called more often than every i8254 rollover
> (in practice once every hardclock tick), partly to keep tc_delta() small
> (since rollover gives a form of overflow).  This was not so easy to arrange.
> It requires not losing any hardclock ticks and also not having any with
> high latency and also complications to detect the rollover when there is
> small latency.  Most hardware is easier to handle now.  With tickless
> kernels, hardclock() is often not called for about 1 second, but it must
> be called at least that often to prevent the overflow here.

Thanks for the notes. I've been wandering around in unfamiliar territory.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)


