powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patch failed]

Bruce Evans brde at optusnet.com.au
Wed Apr 3 15:47:41 UTC 2019


On Wed, 3 Apr 2019, Konstantin Belousov wrote:

> On Wed, Apr 03, 2019 at 12:27:32AM +0200, Michael Tuexen wrote:
>>> On 24. Mar 2019, at 12:01, Konstantin Belousov <kostikbel at gmail.com> wrote:
>>>
>>> On Sat, Mar 09, 2019 at 06:00:14PM +1100, Bruce Evans wrote:
>>>> I more strongly disclike (sic) the more complete merge.  The central APIs
>>>> have even more parameters and reduced type safety to describe objects as
>>>> (offset, size) pairs.
>>> I changed the patch to be type-safe.  Now I like it even more.  It provides
>>> 1. internal
>>> 2. concise
>>> 3. type-safe
>>> API to fetch data from timehands.  The implementation needs to be read
>>> only once.
>> Hi,
>>
>> I'm a bit lost... I think this started to fix a problem on G5 PowerMacs.
>> Do you think this patch solves the problem. Should this be tested?
>> Or is this still work in progress or a general improvement not necessarily
>> fixing the problem on G5 PowerMacs?
>
> It started from a report of issues on G5.  The specific issues are
> bugs on G5, and the posted patches do not fix them.  From what I see,
> the timecounter values were wrapped.  This is a genuine ppc or G5 issue.

I am still unhappy with the patch (but haven't got around to replying to
the latest version).  It is overengineered and doesn't fix the original
problem.

> The patch fixes time keeping subsystem reaction to the already
> bad situation, by correctly handling overflow in calculations.  This
> overflow can occur in more reasonable setups as well, e.g. if ddb was
> activated and interrupts were stopped for prolonged period, even on x86.

I have noticed that it mostly fixes this.

> In addition, my version of the patch reorganizes the code and removes
> excessive copies of the most delicate loops in lock-less readers. This
> chunk can be split from the overflow part, but it is not completely
> trivial.

This is the part that I don't like.

I noticed (or rather, realized) a general problem with multiple
timehands.  ntpd can slew the clock at up to 500 ppm, and at least an
old version of it uses a rate of 50 ppm to fix up fairly small drifts
in the milliseconds range.  500 ppm is enormous in CPU cycles -- it is
500 thousand nsec or 2 million cycles at 4GHz.  Winding up the timecounter
every 1 msec reduces this to only 2000 cycles.

More details of ordering and timing for 1 thread:
- thread N calls binuptime() and it loads timehands
- another or even the same thread runs tc_windup().  This modifies timehands.
- thread N is preempted for a long time, but less than the time for
   <number of timehands> updates
- thread N checks the generation count.  Since this is for the timehands
   contents and not for the timehands pointer, it hasn't changed, so the
   old timehands is used
- an instant later, the same thread calls binuptime() again and uses the
   new timehands
- now suppose only 2 timehands (as in -current) the worst (?) case of a
   slew of 500 ppm for the old timehands and -500 ppm for the new timehands
   and almost the worst case of 10 msec for the oldness of the old timehands
   relative to the new timehands, with the new timehands about to be updated
   after 10 msec (assume perfectly periodic updates every 10 msec).  The
   calculated times are:

   old bintime = old_base + (20 msec) * (1 + 500e-6)
   new base = old_base + 10 msec * (1 + 500e-6)    # calc by tc_windup()
   new bintime = new_base + (10 msec) * (1 - 500e-6) + epsilon

   error = epsilon - (20 msec) * 500e-6 = epsilon - 10000 nsec

Errors in the negative direction are most harmful.  ntpd normally doesn't
change the skew by as much as that in one step, but it is easy for adjtime(2)
to change the skew like that, and there are no reasonable microadjustments
that would accidentally work around this kernel bug (it seems unreasonable
to limit the skew to 1 ppm, and even that still gives an error of epsilon -
20 nsec).

phk didn't want to slow down timecounters using extra code to make
them monotonic and coherent (getbinuptime() is incoherent with
binuptime() since the former lags the latter by the update interval),
but this wouldn't be very slow within a thread.

Monotonicity across threads is a larger problem and not helped by using
a faked forced monotonic time within threads.

So it seems best to fix the above problem by moving the generation count
from the timehands contents to the timehands pointer, and maybe also
reduce the number of timehands to 1.  With 2 timehands, this gives a
shorter race:

- thread N loads timehands
- tc_windup()
- thread N preempted
- thread N uses old timehands
- case tc_windup() completes first: no problem -- thread N checks the
   generation count on the pointer and loops
- case binuptime() completes first: lost a race -- binuptime() is off
   by approx <tc_windup() execution time> * <difference in skews>.

The main point of having multiple timehands (after introducing the per-
timehands generation count) is to avoid blocking thread N during the
update, but this doesn't actually work, even for only 2 timehands and 
a global generation count.

Bruce


More information about the freebsd-ppc mailing list