powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]

Konstantin Belousov kostikbel at gmail.com
Fri Apr 5 11:39:23 UTC 2019


On Thu, Apr 04, 2019 at 02:47:34AM +1100, Bruce Evans wrote:
> I noticed (or better realized) a general problem with multiple
> timehands.  ntpd can slew the clock at up to 500 ppm, and at least an
> old version of it uses a rate of 50 ppm to fix up fairly small drifts
> in the milliseconds range.  500 ppm is enormous in CPU cycles -- it is
> 500 thousand nsec or 2 million cycles at 4GHz.  Winding up the timecounter
> every 1 msec reduces this to only 2000 cycles.
> 
> More details of ordering and timing for 1 thread:
> - thread N calls binuptime() and it loads timehands
> - another or even the same thread runs tc_windup().  This modifies timehands.
> - thread N is preempted for a long time, but less than the time for
>    <number of timehands> updates
> - thread N checks the generation count.  Since this is for the timehands
>    contents and not for the timehands pointer, it hasn't changed, so the
>    old timehands is used
> - an instant later, the same thread calls binuptime() again and uses the
>    new timehands 
> - now suppose only 2 timehands (as in -current) the worst (?) case of a
>    slew of 500 ppm for the old timehands and -500 ppm for the new timehands
>    and almost the worst case of 10 msec for the oldness of the old timehands
>    relative to the new timehands, with the new timehands about to be updated
>    after 10 msec (assume perfectly periodic updates every 10 msec).  The
>    calculated times are:
> 
>    old bintime = old_base + (20 msec) * (1 + 500e-6)
>    new base = old_base + 10 msec * (1 + 500e-6)    # calc by tc_windup()
>    new bintime = new_base + (10 msec) * (1 - 500e-6) + epsilon
> 
>    error = epsilon - (20 msec) * 500e-6 = epsilon - 10000 nsec
> 
> Errors in the negative direction are most harmful.  ntpd normally doesn't
> change the skew as much as that in one step, but it is easy for adjtime(2)
> to change the skew like that and there are no reasonable microadjustments
> that would accidentally work around this kernel bug (it seems unreasonable
> to limit the skew to 1 ppm, and even that still gives an error of
> epsilon + 20 nsec).
> 
> phk didn't want to slow down timecounters using extra code to make
> them monotonic and coherent (getbinuptime() is incoherent with
> binuptime() since the former lags the latter by the update interval),
> but this wouldn't be very slow within a thread.
> 
> Monotonicity across threads is a larger problem and not helped by using
> a faked forced monotonic time within threads.
> 
> So it seems best to fix the above problem by moving the generation count
> from the timehands contents to the timehands pointer, and maybe also
> reduce the number of timehands to 1.  With 2 timehands, this gives a
> shorter race:
> 
> - thread N loads timehands
> - tc_windup()
> - thread N preempted
> - thread N uses old timehands
> - case tc_windup() completes first: no problem -- thread N checks the
>    generation count on the pointer and loops
> - case binuptime() completes first: lost a race -- binuptime() is off
>    by approx <tc_windup() execution time> * <difference in skews>.
> 
> The main point of having multiple timehands (after introducing the per-
> timehands generation count) is to avoid blocking thread N during the
> update, but this doesn't actually work, even for only 2 timehands and 
> a global generation count.

You are describing the generic race between reader and writer. The same
race would exist even with one timehands (and/or one global generation
counter), where an ntp adjustment might come earlier or later than some
consumer accessing the timehands. If a timehands instance was read before
tc_windup() ran but the code consumed the result after the windup, it might
appear as if time went backward, and this cannot be fixed without either
re-reading the time after the time-dependent calculations are done and
restarting, or some global lock ensuring serialization.


More information about the freebsd-ppc mailing list