powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]

Sat Apr 6 12:17:05 UTC 2019

On Sat, Apr 06, 2019 at 02:26:11AM +1100, Bruce Evans wrote:
> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
> 
> > On Sat, Apr 06, 2019 at 01:01:19AM +1100, Bruce Evans wrote:
> >> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
> >>
> >>> On Fri, Apr 05, 2019 at 11:52:27PM +1100, Bruce Evans wrote:
> >>>> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
> >>>>
> >>>>> On Thu, Apr 04, 2019 at 02:47:34AM +1100, Bruce Evans wrote:
> >>>>>> I noticed (or better realized) a general problem with multiple
> >>>>>> timehands.  ntpd can slew the clock at up to 500 ppm, and at least an
> >>>>>> old version of it uses a rate of 50 ppm to fix up fairly small drifts
> >>>>>> in the milliseconds range.  500 ppm is enormous in CPU cycles -- it is
> >>>>>> 500 thousand nsec or 2 million cycles at 4GHz.  Winding up the timecounter
> >>>>>> every 1 msec reduces this to only 2000 cycles.
> >>>>>> ...
> >>>>>> The main point of having multiple timehands (after introducing the per-
> >>>>>> timehands generation count) is to avoid blocking thread N during the
> >>>>>> update, but this doesn't actually work, even for only 2 timehands and
> >>>>>> a global generation count.
> >>>>>
> >>>>> You are describing the generic race between reader and writer. The same
> >>>>> race would exist even with one timehand (and/or one global generation
> >>>>> counter), where ntp adjustment might come earlier or later of some
> >>>>> consumer accessing the timehands. If timehand instance was read before
> >>>>> tc_windup() run but code consumed the result after the windup, it might
> >>>>> appear as if time went backward, and this cannot be fixed without either
> >>>>> re-reading the time after time-depended calculations were done and
> >>>>> restarting, or some globabl lock ensuring serialization.
> >>>>
> >>>> With 1 timehand, its generation count would be global.  I think its ordering
> >>>> is strong enough to ensure serialization.
> >>> Yes, single timehands result in global generation.  But it would not solve
> >>> the same race appearing in slightly different manner, as I described above.
> >>> If reader finished while generation number in th was not yet reset, but
> >>> caller uses the result after tc_windup(), the effect is same as if we
> >>> have two th's and reader used the outdated one.
> >>
> >> You described it too concisely for me to understand :-).
> >>
> >> I now see that a single generation count doesn't give serialization.  I
> >> thought that setting the generation to 0 at the start of tc_windup()
> >> serialized the reader and writer.  But all it does is prevent use of the
> >> results of the windup while only some of them are visible.  If the
> >> setting the generation count to 0 doesn't become before tc_windup() reads
> >> the hardware timecounter, then this read may be before other reads using
> >> the old timehands, but it needs to be after.
> > If we have either single th or global gen counter, current code would
> > become serialized, but this is not what I am about.  Lets assume, for
> 
> No, generation counts used like they are now (or in any way that I can
> see) can't give serialization.
> 
> > the sake of the discussion only, that all readers take the same spinlock
> > as tc_windup (i.e. tc_setclock_mtx).
> 
> Spinlocks are far too heavyweight.  Most of the complications in timecounter
> locking are to avoid using them.  But OK for the discussion.
> 
> > It is still possible that reader unlocked the mutex, tc_windup() kicked
> > in, locked the mutex and moved timehands (as you noted, this might
> > even happen on the same CPU), and only then the reader continues. For
> > consumer of bintime() or any other function' result, it looks exactly
> > the same as if we did not serialized with writer but used outdated
> > timehands.
> 
> Not with full serialization.  The writer tc_windup() is also a reader, and
> serializing its read gives the necessary monotonicity (for a single thread):
> - normal reader locks the mutex, reads the timecounter and unlocks.  The
>    mutex makes visible all previous writes, so the reader doesn't use a
>    stale timehands.  Consumers of bintime(), etc., use this time in the past.
> - tc_windup() locks the mutex, reads the timecounter hardware and writes the
>    timecounter soft state.  The new offset is after all previous times read,
>    since this is serialized.
> - normal reader as above sees the new state, so it reads times after the
>    time of the windup, so also after the time of previous normal reads.
> 
> > Let me formulate this diffeently: as far as consumer of the bintime()
> > result does not serialize itself with tc_windup(), serializing bintime()
> > itself against tc_windup() does not close the race, but it is not
> > obvious that the race matters.
> 
> Readers can easily see times far in the past, but the times increase in
> program order.
This is true for the current implementation (single-thread monotonic).

> 
> > Either we should just accept the race as
> > we currently do, or readers must take the spinlock where the exact value
> > of the current time is important,
> 
> Disabling interrupts should be enough.  In my version of 5.2, spinlocks
> don't disable hardware interrupts and may be preempted by fast interrupt
> handlers which may be not so fast and take hundreds of usec.  Actually,
> even disabling interrupts might not be enough.  A single isa bus read
> can take at least 138 usec (when it is behind a DMA queue or something).
> There are also NMI's and SMI's.
> 
> > or readers must re-read the time after
> > doing something important, and redo if the new measuremedtime is outside
> > the acceptable range.
> 
> This method seems to be essential for robustness.
> 
> But I don't see any race (for a single thread and no timecounter skew
> across CPUs).  Sloppy readers just see times an unknown but usually small
> time in the past.  Non-sloppy readers can also defend against timecounter
> skew by binding to 1 CPU.
Yes, this is true.

> 
> Mutex locking of the timecounter doesn't give monotonic times across threads.
> It gives some order, but you don't know which.  Another mutex or rendezvous
> is needed to control the order.
Define monotonic between threads ?