powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patch failed]

Mark Millard marklmi at yahoo.com
Fri Apr 5 19:35:09 UTC 2019



On 2019-Apr-5, at 08:26, Bruce Evans <brde optusnet.com.au> wrote:

> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
> 
>> On Sat, Apr 06, 2019 at 01:01:19AM +1100, Bruce Evans wrote:
>>> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
>>> 
>>>> On Fri, Apr 05, 2019 at 11:52:27PM +1100, Bruce Evans wrote:
>>>>> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
>>>>> 
>>>>>> On Thu, Apr 04, 2019 at 02:47:34AM +1100, Bruce Evans wrote:
>>>>>>> I noticed (or better realized) a general problem with multiple
>>>>>>> timehands.  ntpd can slew the clock at up to 500 ppm, and at least an
>>>>>>> old version of it uses a rate of 50 ppm to fix up fairly small drifts
>>>>>>> in the milliseconds range.  500 ppm is enormous in CPU cycles -- it is
>>>>>>> 500 thousand nsec or 2 million cycles at 4GHz.  Winding up the timecounter
>>>>>>> every 1 msec reduces this to only 2000 cycles.
>>>>>>> ...
>>>>>>> The main point of having multiple timehands (after introducing the per-
>>>>>>> timehands generation count) is to avoid blocking thread N during the
>>>>>>> update, but this doesn't actually work, even for only 2 timehands and
>>>>>>> a global generation count.
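
To make the 500 ppm arithmetic above concrete, here it is as a small
throwaway program; the 4 GHz clock and the two windup intervals are just
the example values from the quoted text:

        #include <stdio.h>

        int
        main(void)
        {
                double ppm = 500e-6;            /* max ntpd slew rate */
                double ghz = 4.0;               /* example CPU clock */
                double windup_s[] = { 1.0, 1e-3 };

                for (int i = 0; i < 2; i++) {
                        double nsec = ppm * windup_s[i] * 1e9;
                        printf("windup %.0f msec: %.0f nsec = %.0f cycles\n",
                            windup_s[i] * 1e3, nsec, nsec * ghz);
                }
                return (0);
        }

It prints 500000 nsec = 2000000 cycles for a 1 second windup interval and
500 nsec = 2000 cycles for a 1 msec one, matching the numbers above.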
>>>>>> 
>> You are describing the generic race between reader and writer.  The same
>> race would exist even with one timehands (and/or one global generation
>> counter), where the ntp adjustment might come earlier or later than some
>> consumer's access to the timehands.  If the timehands instance was read
>> before tc_windup() ran but the code consumed the result after the windup,
>> it might appear as if time went backward, and this cannot be fixed without
>> either re-reading the time after the time-dependent calculations were done
>> and restarting, or some global lock ensuring serialization.
>>>>> 
>>>>> With 1 timehands, its generation count would be global.  I think its ordering
>>>>> is strong enough to ensure serialization.
>>>> Yes, a single timehands results in a global generation.  But it would
>>>> not solve the same race, which appears in a slightly different manner,
>>>> as I described above.  If the reader finished while the generation
>>>> number in th was not yet reset, but the caller uses the result after
>>>> tc_windup(), the effect is the same as if we had two th's and the
>>>> reader used the outdated one.
>>> 
>>> You described it too concisely for me to understand :-).
>>> 
>>> I now see that a single generation count doesn't give serialization.  I
>>> thought that setting the generation to 0 at the start of tc_windup()
>>> serialized the reader and writer.  But all it does is prevent use of the
>>> results of the windup while only some of them are visible.  If the
>>> setting of the generation count to 0 doesn't become visible before
>>> tc_windup() reads the hardware timecounter, then this read may be ordered
>>> before other reads using the old timehands, but it needs to be after.
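
For reference, the reader protocol under discussion is the binuptime()-style
loop in sys/kern/kern_tc.c; paraphrased from memory here, so details may
differ from the exact revision:

        void
        binuptime(struct bintime *bt)
        {
                struct timehands *th;
                u_int gen;

                do {
                        th = timehands;
                        gen = atomic_load_acq_int(&th->th_generation);
                        *bt = th->th_offset;
                        /*
                         * This product is the overflow site named in
                         * the Subject line.
                         */
                        bintime_addx(bt, th->th_scale * tc_delta(th));
                        atomic_thread_fence_acq();
                } while (gen == 0 || gen != th->th_generation);
        }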
>> If we had either a single th or a global gen counter, the current code
>> would become serialized, but this is not what I am talking about.  Let's
>> assume, for
> 
> No, generation counts used like they are now (or in any way that I can
> see) can't give serialization.
> 
>> the sake of the discussion only, that all readers take the same spinlock
>> as tc_windup (i.e. tc_setclock_mtx).
> 
> Spinlocks are far too heavyweight.  Most of the complications in timecounter
> locking are to avoid using them.  But OK for the discussion.
> 
>> It is still possible that the reader unlocked the mutex, tc_windup()
>> kicked in, locked the mutex and moved the timehands (as you noted, this
>> might even happen on the same CPU), and only then the reader continued.
>> For a consumer of bintime() or any other function's result, it looks
>> exactly the same as if we did not serialize with the writer but used
>> outdated timehands.
> 
> Not with full serialization.  The writer tc_windup() is also a reader, and
> serializing its read gives the necessary monotonicity (for a single thread):
> - normal reader locks the mutex, reads the timecounter and unlocks.  The
>  mutex makes visible all previous writes, so the reader doesn't use a
>  stale timehands.  Consumers of bintime(), etc., use this time in the past.
> - tc_windup() locks the mutex, reads the timecounter hardware and writes the
>  timecounter soft state.  The new offset is after all previous times read,
>  since this is serialized.
> - normal reader as above sees the new state, so it reads times after the
>  time of the windup, so also after the time of previous normal reads.
> 
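
A sketch of what that fully serialized scheme would look like, for the
discussion only (the function name is made up, and as noted above a
spinlock on every read is far too heavyweight for the real paths):

        /*
         * Discussion-only sketch: every reader takes the same spin
         * mutex as tc_windup(), so reads and windups are totally
         * ordered.  Not proposed as a real fast path.
         */
        void
        bintime_fully_serialized(struct bintime *bt)
        {
                mtx_lock_spin(&tc_setclock_mtx);
                binuptime(bt);          /* cannot interleave with tc_windup() */
                mtx_unlock_spin(&tc_setclock_mtx);
        }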
>> Let me formulate this differently: as long as the consumer of the
>> bintime() result does not serialize itself with tc_windup(), serializing
>> bintime() itself against tc_windup() does not close the race, but it is
>> not obvious that the race matters.
> 
> Readers can easily see times far in the past, but the times increase in
> program order.
> 
>> Either we should just accept the race as
>> we currently do, or readers must take the spinlock where the exact value
>> of the current time is important,
> 
> Disabling interrupts should be enough.  In my version of 5.2, spinlocks
> don't disable hardware interrupts and may be preempted by fast interrupt
> handlers which may not be so fast and take hundreds of usec.  Actually,
> even disabling interrupts might not be enough.  A single isa bus read
> can take at least 138 usec (when it is behind a DMA queue or something).
> There are also NMI's and SMI's.
> 
>> or readers must re-read the time after
>> doing something important, and redo if the newly measured time is outside
>> the acceptable range.
> 
> This method seems to be essential for robustness.
> 
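
A sketch of the re-read-and-redo idea; do_time_critical_work() and
MAX_ACCEPTABLE_DELTA are hypothetical names for illustration:

        /*
         * Take a timestamp, do the time-dependent work, then re-read
         * and redo everything if the two reads are inconsistent or
         * too far apart for this consumer.
         */
        sbintime_t before, after;

        do {
                before = sbinuptime();
                do_time_critical_work();
                after = sbinuptime();
        } while (after < before || after - before > MAX_ACCEPTABLE_DELTA);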
> But I don't see any race (for a single thread and no timecounter skew
> across CPUs).  Sloppy readers just see times an unknown but usually small
> time in the past.  Non-sloppy readers can also defend against timecounter
> skew by binding to 1 CPU.
> 
> Mutex locking of the timecounter doesn't give monotonic times across threads.
> It gives some order, but you don't know which.  Another mutex or rendezvous
> is needed to control the order.
> 

Just for context on the original problem, in case it helps:

sleepq_timeout went into this case:

       if (td->td_sleeptimo > sbinuptime() || td->td_sleeptimo == 0) {
               /*
                * The thread does not want a timeout (yet).
                */

and after that the specific sleep's timeout was not tried again (deleted?),
hence the hang for the sleeping thread.
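
(For how the Subject line's overflow can produce this: th_scale is
approximately 2**64 / tc_frequency, so the 64-bit product
th_scale * tc_delta(th) wraps once tc_delta(th) reaches about
tc_frequency counts, i.e. once roughly one second passes between
tc_windup() calls.  The wrapped, too-small product makes sbinuptime()
appear to jump backward, so td_sleeptimo still looks like it is in the
future.)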

This was with a call backtrace that looked like the following
at the time:

0xe00000009af7c730: at sleepq_timeout+0x148
0xe00000009af7c7d0: at softclock_call_cc+0x234
0xe00000009af7c910: at callout_process+0x2e0
0xe00000009af7c9f0: at handleevents+0x22c
0xe00000009af7caa0: at timercb+0x340
0xe00000009af7cba0: at decr_intr+0x140
0xe00000009af7cbd0: at powerpc_interrupt+0x268

(I added a call to cause the backtrace to be reported.)

For this call chain:

timercb gets a "now" value that is passed along
into callout_process but not to softclock_call_cc or
sleepq_timeout.

Here callout_process is doing CALLOUT_DIRECT handling, where
it calls softclock_call_cc directly:

. . .
        /* Iterate callwheel from firstb to nowb and then up to lastb. */
        do {
                sc = &cc->cc_callwheel[firstb & callwheelmask];
                tmp = LIST_FIRST(sc);
                while (tmp != NULL) {
                        /* Run the callout if present time within allowed. */
                        if (tmp->c_time <= now) {
                                /*
                                 * Consumer told us the callout may be run
                                 * directly from hardware interrupt context.
                                 */
                                if (tmp->c_iflags & CALLOUT_DIRECT) {
#ifdef CALLOUT_PROFILING
                                        ++depth_dir;
#endif
                                        cc_exec_next(cc) =
                                            LIST_NEXT(tmp, c_links.le);
                                        cc->cc_bucket = firstb & callwheelmask;
                                        LIST_REMOVE(tmp, c_links.le);
                                        softclock_call_cc(tmp, cc,
#ifdef CALLOUT_PROFILING
                                            &mpcalls_dir, &lockcalls_dir, NULL,
#endif
                                            1);
                                        tmp = cc_exec_next(cc);
                                        cc_exec_next(cc) = NULL;
                                } else {
. . .


===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)


