powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patch failed]
Mark Millard
marklmi at yahoo.com
Thu Apr 4 20:16:05 UTC 2019
On 2019-Apr-3, at 08:47, Bruce Evans <brde at optusnet.com.au> wrote:
> . . .
>
> I noticed (or better realized) a general problem with multiple
> timehands. ntpd can slew the clock at up to 500 ppm, and at least an
> old version of it uses a rate of 50 ppm to fix up fairly small drifts
> in the milliseconds range. 500 ppm is enormous in CPU cycles -- it is
> 500 thousand nsec or 2 million cycles at 4GHz. Winding up the timecounter
> every 1 msec reduces this to only 2000 cycles.
>
> More details of ordering and timing for 1 thread:
> - thread N calls binuptime() and it loads timehands
> - another or even the same thread runs tc_windup(). This modifies timehands.
> - thread N is preempted for a long time, but less than the time for
> <number of timehands> updates
> - thread N checks the generation count. Since this is for the timehands
> contents and not for the timehands pointer, it hasn't changed, so the
> old timehands is used
> - an instant later, the same thread calls binuptime() again and uses the
>   new timehands
>
> Now suppose only 2 timehands (as in -current), the worst (?) case of a
> slew of 500 ppm for the old timehands and -500 ppm for the new timehands,
> and almost the worst case of 10 msec for the oldness of the old timehands
> relative to the new timehands, with the new timehands about to be updated
> after 10 msec (assume perfectly periodic updates every 10 msec). The
> calculated times are:
> calculated times are:
>
> old bintime = old_base + (20 msec) * (1 + 500e-6)
> new base = old_base + 10 msec * (1 + 500e-6) # calc by tc_windup()
> new bintime = new_base + (10 msec) * (1 - 500e-6) + epsilon
>
> error = epsilon - (20 msec) * 500e-6 = epsilon - 10000 nsec
>
> Errors in the negative direction are most harmful. ntpd normally doesn't
> change the skew as much as that in one step, but it is easy for adjtime(2)
> to change the skew like that and there are no reasonable microadjustments
> that would accidentally work around this kernel bug (it seems unreasonable
> to limit the skew to 1 ppm and that still gives an error of epsilon + 20
> nsec).
>
> phk didn't want to slow down timecounters using extra code to make
> them monotonic and coherent (getbinuptime() is incoherent with
> binuptime() since the former lags the latter by the update interval),
> but this wouldn't be very slow within a thread.
>
> Monotonicity across threads is a larger problem and not helped by using
> a faked forced monotonic time within threads.
>
> So it seems best to fix the above problem by moving the generation count
> from the timehands contents to the timehands pointer, and maybe also
> reduce the number of timehands to 1. With 2 timehands, this gives a
> shorter race:
>
> - thread N loads timehands
> - tc_windup()
> - thread N preempted
> - thread N uses old timehands
> - case tc_windup() completes first: no problem -- thread N checks the
> generation count on the pointer and loops
> - case binuptime() completes first: lost a race -- binuptime() is off
> by approx <tc_windup() execution time> * <difference in skews>.
>
> The main point of having multiple timehands (after introducing the per-
> timehands generation count) is to avoid blocking thread N during the
> update, but this doesn't actually work, even for only 2 timehands and a global generation count.
>
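To make the quoted arithmetic concrete, here is a small sketch (in Python,
using integer nanoseconds; the helper name is mine, not from the kernel) of
the two-timehands worst case described above:

```python
# Worst-case two-timehands scenario from the quoted example:
# the old timehands slews at +500 ppm, the new one at -500 ppm,
# the old timehands is 10 msec stale, and tc_windup() based the
# new timehands on a further 10 msec interval.

MSEC = 1_000_000   # nanoseconds per millisecond
PPM = 1_000_000

def slewed(interval, ppm):
    """Advance `interval` nsec at a clock skewed by `ppm` parts per million."""
    return interval + interval * ppm // PPM

old_base = 0  # arbitrary origin, nsec
old_bintime = old_base + slewed(20 * MSEC, +500)  # read via the stale timehands
new_base = old_base + slewed(10 * MSEC, +500)     # computed by tc_windup()
new_bintime = new_base + slewed(10 * MSEC, -500)  # + epsilon, read just after

error = new_bintime - old_bintime  # ignoring epsilon
print(error)  # -10000: the clock appears to step back 10 usec
```

This reproduces the "epsilon - 10000 nsec" error in the quoted derivation.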
Thanks for describing an example of how sbinuptime() and
the like might fail to give weakly increasing results.
Unfortunately, all the multi-socket contexts that I sometimes have
access to are old PowerMacs. And, currently, the only such context
is the G5 with 2 sockets, 2 cores per socket (powerpc64). So I've
not been able to set up other types of examples to see if problems
repeat.
I do not have access to a single-socket powerpc64 for contrast in
that direction.
One oddity is that the eventtimer's decrementer and timecounter
may change (nearly) together: both change at 33,333,333 Hz, as if
they are tied to the same clock (at least on one socket).
In case it helps with knowing how analogous your investigations
are to the original problem context, I report the following.
If you do not care for such information, stop reading here.
# grep ntpd /etc/rc.conf
ntpd_enable="YES"
ntpd_sync_on_start="YES"
# sysctl kern.eventtimer
kern.eventtimer.periodic: 0
kern.eventtimer.timer: decrementer
kern.eventtimer.idletick: 0
kern.eventtimer.singlemul: 2
kern.eventtimer.choice: decrementer(1000)
kern.eventtimer.et.decrementer.quality: 1000
kern.eventtimer.et.decrementer.frequency: 33333333
kern.eventtimer.et.decrementer.flags: 7
# vmstat -ai | grep decrementer
cpu0:decrementer 4451007 35
cpu3:decrementer 1466010 11
cpu2:decrementer 1481722 12
cpu1:decrementer 1478618 12
(That last is from a basically-idle timeframe.)
# sysctl -a | grep hz
kern.clockrate: { hz = 1000, tick = 1000, profhz = 8128, stathz = 127 }
kern.hz: 1000
# sysctl kern.timecounter
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: timebase(0) dummy(-1000000)
kern.timecounter.hardware: timebase
kern.timecounter.alloweddeviation: 5
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.timebase.quality: 0
kern.timecounter.tc.timebase.frequency: 33333333
kern.timecounter.tc.timebase.counter: 1144662532
kern.timecounter.tc.timebase.mask: 4294967295
(The actual Time Base Register (tbr) is 64 bits;
FreeBSD truncates it down to 32 bits.)
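The subject-line overflow can be illustrated with this machine's numbers.
As I understand it, th->th_scale is roughly the 64-bit fixed-point value
2**64 / tc_frequency (fractional seconds per counter tick), so the 64-bit
product th_scale * tc_delta(th) wraps once the delta exceeds about one
second's worth of ticks (a simplified model, ignoring ntp adjustment):

```python
# Sketch of why th->th_scale * tc_delta(th) can overflow unsigned 64 bits.
# Assumed model: scale = 2**64 // freq (fractional seconds per tick).

freq = 33333333          # timebase frequency from the sysctl output above, Hz
mask = 4294967295        # 32-bit counter mask from above
scale = 2**64 // freq    # approximates th->th_scale

# The 64-bit product wraps once the delta passes ~1 second of ticks:
print(scale * (freq - 1) < 2**64)    # True: just under 1 second still fits
print(scale * (freq + 1) >= 2**64)   # True: just over 1 second overflows

# But the 32-bit mask can represent a delta of about 128 seconds, so a
# stale timehands can easily present a delta far past the 1-second limit:
print(mask // freq)  # 128
```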
# sysctl -a | grep 'cpu.*freq'
device cpufreq
debug.cpufreq.verbose: 0
debug.cpufreq.lowest: 0
dev.cpufreq.0.%parent: cpu3
dev.cpufreq.0.%pnpinfo:
dev.cpufreq.0.%location:
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%desc:
dev.cpufreq.%parent:
dev.cpu.3.freq_levels: 2500/-1 1250/-1
dev.cpu.3.freq: 2500
So 2500 MHz / 33333333 Hz is very near 75 clock periods per
timebase counter value.
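A quick check of that ratio:

```python
# CPU clock periods per timebase count, from the sysctl figures above.
cpu_hz = 2500 * 10**6    # dev.cpu.3.freq: 2500 MHz
tb_hz = 33333333         # timebase frequency, Hz

print(round(cpu_hz / tb_hz))  # 75
```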
I do sometimes have access to a Ryzen Threadripper 1950X based system:
FreeBSD/SMP: Multiprocessor System Detected: 32 CPUs
FreeBSD/SMP: 1 package(s) x 2 groups x 2 cache groups x 4 core(s) x 2 hardware threads
but it is single-socket. It has . . .
[It turns out that I've accidentally been running it without
ntpd being enabled in /etc/rc.conf . Just fixed for next boot.]
# sysctl kern.eventtimer
kern.eventtimer.periodic: 0
kern.eventtimer.timer: LAPIC
kern.eventtimer.idletick: 0
kern.eventtimer.singlemul: 2
kern.eventtimer.choice: LAPIC(600) HPET(350) HPET1(350) HPET2(350) i8254(100) RTC(0)
kern.eventtimer.et.RTC.quality: 0
kern.eventtimer.et.RTC.frequency: 32768
kern.eventtimer.et.RTC.flags: 17
kern.eventtimer.et.i8254.quality: 100
kern.eventtimer.et.i8254.frequency: 1193182
kern.eventtimer.et.i8254.flags: 1
kern.eventtimer.et.HPET2.quality: 350
kern.eventtimer.et.HPET2.frequency: 14318180
kern.eventtimer.et.HPET2.flags: 3
kern.eventtimer.et.HPET1.quality: 350
kern.eventtimer.et.HPET1.frequency: 14318180
kern.eventtimer.et.HPET1.flags: 3
kern.eventtimer.et.HPET.quality: 350
kern.eventtimer.et.HPET.frequency: 14318180
kern.eventtimer.et.HPET.flags: 3
kern.eventtimer.et.LAPIC.quality: 600
kern.eventtimer.et.LAPIC.frequency: 49907650
kern.eventtimer.et.LAPIC.flags: 7
# vmstat -ai | grep timer
irq0: attimer0 0 0
cpu0:timer 609974 14
cpu1:timer 160546 4
cpu2:timer 99803 2
cpu3:timer 158073 4
cpu4:timer 102870 2
cpu5:timer 168433 4
cpu6:timer 163004 4
cpu7:timer 162947 4
cpu8:timer 163501 4
cpu9:timer 160595 4
cpu10:timer 169100 4
cpu11:timer 163888 4
cpu12:timer 144278 3
cpu13:timer 162706 4
cpu14:timer 164161 4
cpu15:timer 167101 4
cpu16:timer 187732 4
cpu17:timer 189228 4
cpu18:timer 179423 4
cpu19:timer 182701 4
cpu20:timer 139658 3
cpu21:timer 186046 4
cpu22:timer 201248 5
cpu23:timer 184823 4
cpu24:timer 186839 4
cpu25:timer 186008 4
cpu26:timer 191473 4
cpu27:timer 182573 4
cpu28:timer 181213 4
cpu29:timer 197659 5
cpu30:timer 188190 4
cpu31:timer 189092 4
(Again: from a basically-idle timeframe.)
# sysctl -a | grep hz
kern.clockrate: { hz = 1000, tick = 1000, profhz = 8128, stathz = 127 }
kern.hz: 1000
debug.psm.hz: 20
# sysctl -a | grep 'cpu.*freq'
device cpufreq
debug.cpufreq.verbose: 0
debug.cpufreq.lowest: 0
dev.cpufreq.0.%parent: cpu0
dev.cpufreq.0.%pnpinfo:
dev.cpufreq.0.%location:
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%desc:
dev.cpufreq.%parent:
dev.cpu.0.freq_levels: 3400/-1 2800/-1 2200/-1
dev.cpu.0.freq: 3400
# sysctl kern.timecounter
kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 1
kern.timecounter.invariant_tsc: 1
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: ACPI-fast(900) i8254(0) HPET(950) TSC-low(1000) dummy(-1000000)
kern.timecounter.hardware: TSC-low
kern.timecounter.alloweddeviation: 5
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.counter: 3941257163
kern.timecounter.tc.ACPI-fast.mask: 4294967295
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.counter: 54249
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.HPET.quality: 950
kern.timecounter.tc.HPET.frequency: 14318180
kern.timecounter.tc.HPET.counter: 796737466
kern.timecounter.tc.HPET.mask: 4294967295
kern.timecounter.tc.TSC-low.quality: 1000
kern.timecounter.tc.TSC-low.frequency: 1696860321
kern.timecounter.tc.TSC-low.counter: 3613142695
kern.timecounter.tc.TSC-low.mask: 4294967295
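If I understand TSC-low correctly (the TSC right-shifted by
kern.timecounter.tsc_shift, here 1), the reported frequency is consistent
with the nominal 3400 MHz core clock, the small difference being the
calibrated versus nominal rate:

```python
# Check that TSC-low.frequency is about the nominal core clock >> tsc_shift.
nominal_tsc_hz = 3400 * 10**6   # dev.cpu.0.freq: 3400 MHz
tsc_shift = 1                   # kern.timecounter.tsc_shift
reported = 1696860321           # kern.timecounter.tc.TSC-low.frequency

expected = nominal_tsc_hz >> tsc_shift   # 1700000000
print(abs(expected - reported) / reported < 0.002)  # True: within ~0.2%
```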
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
More information about the freebsd-ppc
mailing list