powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patch failed]
Mark Millard
marklmi at yahoo.com
Thu Apr 4 20:16:05 UTC 2019
On 2019-Apr-3, at 08:47, Bruce Evans <brde at optusnet.com.au> wrote:
> . . .
>
> I noticed (or better realized) a general problem with multiple
> timehands. ntpd can slew the clock at up to 500 ppm, and at least an
> old version of it uses a rate of 50 ppm to fix up fairly small drifts
> in the milliseconds range. 500 ppm is enormous in CPU cycles -- it is
> 500 thousand nsec or 2 million cycles at 4GHz. Winding up the timecounter
> every 1 msec reduces this to only 2000 cycles.
>
> More details of ordering and timing for 1 thread:
> - thread N calls binuptime() and it loads timehands
> - another or even the same thread runs tc_windup(). This modifies timehands.
> - thread N is preempted for a long time, but less than the time for
> <number of timehands> updates
> - thread N checks the generation count. Since this is for the timehands
> contents and not for the timehands pointer, it hasn't changed, so the
> old timehands is used
> - an instant later, the same thread calls binuptime() again and uses the
>   new timehands
>
> Now suppose only 2 timehands (as in -current), the worst (?) case of a
> slew of 500 ppm for the old timehands and -500 ppm for the new timehands,
> and almost the worst case of 10 msec for the oldness of the old timehands
> relative to the new timehands, with the new timehands about to be updated
> after 10 msec (assume perfectly periodic updates every 10 msec). The
> calculated times are:
> calculated times are:
>
> old bintime = old_base + (20 msec) * (1 + 500e-6)
> new base = old_base + 10 msec * (1 + 500e-6) # calc by tc_windup()
> new bintime = new_base + (10 msec) * (1 - 500e-6) + epsilon
>
> error = epsilon - (20 msec) * 500e-6 = epsilon - 10000 nsec
>
> Errors in the negative direction are most harmful. ntpd normally doesn't
> change the skew as much as that in one step, but it is easy for adjtime(2)
> to change the skew like that and there are no reasonable microadjustments
> that would accidentally work around this kernel bug (it seems unreasonable
> to limit the skew to 1 ppm and that still gives an error of epsilon + 20
> nsec).
>
> phk didn't want to slow down timecounters using extra code to make
> them monotonic and coherent (getbinuptime() is incoherent with
> binuptime() since the former lags the latter by the update interval),
> but this wouldn't be very slow within a thread.
>
> Monotonicity across threads is a larger problem and not helped by using
> a faked forced monotonic time within threads.
>
> So it seems best to fix the above problem by moving the generation count
> from the timehands contents to the timehands pointer, and maybe also
> reduce the number of timehands to 1. With 2 timehands, this gives a
> shorter race:
>
> - thread N loads timehands
> - tc_windup()
> - thread N preempted
> - thread N uses old timehands
> - case tc_windup() completes first: no problem -- thread N checks the
> generation count on the pointer and loops
> - case binuptime() completes first: lost a race -- binuptime() is off
> by approx <tc_windup() execution time> * <difference in skews>.
>
> The main point of having multiple timehands (after introducing the per-
> timehands generation count) is to avoid blocking thread N during the
> update, but this doesn't actually work, even for only 2 timehands and a global generation count.
>
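To make the quoted arithmetic concrete, here is a small sketch (in Python,
using integer nanoseconds; the helper name is mine, not from the kernel) of
the two-timehands worst case described above:

```python
# Worst-case two-timehands scenario from the quoted example:
# the old timehands slews at +500 ppm, the new one at -500 ppm,
# the old timehands is 10 msec stale, and tc_windup() based the
# new timehands on a further 10 msec interval.

MSEC = 1_000_000   # nanoseconds per millisecond
PPM = 1_000_000

def slewed(interval, ppm):
    """Advance `interval` nsec at a clock skewed by `ppm` parts per million."""
    return interval + interval * ppm // PPM

old_base = 0  # arbitrary origin, nsec
old_bintime = old_base + slewed(20 * MSEC, +500)  # read via the stale timehands
new_base = old_base + slewed(10 * MSEC, +500)     # computed by tc_windup()
new_bintime = new_base + slewed(10 * MSEC, -500)  # + epsilon, read just after

error = new_bintime - old_bintime  # ignoring epsilon
print(error)  # -10000: the clock appears to step back 10 usec
```

This reproduces the "epsilon - 10000 nsec" error in the quoted derivation.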
Thanks for describing an example of how sbinuptime() and
the like might fail to give weakly increasing results.
Unfortunately, all the multi-socket contexts that I sometimes have
access to are old PowerMacs. And, currently, the only such context
is the G5 with 2 sockets, 2 cores per socket (powerpc64). So I've
not been able to set up other types of examples to see if problems
repeat.
I do not have access to a single-socket powerpc64 for contrast in
that direction.
One oddity is that the eventtimer's decrementer and timecounter
may change (nearly) together: both change at 33,333,333 Hz, as if
they are tied to the same clock (at least on one socket).
In case it helps with knowing how analogous your investigations
are to the original problem context, I report the following.
If you do not care for such information, stop reading here.
# grep ntpd /etc/rc.conf
ntpd_enable="YES"
ntpd_sync_on_start="YES"
# sysctl kern.eventtimer
kern.eventtimer.periodic: 0
kern.eventtimer.timer: decrementer
kern.eventtimer.idletick: 0
kern.eventtimer.singlemul: 2
kern.eventtimer.choice: decrementer(1000)
kern.eventtimer.et.decrementer.quality: 1000
kern.eventtimer.et.decrementer.frequency: 33333333
kern.eventtimer.et.decrementer.flags: 7
# vmstat -ai | grep decrementer
cpu0:decrementer 4451007 35
cpu3:decrementer 1466010 11
cpu2:decrementer 1481722 12
cpu1:decrementer 1478618 12
(That last is from a basically-idle timeframe.)
# sysctl -a | grep hz
kern.clockrate: { hz = 1000, tick = 1000, profhz = 8128, stathz = 127 }
kern.hz: 1000
# sysctl kern.timecounter
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: timebase(0) dummy(-1000000)
kern.timecounter.hardware: timebase
kern.timecounter.alloweddeviation: 5
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.timebase.quality: 0
kern.timecounter.tc.timebase.frequency: 33333333
kern.timecounter.tc.timebase.counter: 1144662532
kern.timecounter.tc.timebase.mask: 4294967295
(The actual Time Base Register (tbr) is 64 bits;
FreeBSD truncates it down to 32 bits.)
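The subject-line overflow can be illustrated with this machine's numbers.
As I understand it, th->th_scale is roughly the 64-bit fixed-point value
2**64 / tc_frequency (fractional seconds per counter tick), so the 64-bit
product th_scale * tc_delta(th) wraps once the delta exceeds about one
second's worth of ticks (a simplified model, ignoring ntp adjustment):

```python
# Sketch of why th->th_scale * tc_delta(th) can overflow unsigned 64 bits.
# Assumed model: scale = 2**64 // freq (fractional seconds per tick).

freq = 33333333          # timebase frequency from the sysctl output above, Hz
mask = 4294967295        # 32-bit counter mask from above
scale = 2**64 // freq    # approximates th->th_scale

# The 64-bit product wraps once the delta passes ~1 second of ticks:
print(scale * (freq - 1) < 2**64)    # True: just under 1 second still fits
print(scale * (freq + 1) >= 2**64)   # True: just over 1 second overflows

# But the 32-bit mask can represent a delta of about 128 seconds, so a
# stale timehands can easily present a delta far past the 1-second limit:
print(mask // freq)  # 128
```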
# sysctl -a | grep 'cpu.*freq'
device cpufreq
debug.cpufreq.verbose: 0
debug.cpufreq.lowest: 0
dev.cpufreq.0.%parent: cpu3
dev.cpufreq.0.%pnpinfo:
dev.cpufreq.0.%location:
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%desc:
dev.cpufreq.%parent:
dev.cpu.3.freq_levels: 2500/-1 1250/-1
dev.cpu.3.freq: 2500
So 2500 MHz / 33333333 Hz is very near 75 clock periods per
timebase counter value.
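A quick check of that ratio:

```python
# CPU clock periods per timebase count, from the sysctl figures above.
cpu_hz = 2500 * 10**6    # dev.cpu.3.freq: 2500 MHz
tb_hz = 33333333         # timebase frequency, Hz

print(round(cpu_hz / tb_hz))  # 75
```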
I do sometimes have access to a Ryzen Threadripper 1950X based system:
FreeBSD/SMP: Multiprocessor System Detected: 32 CPUs
FreeBSD/SMP: 1 package(s) x 2 groups x 2 cache groups x 4 core(s) x 2 hardware threads
but it is single-socket. It has . . .
[It turns out that I've accidentally been running it without
ntpd being enabled in /etc/rc.conf . Just fixed for next boot.]
# sysctl kern.eventtimer
kern.eventtimer.periodic: 0
kern.eventtimer.timer: LAPIC
kern.eventtimer.idletick: 0
kern.eventtimer.singlemul: 2
kern.eventtimer.choice: LAPIC(600) HPET(350) HPET1(350) HPET2(350) i8254(100) RTC(0)
kern.eventtimer.et.RTC.quality: 0
kern.eventtimer.et.RTC.frequency: 32768
kern.eventtimer.et.RTC.flags: 17
kern.eventtimer.et.i8254.quality: 100
kern.eventtimer.et.i8254.frequency: 1193182
kern.eventtimer.et.i8254.flags: 1
kern.eventtimer.et.HPET2.quality: 350
kern.eventtimer.et.HPET2.frequency: 14318180
kern.eventtimer.et.HPET2.flags: 3
kern.eventtimer.et.HPET1.quality: 350
kern.eventtimer.et.HPET1.frequency: 14318180
kern.eventtimer.et.HPET1.flags: 3
kern.eventtimer.et.HPET.quality: 350
kern.eventtimer.et.HPET.frequency: 14318180
kern.eventtimer.et.HPET.flags: 3
kern.eventtimer.et.LAPIC.quality: 600
kern.eventtimer.et.LAPIC.frequency: 49907650
kern.eventtimer.et.LAPIC.flags: 7
# vmstat -ai | grep timer
irq0: attimer0 0 0
cpu0:timer 609974 14
cpu1:timer 160546 4
cpu2:timer 99803 2
cpu3:timer 158073 4
cpu4:timer 102870 2
cpu5:timer 168433 4
cpu6:timer 163004 4
cpu7:timer 162947 4
cpu8:timer 163501 4
cpu9:timer 160595 4
cpu10:timer 169100 4
cpu11:timer 163888 4
cpu12:timer 144278 3
cpu13:timer 162706 4
cpu14:timer 164161 4
cpu15:timer 167101 4
cpu16:timer 187732 4
cpu17:timer 189228 4
cpu18:timer 179423 4
cpu19:timer 182701 4
cpu20:timer 139658 3
cpu21:timer 186046 4
cpu22:timer 201248 5
cpu23:timer 184823 4
cpu24:timer 186839 4
cpu25:timer 186008 4
cpu26:timer 191473 4
cpu27:timer 182573 4
cpu28:timer 181213 4
cpu29:timer 197659 5
cpu30:timer 188190 4
cpu31:timer 189092 4
(Again: from a basically-idle timeframe.)
# sysctl -a | grep hz
kern.clockrate: { hz = 1000, tick = 1000, profhz = 8128, stathz = 127 }
kern.hz: 1000
debug.psm.hz: 20
# sysctl -a | grep 'cpu.*freq'
device cpufreq
debug.cpufreq.verbose: 0
debug.cpufreq.lowest: 0
dev.cpufreq.0.%parent: cpu0
dev.cpufreq.0.%pnpinfo:
dev.cpufreq.0.%location:
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%desc:
dev.cpufreq.%parent:
dev.cpu.0.freq_levels: 3400/-1 2800/-1 2200/-1
dev.cpu.0.freq: 3400
# sysctl kern.timecounter
kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 1
kern.timecounter.invariant_tsc: 1
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: ACPI-fast(900) i8254(0) HPET(950) TSC-low(1000) dummy(-1000000)
kern.timecounter.hardware: TSC-low
kern.timecounter.alloweddeviation: 5
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.counter: 3941257163
kern.timecounter.tc.ACPI-fast.mask: 4294967295
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.counter: 54249
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.HPET.quality: 950
kern.timecounter.tc.HPET.frequency: 14318180
kern.timecounter.tc.HPET.counter: 796737466
kern.timecounter.tc.HPET.mask: 4294967295
kern.timecounter.tc.TSC-low.quality: 1000
kern.timecounter.tc.TSC-low.frequency: 1696860321
kern.timecounter.tc.TSC-low.counter: 3613142695
kern.timecounter.tc.TSC-low.mask: 4294967295
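If I understand TSC-low correctly (the TSC right-shifted by
kern.timecounter.tsc_shift, here 1), the reported frequency is consistent
with the nominal 3400 MHz core clock, the small difference being the
calibrated versus nominal rate:

```python
# Check that TSC-low.frequency is about the nominal core clock >> tsc_shift.
nominal_tsc_hz = 3400 * 10**6   # dev.cpu.0.freq: 3400 MHz
tsc_shift = 1                   # kern.timecounter.tsc_shift
reported = 1696860321           # kern.timecounter.tc.TSC-low.frequency

expected = nominal_tsc_hz >> tsc_shift   # 1700000000
print(abs(expected - reported) / reported < 0.002)  # True: within ~0.2%
```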
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
More information about the freebsd-ppc
mailing list