instability of timekeeping

Wed Oct 21 08:43:18 UTC 2015

On 20/10/2015 14:05, Andriy Gapon wrote:
> I performed a small observation.  With ntpd disabled I ran `ntpdate -d` at 10
> second intervals in a loop (done via `sleep 10`).  It looks like for about 25
> minutes the time offset between a reference server and my machine was quite
> stable. But then it sort of jumped about 2.5 seconds between two consecutive
> ntpdate invocations.
[snip]
> On 20/10/2015 13:12, Andriy Gapon wrote:
[snip]
>> kern.timecounter.tc.TSC-low.frequency: 1607357461
>> kern.timecounter.tc.TSC-low.counter: 2457319922
>> kern.timecounter.tc.TSC-low.mask: 4294967295
[snip]

Another observation and a hypothesis.

I tried time counters other than TSC and I couldn't reproduce the issue with them.

Another thing that occurred to me is that TSC-low.mask / TSC-low.frequency ≈ 2.7
seconds.

>From these observations and reading some code comes my hypothesis.
First, I assume that visible values of TSCs on different cores are not perfectly
synchronized. Second, I think that there can be circumstances where
tc_ticktock() -> tc_windup() can get called on different cores sufficiently
close in time that the later call would see TSC value which is "before"
(sequentially smaller) than the TSC value read earlier (on the other core).  In
that case the delta between the readings would be close to TSC-low.mask.

Right now I do not have any proof that what the hypothesis says is what actually
happens.  On the other hand, I do not see anything that would prevent the
hypothesized situation from occurring.

To add more weight to the hypothesis:
cpu1:invltlb                      674384          4
cpu1:invlrng                      453020          3
cpu1:invlpg                    108772578        652
cpu1:preempt                    37435000        224
cpu1:ast                           36757          0
cpu1:rendezvous                     9473          0
cpu1:hardclock                  22267434        133
As you can see I am currently running workloads that result in a very
significant number of IPIs, especially the page invalidation IPIs.  Given that
all (x86) IPIs have the same priority (based on their vector numbers) I think
it's plausible that the hardclock IPI could get arbitrarily delayed.

I guess I could add a tracepoint to record deltas that are close to a current
timecounter's mask.

Assuming the hypothesis is correct I see two possible ways to work-around the
problem:

1. Increase tsc_shift, so that the cross-CPU TSC differences are smaller (at the
cost of lower resolution).  That should reduce the risk of seeing "backwards"
TSC values.  Judging from numbers that tools/tools/tscdrift produces I can set
tsc_shift to 7.  The resulting resolution should not be worse than that of HPET
or ACPI-fast counters with the benefit of TSC being much faster to read.

2. Change the code, so that tc_windup() is always called on the same CPU.  E.g.
it could be the BSP or a CPU that receives the actual timer interrupts (as
opposed to the hardclock IPIs).  This should help with the timekeeping, but
won't help with the "jitter" in binuptime() and friends.

3. In tc_delta() somehow detect and filter out "slightly behind" timecounter
readings.  Not sure if this is possible at all.

-- 
Andriy Gapon