instability of timekeeping

Wed Oct 28 16:50:13 UTC 2015

On 27/10/2015 16:04, Konstantin Belousov wrote:
> On Tue, Oct 27, 2015 at 03:50:01PM +0200, Andriy Gapon wrote:
>> On 27/10/2015 13:58, Konstantin Belousov wrote:
>>> On Tue, Oct 27, 2015 at 11:04:47AM +0200, Andriy Gapon wrote:
>>>> And now another observation. I have C1E option enabled in BIOS. It
>>>> means that if all cores enter C1 state, then the whole processor
>>>> is magically placed into a deep C-state (C3, I think). LAPIC timer
>>>> on this CPU model does not run in the deep C-state. So, I had to
>>>> disable C1E option to test the LAPIC timer in a useful way. But
>>>> before actually testing it I first tried to reproduce the problem. As
>>>> you might have already guessed the problem is gone with that option
>>>> disabled. Scratching my head to understand the implications of this
>>>> observation.
>>>
>>> Most obvious explanation would be that the latency of wakeup is very large.
>>> What is the HPET frequency when the jitter occur ?
>>>
>>
>> kern.timecounter.tc.TSC-low.frequency: 1607351869
>> kern.eventtimer.et.HPET.frequency: 14318180
>>
>> Or did you mean the actual rate of timer interrupts?
> 
> Actual rate, and I used the wrong word.  Since most likely your eventtimer
> is not periodic, I mean the next timer interrupt deadline.
> 

I wasn't sure how to answer your question, so I went ahead and obtained a little
bit of debugging information using KTR.
For a start, here is a patch that adds couple more trace points:
https://people.freebsd.org/~avg/timekeeping-ktr.patch
Please note that the patch has a small bug in hardclock_cnt():
+			CTR1(KTR_SPARE2, "hardclock_cnt at %d: newticks=%d",
+			    newticks);
Two format specifiers but only one argument.  I mention this, so that the trace
is not misinterpreted.

Then, here is a snippet of the trace that seems to be of interest:
https://people.freebsd.org/~avg/timekeeping.ktrdump.txt
I must admit that I have a hard time interpreting what that trace says.
As I understand the KTR timestamps are raw TSC values. So, there is a huge gap
between entries #999543 and #999544.  It's larger than 2 * 2^32, so even after
applying tsc_shift it's still larger that 2^32.

I am not sure how the HPET timer could get programmed for such a long delay.
Or maybe there is some lower level problem that caused the interrupt to be
delayed by (16450821032024 - 16442184278422) / 2 / 1607351869 ≈ 2.687 seconds.
That's quite a long time.

One thing that I can tell is that my earlier hypothesis seems to be wrong as all
raw TSC values are increasing.

Grepping for lines where the timer gets reprogrammed I was surprised to see the
following:
999543   1   16442184278422 load at 1:    next 5042.d25692b3 eq 0
999507   3   16442183843960 load at 3:    next 5042.d23e2146 eq 0
999503   2   16442183825488 load at 2:    next 5042.d25692b3 eq 0
999472   0   16442183792870 load at 0:    next 5042.d23e2146 eq 0
This looks like the next event time has flipped between those two values somehow.

Also, the following seemed a little bit strange:
999508   3   16442183860624 ipi  at 3:    now  5042.d2301033
999467   3   16442183787851 active at 3:  now  5042.d2306108
The later even reports a value of 'now' that is smaller.
I think that that can be explained by the fact that in one case the value is
obtained via sbinuptime() while in the other case it's taken from state->now,
which is set when an actual timer interrupt is received (before sending a
hardclock IPI).

In either case I am going to add a few more trace points in et_start and the
HPET timer code and see if I can catch anything interesting there.

-- 
Andriy Gapon