freebsd 10 kqueue timer regression

Fri Oct 3 16:17:01 UTC 2014

On Fri, 2014-10-03 at 08:50 -0400, John Baldwin wrote:
> On Thursday, October 02, 2014 06:49:49 PM Ian Lepore wrote:
> > On Thu, 2014-10-02 at 16:15 -0600, Ian Lepore wrote:
> > > On Thu, 2014-10-02 at 16:00 -0400, John Baldwin wrote:
> > > > On Thursday, October 02, 2014 3:53:28 pm Ian Lepore wrote:
> > > > > On Thu, 2014-10-02 at 12:47 -0700, Adrian Chadd wrote:
> > > > > > I'm confused; it's doing 50 loops of a 20msec timer, right? So
> > > > > > that's
> > > > 
> > > > 1000ms.
> > > > 
> > > > > Yes, so the entire loop should take 1000ms maybe + 1ms.  Instead it
> > > > > takes 1070.  When I run it on an armv6 system running -current it
> > > > > takes
> > > > > 1050.  When I run it on my 8.4 desktop (pre-eventtimers) it takes
> > > > > 1013.
> > > > > 
> > > > > -- Ian
> > > > 
> > > > What if you set kern.eventtimer.periodic=1?
> > > 
> > > Some interesting results...
> > > 
> > >            HZ   100   500    1000
> > > 
> > > ---------------------------------
> > > periodic=0     1050  1050    1080
> > > periodic=1     1110  1012    1049
> > > 
> > > 
> > > The 1080 number was +/- 3ms, all the other numbers were +/- 1ms (except
> > > for one outlier of 24363 at 100Hz non-periodic which I'm going to
> > > pretend didn't happen).
> > > 
> > > The 1050 numbers are probably each 20ms sleep actually taking 21ms, but
> > > the old tvtohz code with -1 adjustments from the old email thread isn't
> > > in play anymore.  I don't know how to account for the other numbers at
> > > all.  There's all kinds of stuff I don't understand in the new code
> > > involving tick thresholds and such.
> > > 
> > > -- Ian
> > 
> > The attached patch seems to fix the problem in what I think is the most
> > correct way: scheduling the callout with absolute times based on the
> > time the current event was scheduled for plus the requested interval.
> > The net effect should be metronomic events that do not drift (or phase
> > shift if you prefer) over time, regardless of any latency involved in
> > processing the events.
> > 
> > This makes all the numbers in the tests I ran above come out 1000.
> > 
> > It doesn't make me understand the strange results from the prior tests
> > any better.
> > 
> > -- Ian
> 
> Are you running ntpd or ptpd?  If so, perhaps try the original tests without 
> the patch.
> 

Yep, ntpd is running, and I was repeatedly running the test right after
rebooting to change kern.hz.  I realized that's the explanation for the
24-second outlier (which didn't take 24 seconds in reality) -- the
hardware I'm testing on has no RTC, and ntpd must have stepped the clock
24 seconds during the 1 second test loop.

I think ntpd probably explains another anomaly I saw in more testing
after that post last night.  I tested with a 20uS period instead of 20mS
and I kept seeing the loop take ~999810uS.  Initially I thought 190ppm
of clock error was a bit much to explain away easily, then I realized
it's quite likely that ntpd is steering hard to get on-time.  (The
normal drift rate for the timecounter clock on the system is ~18ppm.)

Taking ntpd out of the picture will remove artifacts from its clock
steering, but really then what the test program loop is doing is
measuring a pair of unsteered clocks against each other.  Depending on
the hardware, those clocks may derive from a common oscillator or not
(in which case they are most assuredly keeping time at slightly
different rates, for such is the nature of clocks).

With the hardware I'm testing on, I think I can arrange to have
eventtimers and timecounters both running from the same underlying
oscillator, then we should be truly measuring only the software
behavior.

> That said, I think one of the reasons the old code worked was that the 
> previous callout had the equivalent of the C_HARDCLOCK flag set.  Thus, when 
> the timer interrupt fires and we rescheuled for N ticks, it was actually N 
> ticks - <time since the timer interrupt that triggered this callout> (if that 
> makes sense?).  Now I think what is happening is that that the 
> callout_reset_sbt() is doing 'delta time t + <time since the timer interrupt>' 
> and that that accounts for your numbers.  

I'm lost with this paragraph.  I don't know what you mean by "old
code" (pre-eventtimer?  pre-my-changes?).

I do see stuff in the current code that, under some conditions, mixes in
data involving the prior timecounter tick and also stuff involving
precision values, but I just can't make sense of what it's doing.

> In that case, I wonder if the reason 
> you are seeing such large gaps is that you have Cx states enabled (including 
> C1E) and that those are making that second factor large enough to account for 
> the skew.  Try setting machdep.idle=spin as a test (this is a better test than 
> ntpd/ptpd I think).
> 

I'm testing on ARM hardware with no power management enabled other than
the idle loop executing a wait-for-interrupt instruction that does not
stop the eventtimer and timecounter clocks, and the latency for coming
out of that state is on the order of 1uS.

> The reason I don't like using C_ABSOLUTE is that in theory that will result in 
> longer or shorter intervals if time is adjusted via ntp/ptp, whereas 
> historically EVFILT_TIMER has been CLOCK_UPTIME based instead of 
> CLOCK_REALTIME.
> 

I don't see anything in the code that implies C_ABSOLUTE uses a
different clock.  In fact, it's even documented as being absolute time
since boot.  In callout_reset_sbt_on() if the absolute flag is set the
sbt you pass in is used as-is.  If the flag is not set, the sbt you pass
in is added to either the current uptime or the last tc tick uptime.

For getting a truly periodic timer event, this absolute time technique
is required.  Adding a delta to something that isn't the time the prior
was *scheduled* to occur will result in a phase shift in the event
occurrances over time (phase relative to uptime top-of-second).

-- Ian