CPU time accounting broken on 8-STABLE machine after a few hours of uptime

Tue Sep 28 17:43:27 UTC 2010

On Tue, Sep 28, 2010 at 10:15:34AM -0700, Don Lewis wrote:
> My time source is another FreeBSD box with a GPS receiver on my LAN.  My
> other client machine isn't seeing these time jumps.  The only messages
> from ntp in its log from this period are these:
> 
> Sep 23 04:12:23 mousie ntpd[1111]: kernel time sync status change 6001
> Sep 23 04:29:29 mousie ntpd[1111]: kernel time sync status change 2001
> Sep 24 03:55:24 mousie ntpd[1111]: kernel time sync status change 6001
> Sep 24 04:12:28 mousie ntpd[1111]: kernel time sync status change 2001

I'm speaking purely about ntpd below this point -- almost certainly a
separate problem/issue, but I'll explain it anyway.  I'm not under the
impression that the calcru messages indicate RTC clock drift, but I'd
need someone like John Baldwin to validate my statement.

Back to ntpd: you can addressing the above messages by adding "maxpoll
9" to your "server" lines in ntp.conf.  The comment we use in our
ntp.conf that documents the well-known problem:

# maxpoll 9 is used to work around PLL/FLL flipping, which happens at
# exactly 1024 seconds (the default maxpoll value).  Another FreeBSD
# user recommended using 9 instead:
# http://lists.freebsd.org/pipermail/freebsd-stable/2006-December/031512.html

> > I don't know if that has any connection to time(1) running slower -- but
> > perhaps ntpd is aggressively adjusting your clock?
> 
> It seems to be pretty stable when the machine is idle:
> 
> % ntpq -c pe
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
> *gw.catspoiler.o .GPS.            1 u    8   64  377    0.168   -0.081   0.007
> 
> Not too much degradation under CPU load:
> 
> % ntpq -c pe
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
> *gw.catspoiler.o .GPS.            1 u   40   64  377    0.166   -0.156   0.026
> 
> I/O (dd if=/dev/ad6 of=/dev/null bs=512) doesn't appear to bother it
> much, either.
> 
> % ntpq -c pe
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
> *gw.catspoiler.o .GPS.            1 u   35   64  377    0.169   -0.106   0.009

Still speaking purely about ntpd:

The above doesn't indicate a single problem.  The deltas shown in both
delay, offset, and jitter are all 100% legitimate.  A dd (to induce more
interrupt use) isn't going to exacerbate the problem (depending on your
system configuration, IRQ setup, local APIC, etc.).

How about writing a small shell script that runs every minute in a
cronjob that does vmstat -i >> /some/file.log?  Then when you see calcru
messages, look around the time frame where vmstat -i was run.  Look for
high interrupt rates, aside from those associated with cpuX devices.

Next, you need to let ntpd run for quite a bit longer than what you did
above.  Your poll maximum is only 64, indicating ntpd had recently been
restarted, or that your offset deviates greatly (my guess is ntpd being
restarted).  poll will increase over time (64, 128, 256, 512, and
usually max out at 1024), depending on how "stable" the clock is.  when
is a counter that increments, and does clock syncing (if needed) once it
reaches poll.  You'd see unstable system clock indications in your
syslog as well (indicated by actual +/- clock drift lines occurring
regularly.  These aren't the same as 2001/6001 PLL/FLL mode flipping).
Sorry if this is a bit paragraph/much to take in.

You might also try stopping ntpd, removing /var/db/ntpd.drift, and
restarting ntpd -- then check back in about 48 hours (no I'm not
kidding).  This is especially necessary if you've replaced the
motherboard or taken the disks from System A and stuck them in System B.

All that said: I'm not convinced ntpd has anything to do with your
problem.  EIST or EIST-like capabilities (such as Cool'n'Quiet) are
often the source of the problem.  "device cpufreq" might solve your
issue entirely, hard to say.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |