CPU time accounting broken on 8-STABLE machine after a few hours of uptime

Tue Sep 28 21:00:19 UTC 2010

On 28 Sep, Jeremy Chadwick wrote:
> On Tue, Sep 28, 2010 at 10:15:34AM -0700, Don Lewis wrote:
>> My time source is another FreeBSD box with a GPS receiver on my LAN.  My
>> other client machine isn't seeing these time jumps.  The only messages
>> from ntp in its log from this period are these:
>> 
>> Sep 23 04:12:23 mousie ntpd[1111]: kernel time sync status change 6001
>> Sep 23 04:29:29 mousie ntpd[1111]: kernel time sync status change 2001
>> Sep 24 03:55:24 mousie ntpd[1111]: kernel time sync status change 6001
>> Sep 24 04:12:28 mousie ntpd[1111]: kernel time sync status change 2001
> 
> I'm speaking purely about ntpd below this point -- almost certainly a
> separate problem/issue, but I'll explain it anyway.  I'm not under the
> impression that the calcru messages indicate RTC clock drift, but I'd
> need someone like John Baldwin to validate my statement.

I don't think the problems are directly related.  I think the calcru
messages get triggered by clcok frequency changes that get detected and
change the tick to usec conversion ratio.

> Back to ntpd: you can addressing the above messages by adding "maxpoll
> 9" to your "server" lines in ntp.conf.  The comment we use in our
> ntp.conf that documents the well-known problem:

Thanks I'll try that.

> # maxpoll 9 is used to work around PLL/FLL flipping, which happens at
> # exactly 1024 seconds (the default maxpoll value).  Another FreeBSD
> # user recommended using 9 instead:
> # http://lists.freebsd.org/pipermail/freebsd-stable/2006-December/031512.html
> 
>> > I don't know if that has any connection to time(1) running slower -- but
>> > perhaps ntpd is aggressively adjusting your clock?
>> 
>> It seems to be pretty stable when the machine is idle:
>> 
>> % ntpq -c pe
>>      remote           refid      st t when poll reach   delay   offset  jitter
>> ==============================================================================
>> *gw.catspoiler.o .GPS.            1 u    8   64  377    0.168   -0.081   0.007
>> 
>> Not too much degradation under CPU load:
>> 
>> % ntpq -c pe
>>      remote           refid      st t when poll reach   delay   offset  jitter
>> ==============================================================================
>> *gw.catspoiler.o .GPS.            1 u   40   64  377    0.166   -0.156   0.026
>> 
>> I/O (dd if=/dev/ad6 of=/dev/null bs=512) doesn't appear to bother it
>> much, either.
>> 
>> % ntpq -c pe
>>      remote           refid      st t when poll reach   delay   offset  jitter
>> ==============================================================================
>> *gw.catspoiler.o .GPS.            1 u   35   64  377    0.169   -0.106   0.009
> 
> Still speaking purely about ntpd:
> 
> The above doesn't indicate a single problem.  The deltas shown in both
> delay, offset, and jitter are all 100% legitimate.  A dd (to induce more
> interrupt use) isn't going to exacerbate the problem (depending on your
> system configuration, IRQ setup, local APIC, etc.).

I was hoping to do something to provoke clock interrupt loss.  I don't
see any problems when this machine is idle.  The last two times that the
calcru messages have occured where when I booted this machine to build a
bunch of ports.

I don't see any problems when this machine is idle.  Offset and jitter
always look really good whenever I've looked.

> How about writing a small shell script that runs every minute in a
> cronjob that does vmstat -i >> /some/file.log?  Then when you see calcru
> messages, look around the time frame where vmstat -i was run.  Look for
> high interrupt rates, aside from those associated with cpuX devices.

Ok, I'll give this a try.  Just for reference, this is what is currently
reported:
% vmstat -i
interrupt                          total       rate
irq0: clk                       60683442       1000
irq1: atkbd0                           6          0
irq8: rtc                        7765537        127
irq9: acpi0                           13          0
irq10: ohci0 ehci1+             10275064        169
irq11: fwohci0 ahc+               132133          2
irq12: psm0                           21          0
irq14: ata0                        90982          1
irq15: nfe0 ata1                   18363          0

I'm not sure why I'm getting USB interrupts.  There aren't any USB
devices plugged into this machine.

# usbconfig dump_info
ugen0.1: <OHCI root HUB nVidia> at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON

ugen1.1: <EHCI root HUB nVidia> at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON

ugen2.1: <OHCI root HUB nVidia> at usbus2, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON

ugen3.1: <EHCI root HUB nVidia> at usbus3, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON

> Next, you need to let ntpd run for quite a bit longer than what you did
> above.  Your poll maximum is only 64, indicating ntpd had recently been
> restarted, or that your offset deviates greatly (my guess is ntpd being
> restarted).  poll will increase over time (64, 128, 256, 512, and
> usually max out at 1024), depending on how "stable" the clock is.  when
> is a counter that increments, and does clock syncing (if needed) once it
> reaches poll.  You'd see unstable system clock indications in your
> syslog as well (indicated by actual +/- clock drift lines occurring
> regularly.  These aren't the same as 2001/6001 PLL/FLL mode flipping).
> Sorry if this is a bit paragraph/much to take in.

Yes, these readings were shortly after I rebooted the machine.  It's
been up a while longer now:

% uptime
 1:48PM  up 16:44, 2 users, load averages: 0.00, 0.00, 0.00
% ntpq -c pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*gw.catspoiler.o .GPS.            1 u  179  256  377    0.159    0.202   0.053

> You might also try stopping ntpd, removing /var/db/ntpd.drift, and
> restarting ntpd -- then check back in about 48 hours (no I'm not
> kidding).  This is especially necessary if you've replaced the
> motherboard or taken the disks from System A and stuck them in System B.

I don't think the problem is the drift file.  Most of the time ntp is
really stable.  I haven't seen any indication of clock drift that might
be causing ntp to step the clock, but I haven't happened to look in the
midst of port building.

> All that said: I'm not convinced ntpd has anything to do with your
> problem.  EIST or EIST-like capabilities (such as Cool'n'Quiet) are
> often the source of the problem.  "device cpufreq" might solve your
> issue entirely, hard to say.

My kernel config includes GENERIC, which contains cpufreq.  Also,
kern.timecounter.hardware is ACPI-fast, which shouldn't be affected by
CPU clock speed changes.

This shows up in dmesg:
	powernow0: <PowerNow! K8> on cpu0