CPU time accounting broken on 8-STABLE machine after a few
hours of uptime
Don Lewis
truckman at FreeBSD.org
Tue Sep 28 21:00:19 UTC 2010
On 28 Sep, Jeremy Chadwick wrote:
> On Tue, Sep 28, 2010 at 10:15:34AM -0700, Don Lewis wrote:
>> My time source is another FreeBSD box with a GPS receiver on my LAN. My
>> other client machine isn't seeing these time jumps. The only messages
>> from ntp in its log from this period are these:
>>
>> Sep 23 04:12:23 mousie ntpd[1111]: kernel time sync status change 6001
>> Sep 23 04:29:29 mousie ntpd[1111]: kernel time sync status change 2001
>> Sep 24 03:55:24 mousie ntpd[1111]: kernel time sync status change 6001
>> Sep 24 04:12:28 mousie ntpd[1111]: kernel time sync status change 2001
>
> I'm speaking purely about ntpd below this point -- almost certainly a
> separate problem/issue, but I'll explain it anyway. I'm not under the
> impression that the calcru messages indicate RTC clock drift, but I'd
> need someone like John Baldwin to validate my statement.
I don't think the problems are directly related. I think the calcru
messages get triggered by clcok frequency changes that get detected and
change the tick to usec conversion ratio.
> Back to ntpd: you can addressing the above messages by adding "maxpoll
> 9" to your "server" lines in ntp.conf. The comment we use in our
> ntp.conf that documents the well-known problem:
Thanks I'll try that.
> # maxpoll 9 is used to work around PLL/FLL flipping, which happens at
> # exactly 1024 seconds (the default maxpoll value). Another FreeBSD
> # user recommended using 9 instead:
> # http://lists.freebsd.org/pipermail/freebsd-stable/2006-December/031512.html
>
>> > I don't know if that has any connection to time(1) running slower -- but
>> > perhaps ntpd is aggressively adjusting your clock?
>>
>> It seems to be pretty stable when the machine is idle:
>>
>> % ntpq -c pe
>> remote refid st t when poll reach delay offset jitter
>> ==============================================================================
>> *gw.catspoiler.o .GPS. 1 u 8 64 377 0.168 -0.081 0.007
>>
>> Not too much degradation under CPU load:
>>
>> % ntpq -c pe
>> remote refid st t when poll reach delay offset jitter
>> ==============================================================================
>> *gw.catspoiler.o .GPS. 1 u 40 64 377 0.166 -0.156 0.026
>>
>> I/O (dd if=/dev/ad6 of=/dev/null bs=512) doesn't appear to bother it
>> much, either.
>>
>> % ntpq -c pe
>> remote refid st t when poll reach delay offset jitter
>> ==============================================================================
>> *gw.catspoiler.o .GPS. 1 u 35 64 377 0.169 -0.106 0.009
>
> Still speaking purely about ntpd:
>
> The above doesn't indicate a single problem. The deltas shown in both
> delay, offset, and jitter are all 100% legitimate. A dd (to induce more
> interrupt use) isn't going to exacerbate the problem (depending on your
> system configuration, IRQ setup, local APIC, etc.).
I was hoping to do something to provoke clock interrupt loss. I don't
see any problems when this machine is idle. The last two times that the
calcru messages have occured where when I booted this machine to build a
bunch of ports.
I don't see any problems when this machine is idle. Offset and jitter
always look really good whenever I've looked.
> How about writing a small shell script that runs every minute in a
> cronjob that does vmstat -i >> /some/file.log? Then when you see calcru
> messages, look around the time frame where vmstat -i was run. Look for
> high interrupt rates, aside from those associated with cpuX devices.
Ok, I'll give this a try. Just for reference, this is what is currently
reported:
% vmstat -i
interrupt total rate
irq0: clk 60683442 1000
irq1: atkbd0 6 0
irq8: rtc 7765537 127
irq9: acpi0 13 0
irq10: ohci0 ehci1+ 10275064 169
irq11: fwohci0 ahc+ 132133 2
irq12: psm0 21 0
irq14: ata0 90982 1
irq15: nfe0 ata1 18363 0
I'm not sure why I'm getting USB interrupts. There aren't any USB
devices plugged into this machine.
# usbconfig dump_info
ugen0.1: <OHCI root HUB nVidia> at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON
ugen1.1: <EHCI root HUB nVidia> at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON
ugen2.1: <OHCI root HUB nVidia> at usbus2, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON
ugen3.1: <EHCI root HUB nVidia> at usbus3, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON
> Next, you need to let ntpd run for quite a bit longer than what you did
> above. Your poll maximum is only 64, indicating ntpd had recently been
> restarted, or that your offset deviates greatly (my guess is ntpd being
> restarted). poll will increase over time (64, 128, 256, 512, and
> usually max out at 1024), depending on how "stable" the clock is. when
> is a counter that increments, and does clock syncing (if needed) once it
> reaches poll. You'd see unstable system clock indications in your
> syslog as well (indicated by actual +/- clock drift lines occurring
> regularly. These aren't the same as 2001/6001 PLL/FLL mode flipping).
> Sorry if this is a bit paragraph/much to take in.
Yes, these readings were shortly after I rebooted the machine. It's
been up a while longer now:
% uptime
1:48PM up 16:44, 2 users, load averages: 0.00, 0.00, 0.00
% ntpq -c pe
remote refid st t when poll reach delay offset jitter
==============================================================================
*gw.catspoiler.o .GPS. 1 u 179 256 377 0.159 0.202 0.053
> You might also try stopping ntpd, removing /var/db/ntpd.drift, and
> restarting ntpd -- then check back in about 48 hours (no I'm not
> kidding). This is especially necessary if you've replaced the
> motherboard or taken the disks from System A and stuck them in System B.
I don't think the problem is the drift file. Most of the time ntp is
really stable. I haven't seen any indication of clock drift that might
be causing ntp to step the clock, but I haven't happened to look in the
midst of port building.
> All that said: I'm not convinced ntpd has anything to do with your
> problem. EIST or EIST-like capabilities (such as Cool'n'Quiet) are
> often the source of the problem. "device cpufreq" might solve your
> issue entirely, hard to say.
My kernel config includes GENERIC, which contains cpufreq. Also,
kern.timecounter.hardware is ACPI-fast, which shouldn't be affected by
CPU clock speed changes.
This shows up in dmesg:
powernow0: <PowerNow! K8> on cpu0
More information about the freebsd-stable
mailing list