ntpd struggling to keep up - how to fix?

Mon Feb 22 11:41:07 UTC 2010

On Mon, Feb 22, 2010 at 10:18:10PM +1100, Peter Jeremy wrote:
> On 2010-Feb-22 01:02:54 -0800, perryh at pluto.rain.com wrote:
> >Peter Jeremy <peterjeremy at acm.org> wrote:
> >
> >> ... Once ntpd decides to continuously step, something is broken.
> >
> >Is there some reason why, as long as it is not yet synced, ntpd
> >should not do this sort of calculation and rate correction itself
> >rather than insist on having a human perform the calculation and
> >enter the adjustment?
> 
> ntpd _does_ do this sort of calculation but the NTP algorithms
> bound the PLL adjustment to +/-500ppm.  RFC1305 suggests that
> a reasonable tolerance for "board-mounted, uncompensated quartz-
> crystal oscillators" is 100ppm and therefore the +/-500ppm bound
> is reasonable (see the RFC for the gory maths).
> 
> In this case, the OP's clock was ~2500ppm slow - well outside
> the NTP tolerance.  It was therefore necessary to change the
> nominal timecounter frequency to bring it into lock range.  I
> do not believe it is reasonable for ntpd to do this by itself:
> - It should very rarely be needed since NTP should be able to
>   compensate for normal tolerances.
> - The actual local clock source and how to alter the kernel's
>   idea of its nominal frequency is outside the purview of NTP.
> - Giving ntpd free rein over the timecounter frequency runs
>   the real risk of ntpd rendering the system unusable if ntpd
>   becomes confused (or is misled) about the time.
> 
> Note that FreeBSD/i386 and /amd64 include 4 different possible
> timecounters, only 3 of which can be tweaked.  Other FreeBSD
> architectures will have different timecounters.  Other OSs may
> have completely different mechanisms for handling the local
> clock source.  Trying to embed knowledge of all these different
> clock sources into ntpd would be unrealistic.
> 
> I look after over 100 assorted Unix hosts at home and work (HP
> AlphaServers and Proliants, various Sun servers, Dell and whitebox PCs
> and various laptops) and the worst drift rates I have seen previously
> are:
> - Sun T-2000 servers have a design flaw in the clock spectrum
>   spreading so it appears to be ~250ppm fast.  Sun fixed this
>   with a kernel patch that increases the nominal clock frequency.
> - A Sun V20z is just over 100ppm out - I have tweaked the
>   relevant timecounter to compensate for this (to avoid triggering
>   my NTP frequency error alarms).
> - 4 assorted Sun hosts that run 55-60ppm out.
> 
> At least based on my sample, the only hosts that were anywhere near
> ntpd's tolerance limits were acknowledged to have a design problem
> and the vendor provided a fix.  IMO, this is a better approach than
> trying to make ntpd omniscient.
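
To make the arithmetic in the quoted text concrete, here is a rough sketch
of the calculation a human would do before retuning a timecounter. It
assumes the whole error comes from the counter's real frequency differing
from the nominal value the kernel uses, and it uses the i8254's usual
1193182 Hz purely as an illustration (the thread does not say which
timecounter the OP was using). The function names are made up for this
example.

```python
# Sketch only: check an observed clock error against the +/-500ppm bound
# quoted above, and compute a replacement nominal counter frequency.

NTP_MAX_PPM = 500.0  # PLL adjustment bound mentioned in the thread


def within_lock_range(error_ppm: float) -> bool:
    """True if ntpd could absorb this error without outside help."""
    return abs(error_ppm) <= NTP_MAX_PPM


def corrected_nominal_hz(nominal_hz: float, error_ppm: float) -> int:
    """Nominal frequency that cancels the observed error.

    Convention (an assumption of this sketch): error_ppm > 0 means the
    clock runs fast, error_ppm < 0 means it runs slow. A slow clock means
    the counter really ticks fewer times per second than the kernel
    believes, so the nominal value has to be lowered.
    """
    return round(nominal_hz * (1.0 + error_ppm / 1e6))


I8254_NOMINAL_HZ = 1_193_182  # illustrative counter, not taken from the thread
observed_ppm = -2500.0        # "~2500ppm slow", as in the quoted text

print(within_lock_range(observed_ppm))                       # False
print(corrected_nominal_hz(I8254_NOMINAL_HZ, observed_ppm))  # 1190199
```

With the nominal value corrected this way, the residual error should be
small enough for ntpd's own frequency adjustment to handle.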

A question with regard to the latter systems you mentioned (though I'm
speaking generally, not specifically about those H/W models), as I want
to make sure I understand correctly:

Under normal operation (drift within +/-500ppm), ntpd "figures out" on
its own the average amount of drift, which is what ntpd.drift is for,
correct?
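
For illustration, a minimal sketch of how one might sanity-check a drift
value. It assumes the drift file holds a single number, the frequency
offset in ppm, as its first field; on FreeBSD the file is commonly
/var/db/ntpd.drift, but that path is an assumption here, so the sketch
works on sample strings instead of reading a file. The helper names and
the sample values are hypothetical.

```python
# Sketch only: classify a drift value against the figures quoted in this
# thread (the +/-500ppm bound and the 100ppm tolerance from RFC1305).

NTP_MAX_PPM = 500.0       # PLL adjustment bound
TYPICAL_TOLERANCE = 100.0  # tolerance suggested for uncompensated crystals


def parse_drift(text: str) -> float:
    """Return the first whitespace-separated field as a ppm value."""
    return float(text.split()[0])


def describe_drift(ppm: float) -> str:
    """Rough classification of a frequency offset in ppm."""
    if abs(ppm) >= NTP_MAX_PPM:
        return "at or beyond the +/-500ppm bound"
    if abs(ppm) > TYPICAL_TOLERANCE:
        return "within the bound, but above the 100ppm tolerance"
    return "within normal tolerance"


# Made-up sample contents, not measurements from any real host.
for sample in ("57.5\n", "-250.0\n", "500.000\n"):
    ppm = parse_drift(sample)
    print(f"{ppm:+.3f} ppm: {describe_drift(ppm)}")
```

A value sitting at the bound would suggest ntpd has run out of adjustment
range, which is the situation described earlier in the thread.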

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |