kern/79339: [patch] Kernel time code sync with improvements from DragonFly

Thu Mar 31 02:50:58 PST 2005

On Thu, 31 Mar 2005, Uwe Doering wrote:

> Joshua Coombs wrote:
>>  Testing with wakeup_latency.c on a 5.3-Rel box shows the same symptom set. 
>> I've not yet tested the proposed fix on 5-x.  I will try duplicating this 
>> issue on 6-current as well to nail down the problem scope. 
>
> Please also look at what's actually in DragonFly's CVS repository.  Your PR 
> is based on the original patch, while the code in DragonFly is more 
> sophisticated.  Namely, tvtohz() was split into two functions, tvtohz_low() 
> and tvtohz_high(), which replace the original function depending on the 
> context tvtohz() appears in.
>
> From this I conclude that the original patch is insufficient (likely to break 
> parts of the kernel), and that integrating this improvement into FreeBSD 
> might not be as easy and straightforward as it appears to be at first glance. 
> On the other hand, with some effort it ought to be doable.

Indeed.

Here is a discussion of some of the bugs in the patch:

% >Fix:
% /usr/src/sys/kern/kern_clock.c
% 325c325
% <                       / tick + 1;
% ---
% >                       / tick;
% 328c328
% <                       + ((unsigned long)usec + (tick - 1)) / tick + 1;
% ---
% >                       + ((unsigned long)usec + (tick - 1)) / tick;

This breaks all callers of tvtohz() except the one that is changed in
the patch to expect this API change.  The comment before tvtohz() still
says that tvtohz() adds 1.
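
To make the API change concrete, here is a minimal userland sketch of
the conversion with and without the extra tick.  It is not the kernel's
tvtohz(): overflow handling is omitted, and sketch_tvtohz(), SKETCH_HZ
and sketch_tick are names made up for illustration (HZ = 100 is assumed).

```c
#include <sys/time.h>

/* Hypothetical values for illustration: HZ = 100, so 1 tick = 10000 usec. */
#define SKETCH_HZ 100
static const long sketch_tick = 1000000 / SKETCH_HZ;

/*
 * Convert a timeval to a tick count, rounding up to whole ticks.
 * With add_one != 0 this mimics the documented behaviour (one extra
 * tick on top of the rounded-up count); with add_one == 0 it mimics
 * the behaviour after the patch.  Overflow checks are omitted.
 */
static long
sketch_tvtohz(const struct timeval *tv, int add_one)
{
	long ticks;

	ticks = (long)tv->tv_sec * SKETCH_HZ +
	    ((long)tv->tv_usec + (sketch_tick - 1)) / sketch_tick;
	return (add_one ? ticks + 1 : ticks);
}
```

A caller written for the first variant that silently gets the second
asks for one tick less than it used to, which is how the unchanged
callers break.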

% /usr/src/sys/kern/kern_time.c
% 232c232
% <       int error;
% ---
% >       int error, sleepticks;
% 241a242
% >                 sleepticks = tvtohz(&tv);
% 243c244
% <                   tvtohz(&tv));
% ---
% >                     (sleepticks < 1)? 1 : sleepticks);

This is more or less correct.  1 should be subtracted from tvtohz() in
callers that do a careful comparison of the times before and after
the sleep so that they can tell if the sleep time has completely
expired.

The function here (nanosleep1()) is not quite such a caller.  It does
a sloppy comparison of times, using getnanouptime() instead of
nanouptime().  getnanouptime() has a resolution of 1/ticktock_hz, where
ticktock_hz is approximately min(hz, 1000) (normally just hz), so there
is a possible error of 2/ticktock_hz in the comparison.  I think all
the errors go the same way, so the maximum error is 1/ticktock_hz.
The extra tick added by tvtohz() accidentally compensates for this
error.  Synchronization effects may reduce (or increase?) the error.
The first getnanouptime() is unsynchronized, but ones done just after
timeout returns are synced with clock interrupts, so they give a
fairly accurate time every hz/ticktock_hz hardclock interrupts.
Anyway, if 1 is subtracted from tvtohz(), then nanouptime() should
be used to avoid these errors.
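
The size of the error from a coarse clock can be sketched in userland.
This is only a model, not kernel code: coarse_read() stands in for a
getnanouptime()-style clock that advances in steps of res_usec, and both
function names are hypothetical.

```c
/*
 * Read a clock that only advances in steps of res_usec, given the
 * true time now_usec (microseconds, non-negative).
 */
static long
coarse_read(long now_usec, long res_usec)
{
	return (now_usec - now_usec % res_usec);
}

/*
 * Difference between the elapsed time computed from two coarse
 * readings and the true elapsed time.  In general this lies strictly
 * between -res_usec and +res_usec.
 */
static long
coarse_elapsed_error(long start_usec, long end_usec, long res_usec)
{
	long coarse, exact;

	coarse = coarse_read(end_usec, res_usec) -
	    coarse_read(start_usec, res_usec);
	exact = end_usec - start_usec;
	return (coarse - exact);
}
```

When the second reading is taken just after a tick, as after a timeout,
its own error is near zero, so only the unsynchronized first reading
contributes and the error is one-sided.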

There are many other callers like nanosleep1(): the ones for select(2),
poll(2) and setitimer(2).  These all depend on tvtohz() adding 1 to
ensure that they sleep for the specified interval, and they all do
sloppy comparisons like nanosleep1(), so they all need similar changes
if you want timeouts to be synchronized with 1/HZ second boundaries as
perfectly as possible.

% 252c253,254
% <                               *rmt = ts;
% ---
% >                                 rmt->tv_sec = ts.tv_sec;
% >                                 rmt->tv_nsec = ts.tv_nsec;
% 258c260,261
% <               ts3 = ts;
% ---
% >                 ts3.tv_sec = ts.tv_sec;
% >                 ts3.tv_nsec = ts.tv_nsec;

These changes just introduce style bugs.

% 260a264,265
% >                 if (tv.tv_sec == 0 && tv.tv_usec < tick)
% >                         return (0);

This can't be right.  We have just not-so-carefully checked whether
the time has expired, and only get here when it hasn't.
(tv.tv_sec == 0 && tv.tv_usec < tick) means that we would have preferred
the sleep time to be less than 1 tick.  We had to request a sleep of
exactly 1 tick because less than 1 is impossible (this is with 1
subtracted from tvtohz()).  Sleeping for exactly 1 tick is also
impossible, so we have woken up after an interval of anywhere between
0+epsilon and (1-epsilon+latency) seconds.  The interval may be
significantly smaller or larger than `tv' and we must go back to
sleep if it is smaller.  The above change breaks this.
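
Reduced to the bare decision, the difference looks like this.  The
sketch is a simplification of the loop in nanosleep1(), not a copy of
it; both function names are hypothetical, and remaining_usec is the
requested end time minus the current time.

```c
/*
 * Decision without the added check: go back to sleep whenever any
 * of the requested interval remains.
 */
static int
resleep_without_check(long remaining_usec)
{
	return (remaining_usec > 0);
}

/*
 * Decision with the added check: return as soon as less than one
 * tick remains, even if we woke up almost immediately.
 */
static int
resleep_with_check(long remaining_usec, long tick_usec)
{
	return (remaining_usec >= tick_usec);
}
```

With a 10000 usec tick and 9000 usec still to go, the first function
says to sleep again while the second returns to the caller 9000 usec
early.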

I think the problem that this change is supposed to fix is related to
the tick frequency not being an exact multiple of 1/HZ.  Also, to avoid
sleeping longer than necessary, we should try to wake up 1 tick early
and then decide whether to sleep another tick or 2 to finish.  Note
that although tvtohz() always rounds up, physical sleep intervals are
always shorter than the specified timeout, so waking up 1 tick early
is very common for unsynchronized sleeps.  Thus if we subtract 1 from
tvtohz(), we often wake up 1 tick early as a side effect, which is what
we want, but there is a problem: suppose that everything is in
perfect sync, but the hardclock interrupt period is slightly less
than 1/HZ seconds.  Then we may wake up 5 usec or so early and decide
to go back to sleep, giving a large error.  Changes later in the patch
are related to this.  I think we shouldn't do anything special here
except possibly return early if `tv' is very small.

Going around the loop in nanosleep1() an extra time is a small
pessimization.  Using nanouptime() to get the decision of whether to
loop right is a pessimization too, but it is relatively small.

% /usr/src/sys/i386/isa/clock.c
% 113c113,114
% < #define       TIMER_DIV(x) ((timer_freq + (x) / 2) / (x))
% ---
% > #define TIMER_DIV(x) (timer_freq / (x))
% > #define FRAC_ADJUST(x) (timer_freq - ((timer freq / (x)) * (x)))

Reducing TIMER_DIV() unconditionally would be harmless under FreeBSD.
Its rounding to nearest dates from when there was little more than hardclock
ticks for timekeeping.  Now HZ and the hardclock interrupt frequency
are almost unrelated to timekeeping.
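
For a feel of the numbers, the two divisor calculations can be compared
in userland.  The sketch assumes the nominal i8254 input clock of
1193182 Hz; the kernel's timer_freq is a variable and may be set
differently, and the function names here are made up.

```c
/* Assumed nominal i8254 input clock in Hz; timer_freq may differ. */
#define SKETCH_TIMER_FREQ 1193182UL

/* Divisor rounded to nearest, as in the existing TIMER_DIV(). */
static unsigned long
div_nearest(unsigned long hz)
{
	return ((SKETCH_TIMER_FREQ + hz / 2) / hz);
}

/* Divisor rounded down, as in the patched TIMER_DIV(). */
static unsigned long
div_truncated(unsigned long hz)
{
	return (SKETCH_TIMER_FREQ / hz);
}

/* Input clock cycles per second left over after rounding down. */
static unsigned long
frac_adjust(unsigned long hz)
{
	return (SKETCH_TIMER_FREQ - (SKETCH_TIMER_FREQ / hz) * hz);
}
```

At hz = 100 the two divisors are 11932 and 11931, with 82 cycles per
second left over; at hz = 1000 both divisors are 1193 and 182 cycles
are left over.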

% 141a143
% > u_int   timer0_frac_freq;
% 204a207,209
% >         int phase;
% >         int delta;
% >
% 215a221,236
% >
% >         phase = 1000000 / timer0_frac_freq;
% >         delta = timecounter->tc_microtime.tv_usec % phase;

tc_microtime.tv_usec is not quite the right thing to use here.  It is
updated every tick or two so it might be up to date, but it has
unnecessary jitter.  microtime() would give a more accurate timestamp.
I think microtime() and not microuptime() is the correct function to
use here, since we want to sync with the real time.  OTOH, nanosleep1()
and friends use the uptime, so they must be looked at some more to
determine the effects of using different time scales on syncing.  I
think the synchronization done here is honored by nanosleep1() despite
the different scales, and sync is only lost when the clock is changed
using settimeofday() (then everything gets out of sync).

% > #if 1
% >       disable_intr();

The clock should be read inside this critical section.

% >         if (delta < (phase >> 1)) {
% >                 outb(TIMER_CNTR0, timer0_max_count & 0xff);
% >                 outb(TIMER_CNTR0, timer0_max_count >> 8);
% >         } else {
% >                 outb(TIMER_CNTR0, (timer0_max_count +1) & 0xff);
% >                 outb(TIMER_CNTR0, (timer0_max_count +1) >> 8);
% >                 ++i8254_offset;
% >         }

I think i8254_offset needs to be reinitialized every time the maximum
count is reprogrammed.  This is not done in set_timer_freq(); however,
most callers of set_timer_freq() initialize or update the i8254
timecounter immediately after, and testing shows that this reduces
lost ticks to an acceptable value (usually, and hopefully always < 10).
Correctly reprogramming the i8254 on every interrupt is harder.  Losing
even 1 tick per interrupt is too much, but I think the above can
sometimes lose 100 (if clkintr() is delayed for that long, which can
easily happen especially in RELENG_4 since clkintr() is not a fast
interrupt handler there).  See nearby code that calls
i8254_get_timecount() inside a critical section for a way to reduce
the error to at most 5 ticks.  It takes about 5 ticks just to read the
counter.  This is still far too large to do on every clock tick.  All
of this only matters if the i8254 is used for timekeeping.

% >       enable_intr();
% > #endif
% >
% 236a258
% >                 timer0_frac_freq = new_rate;
% 247,248c269,270
% <               if ((timer0_prescaler_count += timer0_max_count)
% <                   >= hardclock_max_count) {
% ---
% >                 timer0_prescaler_count += timer0_max_count;
% >                 if (timer0_prescaler_count >= hardclock_max_count) {

This change is just to style.

% 689a712
% >         timer0_frac_freq = intr_freq;

The changes seem to be too simple to give a PLL.  I didn't check the details
for this.

% 1221c1244
% <       count = timer0_max_count - ((high << 8) | low);
% ---
% >         count = timer0_max_count + 1 - ((high << 8) | low);

Always adding 1 here seems to be wrong.  Shouldn't you only add 1 if
timer0_max_count isn't actually the max count, i.e., when the max count
has been programmed to be 1 more than usual?  All references to
timer0_max_count are potentially wrong when timer0_max_count isn't
actually the max count.  You add 1 to i8254_offset in the above; this
seems to be to adjust for 1 of the references being wrong, but it doesn't
seem to adjust for `count' being 1 too large.

% A sawtooth is still present, but the accuracy is MUCH better.  I suspect
% my hack application of the PLL function isn't correct or my P133 is slow
% enough that I'm observing some other latencies.  I have observed
% occasional negative offsets, which according to the article are strictly
% forbidden by RFCs, so please check my work.  I believe they were the
% result of my playing with a hz value too high for the machine to
% reasonably handle, and are not occurring with saner values for hz.

I only agree with the non-hardware changes (don't sleep for an extra
tick in nanosleep1() and friends if this is easy to avoid).  All that
perfect sync of real time with the hardclock() clock gives is the
possibility of waking up on precisely 1/HZ boundaries relative to real
time (with whole seconds being boundaries).  System activity lengthens
sleeps by indeterminate amounts except on unloaded systems.  The average
error for a random sleep on an unloaded system would still be 0.5/HZ
(or 1.5/HZ without the nanosleep1() change).
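
The 0.5/HZ figure can be checked with a small userland calculation.
This is an idealized model and nothing more: it assumes wakeups happen
exactly on tick boundaries with no latency, that the sleeper wakes at
the first boundary at or after its target time, and that the start of
the sleep is uniformly distributed within a tick.  The function name is
hypothetical.

```c
/*
 * Average overshoot, in ticks, of a sleep of sleep_usec that may
 * start at any microsecond phase within a tick of tick_usec and ends
 * at the first tick boundary at or after the target time.
 */
static double
sketch_avg_overshoot(long tick_usec, long sleep_usec)
{
	long phase, target, wake, total;

	total = 0;
	for (phase = 0; phase < tick_usec; phase++) {
		target = phase + sleep_usec;
		/* Round the target up to the next tick boundary. */
		wake = (target + tick_usec - 1) / tick_usec * tick_usec;
		total += wake - target;
	}
	return ((double)total / tick_usec / tick_usec);
}
```

With a 10000 usec tick the result is just under 0.5 ticks regardless of
the requested length, which is the best that tick-driven wakeups can do
on average under these assumptions.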

Bruce