svn commit: r236449 - projects/calloutng/sys/kern

Bruce Evans brde at optusnet.com.au
Sat Jun 2 19:28:54 UTC 2012


On Sat, 2 Jun 2012, Alexander Motin wrote:

> On 06/02/12 17:16, Bruce Evans wrote:
>> On Sat, 2 Jun 2012, Davide Italiano wrote:
> ...
>>> Modified: projects/calloutng/sys/kern/kern_timeout.c
>>> ==============================================================================
>>> --- projects/calloutng/sys/kern/kern_timeout.c Sat Jun 2 12:26:14 2012
>>> (r236448)
>>> +++ projects/calloutng/sys/kern/kern_timeout.c Sat Jun 2 13:04:50 2012
>>> (r236449)
>>> @@ -373,9 +373,9 @@ callout_tick(void)
>>> need_softclock = 0;
>>> cc = CC_SELF();
>>> mtx_lock_spin_flags(&cc->cc_lock, MTX_QUIET);
>>> - binuptime(&now);
>>> + getbinuptime(&now);
>>> /*
>>> - * Get binuptime() may be inaccurate and return time up to 1/HZ in
>>> the past.
>>> + * getbinuptime() may be inaccurate and return time up to 1/HZ in the
>>> past.
>>> * In order to avoid the possible loss of one or more events look back
>>> 1/HZ
>>> * in the past from the time we last checked.
>>> */
>> 
>> Up to tc_tick/hz, not up to 1/HZ. tc_tick is the read-only sysctl
>> variable kern.timecounter.tick that is set to make tc_tick/hz as close
>> to 1 msec as possible. If someone asks for busy-waiting by setting
>> HZ to much larger than 1000 and uses this to generate lots of timeouts,
>> they probably get this now, but get*time() cannot be used even to
>> distinguish times differing by the timeout granularity. It is hard to
>> see how it could ever work for the above use (timeout scheduling with
>> a granularity of ~1/hz when you can only see differences of ~tc_tick/hz,
>> with tc_tick quite often 4-10, or 100-1000 to use or test corner
>> cases??). With a tickless kernel, timeouts wouldn't have a fixed
>> granularity, but you need accurate measurements of times even more.
>> One slow way to get them is to call binuptime() again in the above.
>> Another, even worse way is to update the timecounters after every timeout
>> expires (the update has a much higher overhead than binuptime(), so
>> this will be very slow iff timeouts that expire are actually used).
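
(For concreteness, tc_tick ends up near one millisecond worth of hardclock
ticks roughly like this; the function below paraphrases kern_tc.c's
inittimecounter() from memory, so treat it as a sketch rather than the
exact code:)

	#include <stdio.h>

	/* Approximation of how kern.timecounter.tick gets initialized. */
	static int
	tc_tick_for(int hz)
	{
		return (hz > 1000 ? (hz + 500) / 1000 : 1);
	}

	int
	main(void)
	{
		int hzv[] = { 100, 1000, 4000, 10000 };
		unsigned i;

		for (i = 0; i < sizeof(hzv) / sizeof(hzv[0]); i++)
			printf("hz = %5d -> tc_tick = %2d, get*time() step = "
			    "%.3f ms\n", hzv[i], tc_tick_for(hzv[i]),
			    1000.0 * tc_tick_for(hzv[i]) / hzv[i]);
		return (0);
	}

(So with hz = 4000-10000 you get tc_tick = 4-10, and the get*time() family
still only advances about once per millisecond.)
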
>
> I agree with the first part, but could you tell more about tc_windup() 
> complexity? A lot of time has passed since that code was written, CPUs
> got faster, and I have a feeling that the cost of that math may have come
> down and may not be so significant now.

tc_windup() might take relatively less time on faster CPUs, but only if
it is not called more often.  With a tickless kernel, it should be called
less often.  It only needs to be called several times more often than
the hardware timecounter wraps around (1/hz with hz = 100 has a few orders
of magnitude to spare, except with an i8254 timecounter it has at most
a factor of 5 to spare), and perhaps at least once per second for ntp
processing.
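
To put a number on the i8254 case (assuming the full 16-bit count and the
usual 1193182 Hz input clock; the margin is smaller if the counter is
programmed with a lower limit):

	#include <stdio.h>

	int
	main(void)
	{
		double freq = 1193182.0;	/* i8254 input clock (Hz) */
		double wrap = 65536.0 / freq;	/* full 16-bit wrap period */
		double windup = 1.0 / 100.0;	/* tc_windup() every 1/hz, hz = 100 */

		printf("i8254 wraps every %.1f ms; tc_windup() every %.0f ms "
		    "leaves a factor of %.1f\n",
		    wrap * 1e3, windup * 1e3, wrap / windup);
		return (0);
	}

That prints a wrap period of about 54.9 msec and a factor of about 5.5.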

> Maybe at least tc_windup() could be refactored to separate the time updating
> (making its cost closer to a single binuptime() call) from all the other fancy
> (complicated) things? The new eventtimers(4) subsystem uses binuptime() a lot.
> Since we are already reading relatively slow timecounter hardware, it would be
> nice to get some benefit from it.

tc_windup() is hard to refactor.  It depends on not being called very
often for its time domain locking to work.  Note that it has no explicit
locking and not even memory access ordering to ensure that its generation
count is ordered relative to the data that it protects.  I'm not sure how
intentional the latter is, and it seems to be too simple to work in all
cases.  The writes to the generation count are:

 	th->th_generation = 0;
 	/* th is now dead, modulo races */
 	// update *th
 	th->th_generation = ogen;
 	/*
 	 * th is now live, modulo races, but is only reachable via very
 	 * old pointers.  See binuptime().  It takes blocking >= 9/hz
 	 * seconds for the generation count to do anything.
 	 */
 	// irrelevant stuff
 	timehands = th;
 	/*
 	 * th is now live, modulo races.  Now it doesn't take any
 	 * blocking to get the races (just a too-new pointer).
 	 */
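
For reference, the reader side that has to cope with this is roughly the
following (paraphrased from kern_tc.c from memory, so details may be
slightly off):

	void
	binuptime(struct bintime *bt)
	{
		struct timehands *th;
		u_int gen;

		do {
			th = timehands;
			gen = th->th_generation;
			*bt = th->th_offset;
			bintime_addx(bt, th->th_scale * tc_delta(th));
		} while (gen == 0 || gen != th->th_generation);
	}

Nothing here orders the load of the generation count relative to the loads
of the fields it protects either, so the reader depends on the same
implicit ordering as the writer.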

The writer sequence depends on write ordering.  At least amd64 and i386
have strict write ordering (except for write-combining memory).  When I
started writing this, I thought that the time domain generation scheme
was much stronger.  Just keeping readers 1 generation behind the writer
would give the writes 1-10 msec to become visible.  There are 10
generations of timehands to handle 9-90 msec of other problems (mainly so
that the window above, in which the update is in progress, is rarely hit).
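
For what it's worth, here is what the same protocol looks like with the
ordering spelled out, as a toy userland seqlock written with C11 atomics.
This is only an illustration of what a weakly ordered CPU would need, not
the kernel's code: the names (toy_timehands, toy_windup, toy_read) are
made up, it uses 2 slots instead of 10, and it uses C11 fences rather than
the kernel's own atomic primitives.

	#include <stdatomic.h>
	#include <stdint.h>

	struct toy_timehands {
		uint64_t	th_offset;	/* stand-ins for the real fields */
		uint64_t	th_scale;
		atomic_uint	th_generation;
	};

	/* Slot 0 starts out live with generation 1, like th0 in kern_tc.c. */
	static struct toy_timehands ths[2] = { [0] = { .th_generation = 1 } };
	static _Atomic(struct toy_timehands *) timehands = &ths[0];

	/* Writer: the tc_windup() side. */
	static void
	toy_windup(uint64_t offset, uint64_t scale)
	{
		struct toy_timehands *th;
		unsigned ogen;

		/* Work on the slot that is not currently published. */
		th = (atomic_load_explicit(&timehands, memory_order_relaxed) ==
		    &ths[0]) ? &ths[1] : &ths[0];
		ogen = atomic_load_explicit(&th->th_generation,
		    memory_order_relaxed);

		/* Kill th; the fence keeps the field stores below from
		 * becoming visible before the dead generation does. */
		atomic_store_explicit(&th->th_generation, 0,
		    memory_order_relaxed);
		atomic_thread_fence(memory_order_release);

		th->th_offset = offset;
		th->th_scale = scale;

		/* Make the fields visible before the new generation is. */
		if (++ogen == 0)
			ogen = 1;
		atomic_store_explicit(&th->th_generation, ogen,
		    memory_order_release);

		/* Publish the pointer only after th is consistent again. */
		atomic_store_explicit(&timehands, th, memory_order_release);
	}

	/* Reader: the binuptime() side. */
	static uint64_t
	toy_read(void)
	{
		struct toy_timehands *th;
		uint64_t off, scale;
		unsigned gen;

		do {
			th = atomic_load_explicit(&timehands,
			    memory_order_acquire);
			gen = atomic_load_explicit(&th->th_generation,
			    memory_order_acquire);
			off = th->th_offset;
			scale = th->th_scale;
			/* Order the copies above before the recheck below. */
			atomic_thread_fence(memory_order_acquire);
		} while (gen == 0 || gen !=
		    atomic_load_explicit(&th->th_generation,
		    memory_order_relaxed));
		return (off + scale);	/* placeholder for the real arithmetic */
	}

On amd64/i386 these fences amount to little more than compiler barriers,
which is consistent with the existing code getting away with plain stores
there.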

tc_windup() has large software overheads and complexities.  In many
cases the software overheads are much smaller than the hardware
overheads for a single timecounter hardware read, but if you call
it a lot it will need more locking.  Even mutex locking is considered
too expensive for binuptime(), etc.

I think you don't need more than about 0.001-0.01% of bintime()'s
normal accuracy for event timers (100000 parts per million instead of
1-10 ppm).  Hardly anyone noticed when the lapic timecounter was
miscalibrated by 10% for several years on several FreeBSD cluster
machines.  This made all timeouts 10% too short or 10% too long.
If a timeout is 10% too long, then there is no way to recover, but if
it is 10% too short then some places in the kernel that use timeouts,
notably nanosleep(), recover by sleeping for what they think is the
remaining time.  This will probably be 10% short too, leaving 1% of
the original timeout remaining.  Eventually this converges to a timeout
only slightly longer than the original one.  But most important uses of
timeouts are in device drivers.  I think few or no drivers know that
timeouts may be off by 10% or try to recover from this.  They just
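
To spell out that recovery arithmetic (a toy calculation, assuming each
individual sleep comes up 10% short while the loop keeps asking for
whatever it still thinks is owed):

	#include <stdio.h>

	int
	main(void)
	{
		double want = 1.0;	/* requested timeout, arbitrary units */
		double owed = want, slept = 0.0;
		int i;

		for (i = 1; i <= 5; i++) {
			double got = 0.9 * owed;	/* each sleep is 10% short */

			slept += got;
			owed -= got;
			printf("pass %d: slept %.5f of %.1f, %.5f still owed\n",
			    i, slept, want, owed);
		}
		return (0);
	}

Each extra pass presumably also rounds up to the timer granularity and
adds a wakeup, which is where the "slightly longer than the original"
comes from.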

I think the problems are that after a long sleep (or even any CPU
activity that doesn't call a timer function), you don't know what the
time is, and after a short sleep, you don't know if the sleep was
short without determining the time accurately.  All interrupts
may have stopped and even timecounter and cpu_ticks() hardware may
have stopped, so normal methods for determining the time might not
work at all.  But if the sleep was short and shallow then cpu_ticks()
probably works across it, and if the sleep was long then determining
the new time precisely after it is a tiny part of resuming everything.

BTW, AFAIK determining the time precisely, and resuming long timeouts,
are quite broken after long sleeps.  I don't care and haven't really
tested this since I don't have any systems that can suspend properly
under FreeBSD.  But short timeouts can be handled reasonably after
a long sleep either by completing them immediately after the sleep
(with some jitter to avoid thundering herds) or by extending them
by the length of the sleep.  I think the latter is what happens now.
It is what breaks long timeouts.

Bruce

