[RFC/RFT] calloutng

Wed Jan 2 11:24:37 UTC 2013

On 02.01.2013 12:57, Luigi Rizzo wrote:
> On Mon, Dec 31, 2012 at 12:17:27PM +0200, Alexander Motin wrote:
>> On 31.12.2012 08:17, Luigi Rizzo wrote:
>>> On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:
> ...
>>>> Then I noticed you had a 12_26 patchset so I tested
>>>> that (after crudely fixing a couple uninitialized var warnings), and it
>>>> all looks good on this arm (Raspberry Pi).  I'll attach the results.
>>>>
>>>> It's so sweet to be able to do precision sleeps.
>>
>> Thank you for testing, Ian.
>>
>>> interesting numbers, but there seems to be some problem in computing
>>> the exact interval; delays are much larger than expected.
>>>
>>> In this test, the original timer code used to round to the next multiple
>>> of 1 tick and then add another tick (except for the kqueue case),
>>> which is exactly what you see in the second set of measurements.
>>>
>>> The calloutng code however seems to do something odd:
>>> in addition to fixed overhead (some 50us, which you can see in
>>> the tests for 1us and 300us), all delays seem to be ~10% larger
>>> than what is requested, upper bounded to 10ms (note, the
>>> numbers are averages so i cannot tell whether all samples are
>>> the same or there is some distribution of values).
>>>
>>> I am not sure if this error is peculiar to the ARM version or also
>>> appears on x86/amd64 but I believe it should be fixed.
>>>
>>> If you look at the results below:
>>>
>>> 1us 	possibly ok:
>>> 	for very short intervals i would expect some kind
>>> 	of 'reschedule' without actually firing a timer; maybe
>>> 	50us are what it takes to do a round through the scheduler ?
>>>
>>> 300us	probably ok
>>> 	i guess the extra 50-90us are what it takes to do a round
>>> 	through the scheduler
>>>
>>> 1000us	borderline (this is the case for poll and kqueue, which are
>>> 	rounded to 1ms)
>>> 	here intervals seem to be increased by 10%, and i cannot see
>>> 	a good reason for this (more below).
>>>
>>> 3000us and above: wrong
>>> 	here again, the intervals seem to be 10% larger than what is
>>> 	requested, perhaps limiting the error to 10-20ms.
>>>
>>>
>>> Maybe the 10% extension results from creating a default 'precision'
>>> for legacy calls, but i do not think this is done correctly.
>>>
>>> First of all, if users do not specify a precision themselves, the
>>> automatically generated value should never exceed one tick.
>>>
>>> Second, the only point of a 'precision' parameter is to merge
>>> requests that may be close in time, so if there is already a
>>> timer scheduled within [Treq, Treq+precision] i will get it;
>>> but if there is no pending timer, then one should schedule it
>>> for the requested interval.
>>>
>>> Davide/Alexander, any ideas ?
>>
>> All of the mentioned effects can be explained by the implemented logic.
>> 50us at 1us is probably the sum of the minimal latency of the hardware
>> eventtimer on the specific platform and some software processing
>> overhead (syscall, callout, timecounters, scheduler, etc). At later
>> points the system starts to noticeably use the precision specified by
>> the kern.timecounter.alloweddeviation sysctl. It affects results in two
>> ways: 1) extending intervals by the specified percentage of time to
>> allow event aggregation, and 2) choosing the time base between fast
>> getbinuptime() and precise binuptime(). Extending the interval is needed
>> to aggregate not only callouts with each other, but also callouts with
>> other system events, which are impossible to schedule in advance. It
>> gives the specified relative error, but no more than one CPU wakeup
>> period in absolute terms: for a busy CPU (not skipping hardclock()
>> ticks) it is 1/hz, for a completely idle one it can be up to 0.5s. The
>> second point reduces processing overhead at the cost of an error of up
>> to 1/hz for long periods (>(100/allowed)*(1/hz)), when it is used.
>
> i am not sure what you mean by "extending interval", but i believe the
> logic should be the following:
>
> - say user requests a timeout after X seconds and with a tolerance of D seconds
>    (both X and D are fractional, so they can be short).  Interpret this as
>
>     "the system should do its best to generate an event between X and X+D seconds"
>
> - convert X to an absolute time, T_X
>
> - if there are any pending events already scheduled between T_X and T_X+D,
>    then by definition they are acceptable. Attach the requested timeout
>    to the earliest of these events.

All of the above is true, but not what follows.

> - otherwise, schedule an event at time T_X (because there is no valid
>    reason to generate a late event, and it makes no sense from an
>    energy saving standpoint, either -- see below).

The system may have many interrupt sources besides the timer: network,
disk, ... WiFi cards generate interrupts at the AP beacon rate -- dozens
of times per second. It is not very efficient to wake the CPU precisely
at T_X, which may be just 100us earlier than the next hardware interrupt.
That's why timer interrupts are scheduled at min(T_X+D, 0.5s, next
hardclock, next statclock, ...). As a result, the event will be handled
within the allowed range, but the real delay will depend on current
conditions.
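
To make the min() rule above concrete, here is a minimal sketch in plain
userland C. It is not the calloutng code: the function name
pick_timer_deadline(), the usec_t type and the "known" array (standing in
for next hardclock, next statclock and similar already-known wakeups) are
all hypothetical, and times are simplified to absolute microseconds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Absolute time in microseconds since an arbitrary epoch (illustrative). */
typedef int64_t usec_t;

/* The 0.5s cap on how far ahead the timer may be programmed. */
#define IDLE_CAP_US 500000

/*
 * Hypothetical helper: return the time at which the next timer interrupt
 * would be programmed for a callout requested at t_x with tolerance d,
 * i.e. min(t_x + d, now + 0.5s, every already-known future wakeup).
 */
usec_t
pick_timer_deadline(usec_t now, usec_t t_x, usec_t d,
    const usec_t *known, size_t nknown)
{
	usec_t deadline;
	size_t i;

	assert(d >= 0);

	/* Start from the latest acceptable time for this callout. */
	deadline = t_x + d;

	/* Never program the timer further ahead than 0.5s. */
	if (now + IDLE_CAP_US < deadline)
		deadline = now + IDLE_CAP_US;

	/* Prefer any future wakeup that is already going to happen sooner. */
	for (i = 0; i < nknown; i++) {
		if (known[i] > now && known[i] < deadline)
			deadline = known[i];
	}
	return (deadline);
}
```

With now = 0, T_X = 1000us, D = 100us and a known wakeup at 1050us, this
returns 1050, so the callout shares that wakeup; with no other wakeups
known it returns 1100 (T_X+D). If the result is earlier than T_X, the
callout is presumably not yet due at that interrupt and the timer would
be programmed again; that step is not shown here.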

> It seems to me that you are instead extending the requested interval
> upfront, which causes some gratuitous pessimizations in scheduling
> the callout.
>
> Re. energy savings: the gain in extending the timeout cannot exceed
> the value D/X. So while it may make sense to extend a 1us request
> to 50us to go (theoretically) from 1M callouts/s to 20K callouts/s,
> it is completely pointless from an energy saving standpoint to
> introduce a 10ms error on a 300ms request.

I am not so sure about this. When a CPU package is in the C7 sleep state
with all buses and caches shut down and memory set to self refresh, it
consumes very little power (a few milliwatts). Waking up from that state
takes 100us or even more, with power consumption much higher than in
normal operation. Sure, if we compare it with the power consumption at
100% CPU load, the difference between 10 and 100 wakeups per second may
be small, but when the two are compared with each other in a low-power
environment on a mostly idle system, it may be much more significant.

> (even though i hate the idea that a 1us request defaults to
> a 50us delay; but that is hopefully something that can be tuned
> in a platform-specific way and perhaps at runtime).

It is 50us on this ARM. On SandyBridge Core i7 it is only about 2us.

-- 
Alexander Motin