Re: [RFC/RFT] calloutng

From: Davide Italiano <>
Date: Fri, 14 Dec 2012 13:57:36 +0100
On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo <> wrote:
> On Fri, Dec 14, 2012 at 12:12 AM, Davide Italiano <>
> wrote:
>> Hi.
>> This patch takes callout(9) and redesigns both the KPI and the
>> implementation. The main objective of this work is making the
>> subsystem tickless.  In the last several years, this possibility has
>> been discussed widely, but until now no one has actually implemented
>> it.
>> If you want a complete history of what has been done in the last
>> months you can check the calloutng project repository
>> For lazy people, here's a summary:
> thanks for the work and the detailed summary.
> Perhaps it would be useful if you could provide a few high level
> details on the use and performance of the new scheme, such as:
> - is the old callout KPI still available ? (i am asking because it would
>   help maintaining third party kernel modules that are expected to
>   work on different FreeBSD releases)

Obviously the old KPI is still available. callout(9) is a very popular
interface and I don't think removing the old interface is a good idea,
because it could make some vendors unhappy when their code no longer
builds on FreeBSD.

> - do you have numbers on what is the fastest rate at which callouts
>   can be fired (e.g. say you have a callout which increments a
>   counter and schedules the next callout in (struct bintime){0,1} ) ?
> - is there a possibility that if callout requests are too close to each
>   other  (e.g. the above test) the thread dispatching callouts will
>   run forever ? if so, is there a way to make such thread yield
>   after a while ?
> - since you mentioned nanosleep() poll() and select() have been
>   ported to the new callout, is there a way to guarantee that user
>   using these functions with a very short timeout are actually
>   descheduled as opposed to "interval too short, don't bother" ?
> - do you have numbers on how many calls per second we can
>   have for a process that does
>       for (;;) { nanosleep(min_value_that_causes_descheduling); } ?
> I also have some comments on the diff:
> - can you provide a diff -p ?
> - for several functions the only change is the name of an argument
>   from "busy" to "us". Can you elaborate the reason for the change,
>   and whether "us" means microseconds or the pronoun ?)

Please see r242905 by mav_at_.

> Finally, a more substantial comment:
> - a lot of functions which formerly had only a "timo" argument
>   now have "timo, bt, precision, flags". Take seltdwait() as an example.

seltdwait() is not part of the public KPI. It has been modified to
avoid code duplication. Having two separate functions, seltdwait() and
seltdwait_bt(), even though they could share most of the code is not a
clever approach, IMHO.
As I said before, seltdwait() is not exposed, so we can modify its
arguments without breaking anything.

>   It seems that you have been undecided between two approaches:
>   for some of these functions you have preserved the original function
>   that deals with ticks and introduced a new one that deals with the
>   bintime, whereas in other cases you have modified the original
>   function to add "bt, precision, flags".

I'm not. All the functions which are part of the public KPI (e.g.
condvar(9), sleepq(9), *) are still available.  *_flags variants have
been introduced so that consumers can take advantage of the new
'precision tolerance mechanism' implemented. Also, *_bt variants have
been introduced. I don't see any "indecision" between the two approaches.
Please note that now the callout backend deals with bintime, so every
time callout_reset_on() is called, the 'tick' argument passed is
silently converted to bintime.
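
To make the conversion concrete, here is a minimal userland sketch of how a
relative timeout in ticks could be turned into a bintime interval and then
into an absolute deadline. This is not the code from the patch:
struct bintime_sketch, ticks_to_bintime() and bintime_add_sketch() are
hypothetical names used only for illustration, and hz is passed explicitly
instead of being read from the kernel global.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userland stand-in for a bintime value: whole seconds plus a 64-bit
 * binary fraction of a second (frac / 2^64). Hypothetical type.
 */
struct bintime_sketch {
	int64_t  sec;
	uint64_t frac;
};

/* Convert a relative timeout in ticks into an interval, given the tick rate. */
static struct bintime_sketch
ticks_to_bintime(int ticks, int hz)
{
	struct bintime_sketch bt;
	uint64_t frac_per_tick;

	assert(ticks >= 0 && hz > 0);
	bt.sec = ticks / hz;
	/* Roughly 2^64 / hz, rounded down so the product below cannot overflow. */
	frac_per_tick = UINT64_MAX / (uint64_t)hz;
	bt.frac = (uint64_t)(ticks % hz) * frac_per_tick;
	return (bt);
}

/* Add an interval to a point in time, carrying fraction overflow into sec. */
static struct bintime_sketch
bintime_add_sketch(struct bintime_sketch a, struct bintime_sketch b)
{
	uint64_t old_frac = a.frac;

	a.frac += b.frac;
	if (a.frac < old_frac)
		a.sec++;		/* the fraction wrapped around */
	a.sec += b.sec;
	return (a);
}
```

With hz = 1000, ticks_to_bintime(2500, 1000) yields 2 seconds plus a
fraction of roughly half a second; adding that to the current uptime gives
the absolute time at which the callout should fire.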

>   I would suggest a more uniform approach, namely:
>   - preserve all the existing functions (T) that take a timeout in ticks;
>   - add a new set of corresponding functions (BT) that take
>     bt, precision, flags _instead_ of the ticks
>   - the functions (T) make immediately the conversion from ticks to
>     bintime(s), using macros or inline
>   - optionally, convert kernel components to the new (BT) functions
>     where this makes sense (e.g. we can exploit the finer-granularity
>     of the new calls, etc.)

> cheers
> luigi
>> 1) callout(9) is no longer constrained to the resolution a periodic
>> "hz" clock can give. In order to do that, the eventtimers(4) subsystem
>> is used as backend.
>> 2) Contrary to what was discussed in the past, we kept the callwheel
>> as the underlying data structure for keeping track of the outstanding
>> timeouts. This choice has a couple of advantages; in particular, we
>> still benefit from the O(1) average complexity of the wheel for
>> all the operations. Also, we thought the code duplication that would
>> arise from the use of a two-staged backend for callout (e.g. a wheel
>> for coarse resolution events and another data structure, such as a
>> heap, for high resolution events) is unacceptable. In fact, since
>> callout gained the ability to migrate from one CPU to another, having a
>> double backend would mean doubling the code for the migration path.
>> 3) A way to dispatch callouts directly from hardware interrupt context
>> has been implemented, using a special callout flag. This has limited
>> applicability, but avoids dispatching a SWI thread for handling
>> specific callouts, which in turn avoids waking up another CPU for
>> processing and a (relatively useless) context switch.
>> 4) Since the new callout mechanism deals with bintime and no longer
>> with ticks, time is specified as absolute and no longer as relative. In
>> order to get the current time binuptime() or getbinuptime() is used, and
>> a sysctl is introduced to selectively choose the function to use, based
>> on a precision threshold.
>> 5) A mechanism for specifying precision tolerance has been
>> implemented. The callout processing mechanism has been adapted and the
>> callout data structure augmented so that the codepath can take
>> advantage of it and aggregate events which overlap in time.
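
As an illustration of point 5 only, here is a small userland sketch of how
tolerance windows allow events to share one timer interrupt. It assumes
each callout may fire anywhere in the window [time, time + pr], expresses
time in nanoseconds rather than bintime for brevity, and uses a simple
greedy grouping; struct callout_sketch and count_wakeups() are hypothetical
names, and this is not the algorithm used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a pending callout. */
struct callout_sketch {
	int64_t time_ns;	/* earliest acceptable firing time */
	int64_t pr_ns;		/* tolerance: may fire up to time_ns + pr_ns */
};

/*
 * Count how many timer wakeups are needed when events whose windows share
 * a common instant are served together. The array must be sorted by
 * time_ns in ascending order.
 */
static size_t
count_wakeups(const struct callout_sketch *ev, size_t n)
{
	size_t i, wakeups = 0;
	int64_t latest = 0;	/* latest instant acceptable to the whole group */

	for (i = 0; i < n; i++) {
		int64_t end = ev[i].time_ns + ev[i].pr_ns;

		assert(ev[i].pr_ns >= 0);
		assert(i == 0 || ev[i - 1].time_ns <= ev[i].time_ns);
		if (wakeups == 0 || ev[i].time_ns > latest) {
			/* No overlap with the current group: new wakeup. */
			wakeups++;
			latest = end;
		} else if (end < latest) {
			/* Joins the group and narrows the shared window. */
			latest = end;
		}
	}
	return (wakeups);
}
```

For example, events at 100, 120, 140 and 400 ns with tolerances of 50, 10,
100 and 0 ns need three wakeups under this grouping, while the same events
with zero tolerance need four.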
>> The new proposed KPI for callout is the following:
>> callout_reset_bt_on(..., struct bintime time, struct bintime pr, ..., int
>> flags)
>> where the ‘time’ argument represents the time at which the callout should
>> fire, ‘pr’ represents the precision tolerance expressed as an absolute
>> value, and ‘flags’, which could be used to specify new features, i.e.
>> for now, the possibility to run the callout from fast interrupt
>> context.
>> The old KPI has been extended introducing the callout_reset_flags()
>> function, which is the same as callout_reset*(), but takes an
>> additional argument ‘int flags’ that can be used in the same way
>> as the ‘flags’ argument for the new KPI. Using the ‘flags’ consumers
>> can also specify relative precision tolerance in terms of power-of-two
>> portion of the timeout passed as ticks.
>> Using this strategy, the new precision mechanism can be used for the
>> existing services without major modifications.
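
A sketch of the power-of-two idea, under an assumed encoding: the low bits
of 'flags' hold a shift count plus one, and the tolerance is the timeout
shifted right by that count. SK_PREL_MASK, SK_PREL() and
relative_precision_ticks() are hypothetical names made up for this example;
the actual flag layout in the patch may differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical encoding: the low 5 bits of flags hold (shift + 1);
 * a value of 0 means "no relative precision requested".
 */
#define SK_PREL_MASK	0x1f
#define SK_PREL(shift)	((shift) + 1)

/* Return the tolerance, in ticks, as timo / 2^shift. */
static int64_t
relative_precision_ticks(int64_t timo_ticks, int flags)
{
	int enc = flags & SK_PREL_MASK;

	assert(timo_ticks >= 0);
	if (enc == 0)
		return (0);		/* caller did not ask for any slack */
	return (timo_ticks >> (enc - 1));
}
```

With this encoding, a 1000-tick timeout passed with SK_PREL(3) tolerates a
delay of up to 1000 / 8 = 125 ticks, so an existing consumer only has to
add a flag to opt in.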
>> Some consumers have been ported to the new KPI, in particular
>> nanosleep(), poll(), select(), because they take immediate advantage
>> of the arbitrary precision offered by the new infrastructure.
>> For some statistics about the outcome of the conversion to the new
>> service, please refer to the end of this e-mail:
>> We didn't measure any significant performance regressions with
>> hwpmc(4), using some benchmark programs:
>> We tested the code on amd64, MIPS and arm. Any kind of testing or
>> comment would be really appreciated. The full diff of the work against
>> HEAD can be found at:
>> If no one has objections, we plan to merge the repository to HEAD in a
>> week or so.
>> Thanks,
>> Davide
> --
> -----------------------------------------+-------------------------------
>  Prof. Luigi RIZZO,  . Dip. di Ing. dell'Informazione
>        . Universita` di Pisa
>  TEL      +39-050-2211611               . via Diotisalvi 2
>  Mobile   +39-338-6809875               . 56122 PISA (Italy)
> -----------------------------------------+-------------------------------
Received on Fri Dec 14 2012 - 12:57:44 UTC