Re: [RFC/RFT] calloutng

From: Davide Italiano <davide_at_freebsd.org>
Date: Fri, 14 Dec 2012 13:57:36 +0100
On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo <rizzo_at_iet.unipi.it> wrote:
>
> On Fri, Dec 14, 2012 at 12:12 AM, Davide Italiano <davide_at_freebsd.org>
> wrote:
>>
>> Hi.
>> This patch takes callout(9) and redesigns the KPI and the
>> implementation. The main objective of this work is making the
>> subsystem tickless.  In the last several years, this possibility has
>> been discussed widely (http://markmail.org/message/q3xmr2ttlzpqkmae),
>> but until now no one has actually implemented it.
>> If you want a complete history of what has been done in the last
>> months you can check the calloutng project repository
>> http://svnweb.freebsd.org/base/projects/calloutng/
>> For lazy people, here's a summary:
>
>
> thanks for the work and the detailed summary.
> Perhaps it would be useful if you could provide a few high level
> details on the use and performance of the new scheme, such as:
>
> - is the old callout KPI still available ? (i am asking because it would
>   help maintaining third party kernel modules that are expected to
>   work on different FreeBSD releases)
>

Obviously the old KPI is still available. callout(9) is a very popular
interface, and I don't think removing the old interface is a good idea:
it could upset vendors whose code would no longer build on FreeBSD.

> - do you have numbers on what is the fastest rate at which callouts
>   can be fired (e.g. say you have a callout which increments a
>   counter and schedules the next callout in (struct bintime){0,1} ) ?
>
>
> - is there a possibility that if callout requests are too close to each
>   other  (e.g. the above test) the thread dispatching callouts will
>   run forever ? if so, is there a way to make such thread yield
>   after a while ?
>
> - since you mentioned nanosleep() poll() and select() have been
>   ported to the new callout, is there a way to guarantee that user
>   using these functions with a very short timeout are actually
>   descheduled as opposed to "interval too short, don't bother" ?
>
> - do you have numbers on how many calls per second we can
>   have for a process that does
>       for (;;) { nanosleep(min_value_that_causes_descheduling); }
>
> I also have some comments on the diff:
> - can you provide a diff -p ?
>
> - for several functions the only change is the name of an argument
>   from "busy" to "us". Can you elaborate the reason for the change,
>   and whether "us" means microseconds or the pronoun ?
>

Please see r242905 by mav_at_.

> Finally, a more substantial comment:
> - a lot of functions which formerly had only a "timo" argument
>   now have "timo, bt, precision, flags". Take seltdwait() as an example.
>

seltdwait() is not part of the public KPI. It has been modified to
avoid code duplication. Having two separate functions, seltdwait() and
seltdwait_bt(), when they could share most of the code is not a clever
approach, IMHO.
As I said, seltdwait() is not exposed, so we can modify its arguments
without breaking anything.

>   It seems that you have been undecided between two approaches:
>   for some of these functions you have preserved the original function
>   that deals with ticks and introduced a new one that deals with the
>   bintime, whereas in other cases you have modified the original
>   function to add "bt, precision, flags".
>

I'm not. All the functions which are part of the public KPI (e.g.
condvar(9), sleepq(9), etc.) are still available.  *_flags variants have
been introduced so that consumers can take advantage of the new
'precision tolerance' mechanism. Also, *_bt variants have
been introduced. I don't see any indecision between the two
approaches.
Please note that now the callout backend deals with bintime, so every
time callout_reset_on() is called, the 'tick' argument passed is
silently converted to bintime.
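
As a rough illustration of that conversion, here is a user-space sketch.
It is not the code in the branch: struct bt, ticks_to_bt() and bt_add()
are made-up names standing in for struct bintime and its helpers, and
the rounding in the real implementation may differ. It only assumes that
a relative tick count becomes "seconds plus a 64-bit binary fraction" and
is then added to the current uptime to obtain an absolute deadline.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified stand-in for the kernel's struct bintime: whole seconds
 * plus a 64-bit binary fraction of a second. Hypothetical type, for
 * illustration only.
 */
struct bt {
	int64_t  sec;
	uint64_t frac;
};

/* Convert a relative timeout in ticks to a struct bt, given hz ticks/s. */
static struct bt
ticks_to_bt(int ticks, int hz)
{
	struct bt r;

	assert(ticks >= 0 && hz > 0);
	r.sec = ticks / hz;
	/*
	 * UINT64_MAX / hz approximates the binary fraction of one tick
	 * (about 2^64 / hz). The remainder is < hz, so the product
	 * cannot overflow.
	 */
	r.frac = (UINT64_MAX / (uint64_t)hz) * (uint64_t)(ticks % hz);
	return (r);
}

/* Add two bt values, carrying fraction overflow into the seconds. */
static struct bt
bt_add(struct bt a, struct bt b)
{
	struct bt r;

	r.frac = a.frac + b.frac;
	r.sec = a.sec + b.sec + (r.frac < a.frac ? 1 : 0);
	return (r);
}
```

With hz = 1000, ticks_to_bt(1500, 1000) gives one second plus a fraction
of roughly one half; bt_add() of that value and the current uptime would
then be the absolute time handed to the backend.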

>   I would suggest a more uniform approach, namely:
>   - preserve all the existing functions (T) that take a timeout in ticks;
>   - add a new set of corresponding functions (BT) that take
>     bt, precision, flags _instead_ of the ticks
>   - the functions (T) immediately convert from ticks to
>     bintime(s), using macros or inline functions
>   - optionally, convert kernel components to the new (BT) functions
>     where this makes sense (e.g. we can exploit the finer-granularity
>     of the new calls, etc.)
>



> cheers
> luigi
>
>> 1) callout(9) is no longer constrained to the resolution a periodic
>> "hz" clock can give. To achieve that, the eventtimers(4) subsystem
>> is used as backend.
>> 2) Contrary to what was discussed in the past, we kept the callwheel
>> as the underlying data structure for keeping track of the outstanding
>> timeouts. This choice has a couple of advantages; in particular, we
>> still benefit from the O(1) average complexity of the wheel for
>> all the operations. Also, we thought the code duplication that would
>> arise from the use of a two-staged backend for callout (e.g. a wheel
>> for coarse-resolution events and another data structure, such as a
>> heap, for high-resolution events) is unacceptable. In fact, since
>> callout gained the ability to migrate from one CPU to another, having a
>> double backend would mean doubling the code for the migration path.
>> 3) A way to dispatch callouts from hardware interrupt context has
>> been implemented, using a special callout flag. This has limited
>> applicability, but avoids dispatching a SWI thread to handle
>> specific callouts, which in turn avoids waking up another CPU for
>> processing and a (relatively useless) context switch.
>> 4) Since the new callout mechanism deals with bintime rather than
>> ticks, time is specified as absolute rather than relative. To get
>> the current time, binuptime() or getbinuptime() is used, and a
>> sysctl is introduced to choose between the two functions, based
>> on a precision threshold.
>> 5) A mechanism for specifying precision tolerance has been
>> implemented. The callout processing mechanism has been adapted and the
>> callout data structure augmented so that the codepath can take
>> advantage of it and aggregate events which overlap in time.
>>
>>
>> The new proposed KPI for callout is the following:
>> callout_reset_bt_on(..., struct bintime time, struct bintime pr, ..., int
>> flags)
>> where the ‘time’ argument represents the time at which the callout should
>> fire, ‘pr’ represents the precision tolerance expressed as an absolute
>> value, and ‘flags’ can be used to specify new features; for now,
>> that is the possibility to run the callout from fast interrupt
>> context.
>> The old KPI has been extended by introducing the callout_reset_flags()
>> function, which is the same as callout_reset*() but takes an
>> additional argument ‘int flags’ that can be used in the same way
>> as the ‘flags’ argument of the new KPI. Using ‘flags’, consumers
>> can also specify relative precision tolerance as a power-of-two
>> fraction of the timeout passed as ticks.
>> Using this strategy, the new precision mechanism can be used for the
>> existing services without major modifications.
>>
>> Some consumers have been ported to the new KPI, in particular
>> nanosleep(), poll(), select(), because they take immediate advantage
>> of the arbitrary precision offered by the new infrastructure.
>> For some statistics about the outcome of the conversion to the new
>> service, please refer to the end of this e-mail:
>> http://lists.freebsd.org/pipermail/freebsd-arch/2012-July/012756.html
>> We didn't measure any significant performance regressions with
>> hwpmc(4), using some benchmark programs:
>> http://people.freebsd.org/~davide/poll_test/poll_test.c
>> http://people.freebsd.org/~mav/testsleep.c
>> http://people.freebsd.org/~mav/testidle.c
>>
>> We tested the code on amd64, MIPS and arm. Any kind of testing or
>> comment would be really appreciated. The full diff of the work against
>> HEAD can be found at: http://people.freebsd.org/~davide/calloutng.diff
>> If no one has objections, we plan to merge the repository to HEAD in a
>> week or so.
>>
>> Thanks,
>>
>> Davide
>
>
>
>
> --
> -----------------------------------------+-------------------------------
>  Prof. Luigi RIZZO, rizzo_at_iet.unipi.it  . Dip. di Ing. dell'Informazione
>  http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
>  TEL      +39-050-2211611               . via Diotisalvi 2
>  Mobile   +39-338-6809875               . 56122 PISA (Italy)
> -----------------------------------------+-------------------------------
>
Received on Fri Dec 14 2012 - 12:57:44 UTC