a proposed callout API

Matthew Dillon dillon at apollo.backplane.com
Wed Nov 29 11:24:30 PST 2006


    Since nearly all callout_reset() calls use the same relative timeout as
    the previous call, it seems rather polluting to expose the low-level tick
    calculations in the API.  I'll bet if you just *CACHE* the last
    translation it would be sufficient to optimize your callout paths:

    callout_reset(...)
    {
	if (to_ticks == c->last_to_ticks) {
		... use c->last_to_translated_ticks;
	} else {
		... recalculate;
		c->last_to_ticks = to_ticks;
		... cache the result in c->last_to_translated_ticks;
	}
	...
    }

    Insofar as math overhead goes... well, if you REALLY want to make things
    optimal you need to get rid of all those mutex operations you are doing
    in the low level callwheel code.

    I would recommend doing what we did, which is to make the call wheels
    per-cpu and to issue the callout on the same cpu it was registered on.
    Now, granted, DragonFly uses a more cpu-localized design, particularly
    for network operations (which are the vast majority of callout operations
    in the system).  But you should really consider it.  A cpu-localized 
    design replaces all mutexes and spinlocks in the implementation with
    a simple critical section.  Cross-cpu operations use IPI messages (which,
    in DragonFly, very rarely occur since all the callout users are
    cpu-localized).  But assuming you deal with that issue in your network
    stacks, OTHER uses of the callout API are well served by a cpu-localized
    model.  Because re-arming usually occurs FROM the callout callback 
    procedure, which itself is cpu-localized by the callout implementation,
    you again wind up being able to use just a critical section and no
    mutexes or spin locks.

    One mutex or spinlock is worth half a dozen math operations.  Even if
    the locked bus cycle memory location is already owned by the calling
    cpu you still wind up flushing the cpu's read and write pipeline, and
    that is really nasty at the beginning of a procedure when the caller
    of the procedure has just pushed a bunch of arguments onto the stack.

    There is virtually no cache overhead in handling the callwheel due to
    the burstiness effect of the slots, in particular when handling TCP
    connections in bulk.  There is so much locality of reference there
    that for all intents and purposes callout_reset() becomes FREE if you
    can just get rid of the mutexes.

    In any case, network operations are a bad place to use fine-grained
    timeouts.  It just doesn't work well... for example, using a TCP retry
    timeout in the microsecond range almost guarantees a ton of false hits
    due to cpu latency in handling the timeout on a heavily loaded system.
    You need wiggle room, and lost packets just aren't an issue on LANs.

    Similarly, if you want to change tsleep to use a fine-grained value,
    the same rule applies... when tsleep is called with a timeout it is
    almost always called with the same timeout.  But nearly all uses of
    tsleep are insensitive to the granularity of the timeout, and most of
    the remaining uses are not in critical code paths (e.g. a device driver
    resetting some low-level hardware interface or something), so it is
    questionable whether changing the API would reap any visible reward.

    There are a few places where a fine-grained timer is really useful, in
    particular a periodic fine-grained timer.   But don't try to do it 
    with the callout API.  I recommend taking a look at our SYSTIMER API.
    We use it to drive interface polling, the scheduler, the stat clock,
    the hardclock, and to rate-limit interrupts.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>


More information about the freebsd-arch mailing list