[RFC/RFT] calloutng
Luigi Rizzo
rizzo at iet.unipi.it
Wed Jan 2 10:58:35 UTC 2013
On Mon, Dec 31, 2012 at 12:17:27PM +0200, Alexander Motin wrote:
> On 31.12.2012 08:17, Luigi Rizzo wrote:
> >On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:
...
> >>Then I noticed you had a 12_26 patchset so I tested
> >>that (after crudely fixing a couple uninitialized var warnings), and it
> >>all looks good on this arm (Raspberry Pi). I'll attach the results.
> >>
> >>It's so sweet to be able to do precision sleeps.
>
> Thank you for testing, Ian.
>
> >interesting numbers, but there seems to be some problem in computing
> >the exact interval; delays are much larger than expected.
> >
> >In this test, the original timer code used to round to the next multiple
> >of 1 tick and then add another tick (except for the kqueue case),
> >which is exactly what you see in the second set of measurements.
> >
> >The calloutng code however seems to do something odd:
> >in addition to fixed overhead (some 50us, which you can see in
> >the tests for 1us and 300us), all delays seem to be ~10% larger
> >than what is requested, upper bounded to 10ms (note, the
> >numbers are averages so i cannot tell whether all samples are
> >the same or there is some distribution of values).
> >
> >I am not sure if this error is peculiar to the ARM version or also
> >appears on x86/amd64 but I believe it should be fixed.
> >
> >If you look at the results below:
> >
> >1us possibly ok:
> > for very short intervals i would expect some kind
> > of 'reschedule' without actually firing a timer; maybe
> > 50us are what it takes to do a round through the scheduler ?
> >
> >300us probably ok
> > i guess the extra 50-90us are what it takes to do a round
> > through the scheduler
> >
> >1000us borderline (this is the case for poll and kqueue, which are
> > rounded to 1ms)
> > here intervals seem to be increased by 10%, and i cannot see
> > a good reason for this (more below).
> >
> >3000us and above: wrong
> > here again, the intervals seem to be 10% larger than what is
> > requested, perhaps limiting the error to 10-20ms.
> >
> >
> >Maybe the 10% extension results from creating a default 'precision'
> >for legacy calls, but i do not think this is done correctly.
> >
> >First of all, if users do not specify a precision themselves, the
> >automatically generated value should never exceed one tick.
> >
> >Second, the only point of a 'precision' parameter is to merge
> >requests that may be close in time, so if there is already a
> >timer scheduled within [Treq, Treq+precision] i will get it;
> >but if there is no pending timer, then one should schedule it
> >for the requested interval.
> >
> >Davide/Alexander, any ideas ?
>
> All mentioned effects could be explained with implemented logic. 50us at
> 1us is probably sum of minimal latency of the hardware eventtimer on the
> specific platform and some software processing overhead (syscall,
> callout, timecounters, scheduler, etc). At later points system starts to
> noticeably use precision specified by kern.timecounter.alloweddeviation
> sysctl. It affects results from two sides: 1) extending intervals for
> specified percent of time to allow event aggregation, and 2) choosing
> time base between fast getbinuptime() and precise binuptime(). Extending
> interval is needed to aggregate not only callouts with each other, but
> also callouts with other system events, which are impossible to schedule
> in advance. It gives specified relative error, but no more than one CPU
> wakeup period in absolute: for busy CPU (not skipping hardclock() ticks)
> it is 1/hz, for a completely idle one it can be up to 0.5s. The second point
> reduces processing overhead at the cost of an error of up to 1/hz for
> long periods (>(100/allowed)*(1/hz)), when it is used.
i am not sure what you mean by "extending interval", but i believe the
logic should be the following:

- say the user requests a timeout after X seconds with a tolerance of
  D seconds (both X and D are fractional, so they can be short).
  Interpret this as "the system should do its best to generate an
  event between X and X+D seconds"
- convert X to an absolute time, T_X
- if there are any pending events already scheduled between T_X and T_X+D,
  then by definition they are acceptable. Attach the requested timeout
  to the earliest of these events.
- otherwise, schedule an event at time T_X (because there is no valid
  reason to generate a late event, and it makes no sense from an
  energy saving standpoint, either -- see below).

It seems to me that you are instead extending the requested interval
upfront, which causes some gratuitous pessimizations in scheduling
the callout.
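
The four steps listed above can be sketched as a toy model. This is only an
illustration of the proposed logic, not the calloutng code: the class name
TimerQueue, the method schedule(), and the use of integer microseconds and a
sorted Python list are all assumptions made for the example.

```python
import bisect


class TimerQueue:
    """Toy pending-event list (hypothetical, not the kernel data structure)."""

    def __init__(self):
        self.events = []  # absolute fire times in microseconds, kept sorted

    def schedule(self, now_us, x_us, d_us):
        """Request a timeout X after 'now' with tolerance D.

        Returns the absolute time at which the request will fire.
        """
        t_x = now_us + x_us  # convert X to an absolute time T_X
        # Index of the earliest pending event at or after T_X.
        i = bisect.bisect_left(self.events, t_x)
        if i < len(self.events) and self.events[i] <= t_x + d_us:
            # An event already falls in [T_X, T_X+D]: attach to it.
            return self.events[i]
        # Nothing to merge with: schedule exactly at T_X, not later.
        self.events.insert(i, t_x)
        return t_x


q = TimerQueue()
first = q.schedule(now_us=0, x_us=300_000, d_us=30_000)   # empty queue
second = q.schedule(now_us=0, x_us=290_000, d_us=30_000)  # merges with first
third = q.schedule(now_us=0, x_us=100_000, d_us=10_000)   # nothing in range
print(first, second, third)  # 300000 300000 100000
```

Note that the tolerance only ever selects an existing event; a request that
finds nothing in its window is never pushed past T_X.
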

Re. energy savings: the gain in extending the timeout cannot exceed
the value D/X. So while it may make sense to extend a 1us request
to 50us to go (theoretically) from 1M callouts/s to 20K callouts/s,
it is completely pointless from an energy saving standpoint to
introduce a 10ms error on a 300ms request.

(even though i hate the idea that a 1us request defaults to
a 50us delay; but that is hopefully something that can be tuned
in a platform-specific way and perhaps at runtime).
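
The D/X bound on the energy gain can be checked with a little arithmetic.
The sketch below assumes the worst case of a timer that is re-armed back to
back, so the wakeup rate is simply the inverse of the interval; the function
names are made up for the example.

```python
def wakeups_per_second(interval_us):
    """Wakeup rate of a timer re-armed back to back (assumed worst case)."""
    return 1_000_000 / interval_us


def relative_saving(x_us, d_us):
    """Fraction of wakeups avoided by stretching every interval X to X+D.

    This equals D / (X + D), which is always below the bound D / X.
    """
    return 1 - x_us / (x_us + d_us)


# 1us stretched to 50us: from 1M callouts/s down to 20K callouts/s.
short_before = wakeups_per_second(1)
short_after = wakeups_per_second(50)

# 300ms stretched by 10ms: the saving stays under D/X, about 3.3%.
long_saving = relative_saving(300_000, 10_000)
long_bound = 10_000 / 300_000
```

With these numbers the 300ms case saves roughly 3.2% of the wakeups, while
the 1us case saves 98% of them.
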
cheers
luigi
> To get best possible precision kern.timecounter.alloweddeviation sysctl
> can be set to smaller value. Setting it to 0 will effectively disable
> all optimizations, but should give 50us precision in all cases.
>
> >>for t in 1 300 3000 30000 300000 ; do
> >> for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
> >> ./testsleep $t $m
> >> done
> >>done
> >>
> >>
> >>With calloutng_12_26.patch...
> >>
> >> HZ=100 HZ=250 HZ=1000
> >>---------- ---------------- ---------------- ----------------
> >>select 1 55.79 1 50.96 1 61.32
> >>poll 1 1109.46 1 1107.86 1 1114.38
> >>usleep 1 56.33 1 72.90 1 62.78
> >>nanosleep 1 52.66 1 55.23 1 64.23
> >>kqueue 1 1114.23 1 1113.81 1 1121.21
> >>kqueueto 1 65.44 1 71.00 1 75.01
> >>syscall 1 4.70 1 4.45 1 4.55
> >>select 300 355.79 300 357.76 300 362.35
> >>poll 300 1107.85 300 1122.55 300 1115.62
> >>usleep 300 355.28 300 357.28 300 360.79
> >>nanosleep 300 354.49 300 355.82 300 360.62
> >>kqueue 300 1112.57 300 1118.13 300 1117.16
> >>kqueueto 300 375.98 300 378.62 300 395.61
> >>syscall 300 4.41 300 4.45 300 4.54
> >>select 3000 3246.75 3000 3246.74 3000 3252.72
> >>poll 3000 3238.10 3000 3229.12 3000 3250.10
> >>usleep 3000 3242.47 3000 3237.06 3000 3249.61
> >>nanosleep 3000 3238.79 3000 3231.55 3000 3248.11
> >>kqueue 3000 3240.01 3000 3236.07 3000 3247.60
> >>kqueueto 3000 3265.36 3000 3267.22 3000 3274.96
> >>syscall 3000 4.69 3000 4.44 3000 4.50
> >>select 30000 31714.60 30000 31941.17 30000 32467.69
> >>poll 30000 31522.76 30000 31983.00 30000 32497.81
> >>usleep 30000 31459.67 30000 31980.76 30000 32458.71
> >>nanosleep 30000 31431.02 30000 31982.22 30000 32525.20
> >>kqueue 30000 31466.75 30000 31873.90 30000 31973.54
> >>kqueueto 30000 31564.67 30000 32522.35 30000 32475.59
> >>syscall 30000 4.70 30000 4.73 30000 4.89
> >>select 300000 319133.02 300000 311562.33 300000 309918.62
> >>poll 300000 319604.27 300000 311422.94 300000 310000.76
> >>usleep 300000 319314.60 300000 311269.69 300000 309996.34
> >>nanosleep 300000 319497.58 300000 311425.40 300000 309997.13
> >>kqueue 300000 309995.55 300000 303980.27 300000 309908.82
> >>kqueueto 300000 319505.88 300000 311424.97 300000 309996.16
> >>syscall 300000 4.41 300000 4.45 300000 4.89
> >>
> >>
> >>With no patches...
> >>
> >> HZ=100 HZ=250 HZ=1000
> >>---------- ---------------- ---------------- ----------------
> >>select 1 19941.70 1 7989.10 1 1999.16
> >>poll 1 19904.61 1 7987.32 1 1999.78
> >>usleep 1 19904.95 1 7993.30 1 1999.96
> >>nanosleep 1 19905.64 1 7993.71 1 1999.72
> >>kqueue 1 10001.61 1 4004.00 1 1000.27
> >>kqueueto 1 19904.00 1 7993.03 1 1999.54
> >>syscall 1 4.04 1 4.05 1 4.75
> >>select 300 19904.66 300 7998.39 300 2000.27
> >>poll 300 19904.35 300 7993.47 300 1999.86
> >>usleep 300 19903.96 300 7994.11 300 1999.81
> >>nanosleep 300 19904.48 300 7993.77 300 1999.80
> >>kqueue 300 10001.68 300 4004.18 300 1000.31
> >>kqueueto 300 19997.86 300 7993.37 300 1999.59
> >>syscall 300 4.01 300 4.00 300 4.32
> >>select 3000 19904.80 3000 7998.85 3000 3998.43
> >>poll 3000 19904.92 3000 8005.93 3000 3999.39
> >>usleep 3000 19904.50 3000 7992.88 3000 3999.44
> >>nanosleep 3000 19904.84 3000 7993.34 3000 3999.36
> >>kqueue 3000 10001.58 3000 4003.97 3000 3000.72
> >>kqueueto 3000 19903.56 3000 7993.24 3000 3999.34
> >>syscall 3000 4.02 3000 4.37 3000 4.29
> >>select 30000 39905.02 30000 35991.79 30000 31051.77
> >>poll 30000 39905.49 30000 35980.35 30000 30995.64
> >>usleep 30000 39903.78 30000 35979.48 30000 30995.23
> >>nanosleep 30000 39904.55 30000 35981.61 30000 30995.87
> >>kqueue 30000 30002.73 30000 32019.54 30000 30004.83
> >>kqueueto 30000 39903.59 30000 35979.64 30000 30996.05
> >>syscall 30000 4.44 30000 4.04 30000 4.31
> >>select 300000 310001.23 300000 303995.86 300000 300994.30
> >>poll 300000 309902.73 300000 303981.58 300000 300996.17
> >>usleep 300000 309903.64 300000 303980.17 300000 300997.42
> >>nanosleep 300000 309903.32 300000 303980.36 300000 300993.64
> >>kqueue 300000 300002.77 300000 300019.46 300000 300006.90
> >>kqueueto 300000 309903.31 300000 303978.10 300000 300996.84
> >>syscall 300000 4.01 300000 4.04 300000 4.29
>
>
> --
> Alexander Motin