select/poll/usleep precision on FreeBSD vs Linux vs OSX

Thu Mar 1 03:14:18 UTC 2012

On Thu, 1 Mar 2012, Luigi Rizzo wrote:

> On Thu, Mar 01, 2012 at 11:33:46AM +1100, Bruce Evans wrote:
>> On Wed, 29 Feb 2012, Luigi Rizzo wrote:
>>>         |            Actual timeout (usec)           |
>>>         |          select          |  poll  | usleep |
>>> timeout |  FBSD  | Linux  |  OSX   |  FBSD  |  FBSD  |
>>> usec    |  9.0   |  Vbox  |  10.6  |  9.0   |  9.0   |
>>> --------+--------+--------+--------+--------+--------+
>>>       1 |   2000 |     99 |      6 |      0 |   2000 |
>>>      10 |   2000 |    109 |     15 |      0 |   2000 |
>>>      50 |   2000 |    149 |     66 |      0 |   2000 |
>>>     100 |   2000 |    196 |    133 |      0 |   2000 |
>>>     500 |   2000 |    597 |    617 |      0 |   2000 |
>>>    1000 |   2000 |   1103 |   1136 |   2000 |   2000 |
>>>    1001 |   3000 |   1103 |   1136 |   2000 |   3000 | <---
>>>    1500 |   3000 |   1608 |   1631 |   2000 |   3000 | <---
>>>    2000 |   3000 |   2096 |   2127 |   3000 |   3000 |
>>>    2001 |   4000 |        |        |   3000 |   4000 | <---
>>>    3001 |   5000 |        |        |   4000 |   5000 | <---
>>>
>>> Note how the rounding (poll has the timeout in milliseconds) affects
>>
>> You must have synced with timer interrupts to get the above.  Timeouts
>
> Yes, I have -- the test code does almost nothing after returning from
> a select; on a system that does some amount of work, times could be
> up to 1000us shorter. Still a huge error on short timeouts.

I get the sync but not the rounded timeouts, on my ~5.2 kernel with
HZ = 100.  The times are typically 19900-19993 us, slightly under the
20000 us that results from rounding a 1 us timeout up to 2 ticks.
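
For anyone who wants to repeat the measurement, here is a minimal sketch
of how the actual sleep time of select() could be timed.  This is not
Luigi's test code; the helper name measure_select_us() and the choice of
CLOCK_MONOTONIC are my assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/select.h>
#include <time.h>

/*
 * Hypothetical helper: ask select() to sleep for timeout_us with no
 * file descriptors and return how long the call actually took, in
 * microseconds (truncated).
 */
long measure_select_us(long timeout_us)
{
    struct timespec t0, t1;
    struct timeval tv;
    long long ns;

    tv.tv_sec = timeout_us / 1000000;
    tv.tv_usec = timeout_us % 1000000;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    select(0, NULL, NULL, NULL, &tv);   /* pure sleep: nothing to watch */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Work in nanoseconds so a negative tv_nsec difference is handled. */
    ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
       + (t1.tv_nsec - t0.tv_nsec);
    return (long)(ns / 1000);
}
```

Calling this in a loop right after each return keeps the caller synced
with the timer, which is the condition under which the numbers in this
thread were obtained.  Averaging many calls hides the jitter.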

> I should also comment that these are average values on an otherwise
> idle system -- I will try to post a histogram of the actual values;
> it might well be that OSX and Linux have quantized values very
> different from the average (though this would violate the specs,
> so I suspect instead that they have some cheap one-shot timers).
>
> For FreeBSD I have also rounded the values (actual averages are
> -1/+3us over 1sec experiments).

Oh.  The jitter is of minor interest, and rounding to usec should show
an average of slightly less than the timeout rounded up to ticks (on
an unloaded system).

Bakul Shah confirmed that Linux now reprograms the timer.  It has to,
for a tickless kernel.  FreeBSD reprograms timers too.  I think you
can set HZ large and only get timeout interrupts at that frequency if
there are active timeouts that need them.  Timeout granularity is still
1/HZ.
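
To make the 1/HZ granularity concrete, the following is a small model
of the rounding discussed in this thread: round the request up to whole
ticks, then add one more tick.  It is only a model, not the kernel's
conversion code; the function names are made up for illustration, and
HZ=1000 for the FreeBSD 9.0 columns is my assumption (it is the value
under which the model reproduces them).

```c
/*
 * Model only: ticks charged for a timeout of timeout_us at a given hz.
 * Rounds up to a whole number of ticks and adds one extra tick.
 * Assumes hz divides 1000000 evenly.
 */
long timeout_ticks(long timeout_us, long hz)
{
    long tick_us = 1000000 / hz;

    return (timeout_us + tick_us - 1) / tick_us + 1;
}

/* Sleep time in microseconds that the model predicts. */
long expected_sleep_us(long timeout_us, long hz)
{
    return timeout_ticks(timeout_us, hz) * (1000000 / hz);
}
```

With hz = 1000 this gives 2000 for requests of 1 through 1000 us, 3000
for 1001 through 2000, and 4000 for 2001, matching the select and usleep
columns.  With hz = 100 a 1 us request becomes 2 ticks, or 20000 us.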

Hmm, this may explain why you are getting exact n000's.  Every time
you ask for a timeout, you get one n000 us later (on a near-idle machine
where nothing else is asking for many timeouts).  Old kernels instead
give timeouts on perfectly periodic n000(+error) boundaries, so when
the syscall is made just after a boundary, the boundary for the timeout
is never a full n000 away.  There may be a lot of jitter for both, but
if the reprogramming of the timer when you ask for a new timeout is
too smart, then the jitter will average out to 0, giving perfect n000's.

Try running multiple sources of new timeouts.  I think a periodic
itimer should produce perfectly periodic ones with little overhead.
Then other timeouts should not change the periodicity or even
reprogram the timer.
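
A rough sketch of such a periodic source, using setitimer(ITIMER_REAL)
and counting SIGALRM deliveries.  The function name and the use of
sigsuspend() to wait are my choices for illustration; interval_us must
be greater than 0, since a zero value disarms the timer.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

static volatile sig_atomic_t alarms;    /* SIGALRM deliveries so far */

static void on_alarm(int sig)
{
    (void)sig;
    alarms++;
}

/*
 * Hypothetical helper: arm a periodic ITIMER_REAL with the given
 * interval, wait for n deliveries, and return the elapsed time in
 * microseconds.
 */
long run_periodic_itimer_us(long interval_us, int n)
{
    struct sigaction sa, old_sa;
    struct itimerval it, stop = { {0, 0}, {0, 0} };
    sigset_t block, old_mask, wait_mask;
    struct timespec t0, t1;
    long long ns;

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGALRM, &sa, &old_sa);

    /* Keep SIGALRM blocked except inside sigsuspend(), so that a
     * signal arriving between the test and the wait is not lost. */
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    sigprocmask(SIG_BLOCK, &block, &old_mask);
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGALRM);

    it.it_interval.tv_sec = interval_us / 1000000;
    it.it_interval.tv_usec = interval_us % 1000000;
    it.it_value = it.it_interval;       /* first expiry after one period */

    alarms = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    setitimer(ITIMER_REAL, &it, NULL);
    while (alarms < n)
        sigsuspend(&wait_mask);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Disarm, then restore the mask before the old handler, so a late
     * pending SIGALRM is still caught by on_alarm(). */
    setitimer(ITIMER_REAL, &stop, NULL);
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGALRM, &old_sa, NULL);

    ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
       + (t1.tv_nsec - t0.tv_nsec);
    return (long)(ns / 1000);
}
```

If the itimer really is periodic, the elapsed time for n deliveries
should stay close to n * interval_us no matter what other timeouts are
requested meanwhile.  Note that expirations which occur while the
process is not running coalesce into one delivery.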

Reprogramming on demand seems to give unwanted aperiodicity: you ask for
a delay of 1 and get 2000.  Suppose you actually want 2000, and actually
get it relative to the request time.  Then the timer must be interrupting
aperiodically, with an average period of 2000 + overhead (say 2 us),
possibly with large jitter.  So 500 of these take 1 second plus 1000 us,
plus any jitter (the jitter may be negative, but is most likely positive,
since when the process setting up the timeouts is preempted and nothing
else is setting them up, there may be a large additional delay).
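
The arithmetic, ignoring jitter, written out as a trivial sketch.  The
function names are made up, and the 2 us overhead is just the guess
used above.

```c
/*
 * Total time for n back-to-back relative timeouts, each one requested
 * overhead_us after the previous one fired.  Jitter is ignored.
 */
long relative_total_us(long n, long period_us, long overhead_us)
{
    return n * (period_us + overhead_us);
}

/* Drift against an ideal periodic schedule of n * period_us. */
long relative_drift_us(long n, long period_us, long overhead_us)
{
    return relative_total_us(n, period_us, overhead_us) - n * period_us;
}
```

For n = 500, period_us = 2000 and overhead_us = 2 the total is 1001000
us, so the drift is 1000 us per nominal second.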

I try to avoid this problem in my version of ping.  I try to send a packet
on every 1-second boundary.  Normal ping tries to send one 1 second after
the previous one, but it can't do this since it has overheads and gets
preempted.  With HZ=100 and rounding up and adding 1, the drift is likely
to be 20 msec every second, or 2%.  This is quite a lot.  My version tries
to schedule a timeout that expires exactly 1 second after the previous
packet was sent, not 1 second after the current time.  It takes a simple
subtraction to determine the timeout to reach the next seconds boundary,
but determining the times to subtract seems to require an extra
gettimeofday() call.  I should use a periodic itimer and depend on it
actually being periodic.  The kernel must do similar things to keep
periodic itimers actually periodic after it reprograms timers.  There
may be a lot of jitter on each reprogramming, but this can be compensated
for on average.  OTOH, as for skewing clocks, the compensation shouldn't
go too fast in either direction.  This could get complicated.  I don't
know what -current actually does.
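
The subtraction is roughly the following.  This is only a sketch of the
idea and not the code from my ping; the helper name and its arguments
are invented for illustration, and `now` is what the extra
gettimeofday() call would supply.

```c
#include <sys/time.h>

/*
 * Hypothetical helper: microseconds left until interval_us after
 * last_send, measured from now.  Returns 0 if that time has already
 * passed, so the caller sends immediately instead of sleeping.
 */
long usec_until_next_send(const struct timeval *last_send,
                          const struct timeval *now,
                          long interval_us)
{
    long long elapsed_us;

    elapsed_us = (long long)(now->tv_sec - last_send->tv_sec) * 1000000
               + (now->tv_usec - last_send->tv_usec);
    if (elapsed_us >= interval_us)
        return 0;
    return (long)(interval_us - elapsed_us);
}
```

The result goes into the timeout of the next select().  Because the
timeout shrinks by whatever time was spent on overhead or lost to
preemption, that time is not added to every period the way it is when
a full interval is requested each time.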

Bruce