select/poll/usleep precision on FreeBSD vs Linux vs OSX

Wed Feb 29 20:55:20 UTC 2012

On 29.02.2012 21:40, Luigi Rizzo wrote:
> I have always been annoyed by the fact that FreeBSD rounds timeouts
> in select/usleep/poll in very conservative ways, so I decided to
> see how other systems behave in this respect. Attached is a simple
> program that you should be able to compile and run on various OSes
> to see what happens.
>
> Here are the results (HZ=1000 on the system under test, and FreeBSD
> has the same behaviour since at least 4.11):
>
>  timeout |         Actual timeout (usec)         |
>   (usec) |        select         | poll | usleep |
>          | FBSD  | Linux | OSX   | FBSD | FBSD   |
>          | 9.0   | Vbox  | 10.6  | 9.0  | 9.0    |
> ---------+-------+-------+-------+------+--------+
>        1 |  2000 |    99 |     6 |    0 |   2000 |
>       10 |  2000 |   109 |    15 |    0 |   2000 |
>       50 |  2000 |   149 |    66 |    0 |   2000 |
>      100 |  2000 |   196 |   133 |    0 |   2000 |
>      500 |  2000 |   597 |   617 |    0 |   2000 |
>     1000 |  2000 |  1103 |  1136 | 2000 |   2000 |
>     1001 |  3000 |  1103 |  1136 | 2000 |   3000 | <---
>     1500 |  3000 |  1608 |  1631 | 2000 |   3000 | <---
>     2000 |  3000 |  2096 |  2127 | 3000 |   3000 |
>     2001 |  4000 |       |       | 3000 |   4000 | <---
>     3001 |  5000 |       |       | 4000 |   5000 | <---
>
>
> Note how the rounding (poll has the timeout in milliseconds) affects
> the actual timeouts when you are past multiples of 1/HZ.
>
> I know that until we have some hi-res interrupt source there is no
> hope to have better than 1/HZ granularity. However we are doing
> much worse by adding up to 2 extra ticks. This makes apps less
> responsive than they could be, and gives us no way to
> "yield until the next tick".
>
> So what I would like to do is add a sysctl (disabled by
> default) that enables a better approximation of the desired delay.
>
> I see in the kernel that all three syscalls loop around a blocking
> function (tsleep or seltdwait), and do check the "actual" elapsed
> time by calling getmicrouptime() or getnanouptime() around the
> sleeping function. So the actual timeout passed to tsleep does
> not really matter (as long as it is greater than 0).
>
> The only concern is that getmicrouptime()/getnanouptime() are documented
> as "less precise, but faster to obtain". The question is how precise is
> "less precise": do we have some way to get an upper bound for the
> precision of the timers used in get*time(), so we can use that value
> in the equation instead of the extra 1/HZ that tvtohz() puts in
> after computing floor(timeout*HZ)?

"less precise" there means they are updated on hardclock() invocation 
every 1/HZ.
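
The FreeBSD select and usleep columns in the table are consistent with
the request being rounded up to whole ticks and then given one more
tick. The C sketch below is only a model of that arithmetic under this
assumption; it is not the kernel's tvtohz(), and the names
model_sleep_usec and model_poll_usec are made up here. The poll variant
additionally assumes the microsecond request is truncated to
milliseconds before the call, and that a zero timeout returns at once.

```c
#include <stdint.h>

/*
 * Hypothetical model, not kernel code: round the request up to whole
 * ticks, then add one extra tick because the sleep begins somewhere
 * inside the current tick.
 */
static uint64_t
model_sleep_usec(uint64_t timeout_usec, uint64_t hz)
{
	uint64_t tick_usec = 1000000 / hz;	/* length of one tick */
	uint64_t ticks = (timeout_usec + tick_usec - 1) / tick_usec;

	return ((ticks + 1) * tick_usec);
}

/*
 * poll() takes milliseconds. Assumption: the caller truncates the
 * microsecond value, and 0 ms means "do not block at all".
 */
static uint64_t
model_poll_usec(uint64_t timeout_usec, uint64_t hz)
{
	uint64_t ms = timeout_usec / 1000;

	if (ms == 0)
		return (0);
	return (model_sleep_usec(ms * 1000, hz));
}
```

With hz = 1000, model_sleep_usec(1001, 1000) gives 3000 and
model_poll_usec(1001, 1000) gives 2000, which matches the rows marked
with arrows in the table.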

> For reference, below is the core of usleep and select/poll
> (from kern_time.c and sys_generic.c)
>
>      usleep:
> 	getnanouptime(now);
> 	end = now + timeout;
> 	for (;;) {
> 		getnanouptime(now);
> 		delta = end - now;
> 		if (delta <= 0)
> 			break;
> 		tsleep(..., tvtohz(delta));
> 	}
>
>      select/poll:
> 	itimerfix(timeout);	// force at least 1/HZ
> 	getmicrouptime(now);
> 	end = now + timeout;
> 	for (;;) {
> 		delta = end - now;
> 		seltdwait(..., tvtohz(delta));
> 		getmicrouptime(now);
> 		if (some_fd_is_ready() || now >= end)
> 			break;
> 	}
>

-- 
Alexander Motin