incorrect usleep/select delays with HZ > 2500

Mon Sep 7 22:09:01 UTC 2009

Hi, all--

On Sep 7, 2009, at 12:21 AM, Luigi Rizzo wrote:
>>> 3. ?CAUSE an error in tvtohz(), reported long ago in
>>> ? ? ? ?http://www.dragonflybsd.org/presentations/nanosleep/
>>> ? ? ? ?which causes a systematic error of an extra tick in the  
>>> computation
>>> ? ? ? ?of the sleep times.
>>> ? ?FIX: the above link also contains a proposed fix (which in fact
>>> ? ? ? ?reverts a bug introduced in some old commit on FreeBSD)
>>> ? ?PROBLEM: none that i see.
>>
>> This change, as-is, is extremely dangerous.  tsleep/msleep() use a
>> value of 0 meaning 'forever'.  Changing tvtohz() so that it can now
>> return 0 for a non-zero tv is setting land-mines all over the place.
>> There's something like 27 callers of tvtohz() in sys/kern alone, some
>> of which are used to supply tsleep/msleep() timeouts.  Note that the
>> dragonflybsd patch above only adds the 'returns 0' check to one  
>> single
>> caller.  You either need to patch all callers of tvtohz() since  
>> you've
>> change the semantics, or add a 'if (ticks == 0) ticks = 1' check (or
>> checks) in the appropriate places inside tvtohz().
>>
>> If you don't do it, then you end up with callers of random functions
>> with very small timeouts instead finding themselves sleeping forever.
>
> You are right, a proper fix for this third issue should be different
> (if we want a fix at all -- I'd be almost satisfied by just
> removing the drift).

A while ago, I took a fairly close look at timing accuracy (ie, trying  
to sleep until exactly the next second or whatever, and then comparing  
the actual time you were woken up against the target time), did my own  
fancy graphs showing sawtooth waves indicating timer aliasing  
problems, etc.  There's no question that the system is allowed to  
sleep longer than requested, and, if the system is busy doing other  
tasks, some amount of extra delay is expected and OK, but missing the  
target time by more than several ticks, especially if the system is  
idle, is bogus.

The Dragonfly patch to tvtohz() definitely improves timer accuracy,  
but it ends up computing a tick value which is too small by one  
sometimes (ie, the target time lands *within* that scheduler quantum,  
rather than at or before the start of that tick), causing negative  
offsets from the desired time (ie, you woke up just a bit too soon).   
That's also bogus.  :-)

 From my own testing of various platforms, as well as the results  
reported elsewhere, MacOS X has the most consistent timer accuracy,  
which is probably not too surprising given that soft realtime  
capabilities for audio/visual processing tasks such as iTunes,  
Quicktime, LogicStudio, etc are pretty important to the Mac userbase.   
I just compared the tvtohz() implementations in kern/kern_clock.c  
between FreeBSD and Darwin/XNU (specifically xnu-1228.15.4, the kernel  
for 10.5.8), and they are effectively identical.  However, that's  
probably moot because Darwin uses tvtoabstime() rather than tvtohz()  
for the callouts in the setitimer()/getitimer() implementation from  
kern/kern_time.c:

/* FreeBSD */
int
kern_setitimer(struct thread *td, u_int which, struct itimerval *aitv,  
struct itimerval *oitv)
{
[ ... ]
         if (which == ITIMER_REAL) {
                 PROC_LOCK(p);
                 if (timevalisset(&p->p_realtimer.it_value))
                         callout_stop(&p->p_itcallout);
                 getmicrouptime(&ctv);
                 if (timevalisset(&aitv->it_value)) {
                         callout_reset(&p->p_itcallout, tvtohz(&aitv- 
 >it_value),
                             realitexpire, p);
                         timevaladd(&aitv->it_value, &ctv);
                 }
                 *oitv = p->p_realtimer;
                 p->p_realtimer = *aitv;
                 PROC_UNLOCK(p);
                 if (timevalisset(&oitv->it_value)) {
                         if (timevalcmp(&oitv->it_value, &ctv, <))
                                 timevalclear(&oitv->it_value);
                         else
                                 timevalsub(&oitv->it_value, &ctv);
                 }
[ ... ]

/* Darwin */
int
setitimer(struct proc *p, struct setitimer_args *uap, register_t  
*retval)
{
[ ... ]
         switch (uap->which) {

         case ITIMER_REAL:
                 proc_spinlock(p);
                 if (timerisset(&aitv.it_value)) {
                         microuptime(&p->p_rtime);
                         timevaladd(&p->p_rtime, &aitv.it_value);
                         p->p_realtimer = aitv;
                         if (!thread_call_enter_delayed(p->p_rcall,  
tvtoabstime(&p->p_rtime)))   // [1]
                                 p->p_ractive++;
                 } else  {
                         timerclear(&p->p_rtime);
                         p->p_realtimer = aitv;
                         if (thread_call_cancel(p->p_rcall))
                                 p->p_ractive--;
                 }
                 proc_spinunlock(p);

                 break;
[ ... ]

> The simplest option is perhaps to compute a custom value for
> nanosleep, select and poll. This would remove the risk of side
> effects to other parts of the system, and we could also use the
> chance to compensate for the errors that arise when hz*tick !=
> 1000000 or when we know that hardclock does not run exactly every
> 'tick' (an integer) microseconds.

I think it would be relatively simple to adapt the DragonFly change  
but ensure that tztohz() returns at least 1, and that would be helpful  
improvement.

However, in the long run, a callout mechanism based upon scheduler  
ticks rather than actual time values suffers an inherent loss of  
accuracy (you're performing an integer division of time intervals by  
HZ and rounding up to use the API, after all) and makes assumptions  
that the scheduler's clock is always running at a constant frequency  
such that tick * HZ is exactly a million microseconds.

Regards,
-- 
-Chuck

[1]: See http://www.opensource.apple.com/source/xnu/xnu-1228.15.4/osfmk/kern/thread_call.c