incorrect usleep/select delays with HZ > 2500

Mon Sep 7 19:52:11 UTC 2009

On Mon, Sep 07, 2009 at 09:21:59AM +0200, Luigi Rizzo wrote:
> On Sun, Sep 06, 2009 at 05:36:29PM -0700, Peter Wemm wrote:
> > On Sun, Sep 6, 2009 at 8:51 AM, Luigi Rizzo<rizzo at iet.unipi.it> wrote:
> > > (this problem seems to affect both current and -stable,
> > > so let's see if here i have better luck)
> > >
> > > I just noticed [Note 1,2] that when setting HZ > 2500 (even if it is
> > > an exact divisor of the APIC/CPU clock) there is a significant
> > > drift between the delays generated by usleep()/select() and those
> > > computed by gettimeofday(). ?In other words, the error grows with
> > > the amount of delay requested.
> > >
> > > To show the problem, try this function
> > >
> > > ? ? ? ?int f(int wait_time) { ?// wait_time in usec
> > > ? ? ? ? ? ? ? ?struct timeval start, end;
> > > ? ? ? ? ? ? ? ?gettimeofday(&start);
> > > ? ? ? ? ? ? ? ?usleep(w); ? ? ?// or try select
> > > ? ? ? ? ? ? ? ?gettimeofday(&end)
> > > ? ? ? ? ? ? ? ?timersub(&end, &start, &x);
> > > ? ? ? ? ? ? ? ?return = x.tv_usec + 1000000*x.tv_sec - wait_time;
> > > ? ? ? ?}
> > >
> > > for various HZ (kern.hz=NNNN in /boot/loader.conf) and wait times.
> > > Ideally, we would expect the timings to be in error by something
> > > between 0 and 1 (or 2) ticks, irrespective of the value of wait_time.
> > > In fact, this is what you see with HZ=1000, 2000 and 2500.
> > > But larger values of HZ (e.g. 4000, 5000, 10k, 40k) create
> > > a drift of 0.5% and above (i.e. with HZ=5000, a 1-second delay
> > > lasts 1.0064s and a 10s delay lasts 10.062s; with HZ=10k the
> > > error becomes 1% and at HZ=40k the error becomes even bigger.
> > 
> > Technically, it isn't even an error because the sleeps are defined as
> > 'at least' the value specified.  If you're looking for real-time-OS
> > level performance, you probably need to look at one.
> 
> I know a non RTOS has limitations on the guarantees it can give.
> But with this in mind (e.g.  be happy with something that works as
> expected just 'most of the times'), on this specific issue FreeBSD
> could behave much better while still remaining technically correct.

  I agree.  The best is the enemy of the good; while it will never be
an RTOS, there is no reason not to make the behavior as good as
possible, especially where a small investment of effort is required.

> > > 3. ?CAUSE an error in tvtohz(), reported long ago in
> > > ? ? ? ?http://www.dragonflybsd.org/presentations/nanosleep/
> > > ? ? ? ?which causes a systematic error of an extra tick in the computation
> > > ? ? ? ?of the sleep times.
> > > ? ?FIX: the above link also contains a proposed fix (which in fact
> > > ? ? ? ?reverts a bug introduced in some old commit on FreeBSD)
> > > ? ?PROBLEM: none that i see.
> > 
> > This change, as-is, is extremely dangerous.  tsleep/msleep() use a
> > value of 0 meaning 'forever'.  Changing tvtohz() so that it can now
> > return 0 for a non-zero tv is setting land-mines all over the place.
> > There's something like 27 callers of tvtohz() in sys/kern alone, some
> > of which are used to supply tsleep/msleep() timeouts.  Note that the
> > dragonflybsd patch above only adds the 'returns 0' check to one single
> > caller.  You either need to patch all callers of tvtohz() since you've
> > change the semantics, or add a 'if (ticks == 0) ticks = 1' check (or
> > checks) in the appropriate places inside tvtohz().
> > 
> > If you don't do it, then you end up with callers of random functions
> > with very small timeouts instead finding themselves sleeping forever.
> 
> You are right, a proper fix for this third issue should be different
> (if we want a fix at all -- I'd be almost satisfied by just
> removing the drift).

  Simply returning max(1, ticks) from tvtohz, as suggested above, would
eliminate this concern and still improve the performance for a great
number of cases.

> The simplest option is perhaps to compute a custom value for
> nanosleep, select and poll. This would remove the risk of side
> effects to other parts of the system, and we could also use the
> chance to compensate for the errors that arise when hz*tick !=
> 1000000 or when we know that hardclock does not run exactly every
> 'tick' (an integer) microseconds.

  This would also be a good solution.  It would be perfectly possible
to implement it in two stages - first by adapting tvothz() as above, as
an immediate and hopefully non-controversial improvement, and second by
refactoring the latter proposed change so that the non-sleep users
(nanosleep etc.) call a separate routine which is allowed to return 0;
tvtohz() could then be replaced with the degenerate function { return
(ticks = trueticks(tv) ? ticks : 1; }

  All IMHO, of course.
  -- Clifton

-- 
    Clifton Royston  --  cliftonr at iandicomputing.com / cliftonr at lava.net
       President  - I and I Computing * http://www.iandicomputing.com/
 Custom programming, network design, systems and network consulting services