Timekeeping [Was: Re: cvs commit: src/usr.bin/vmstat vmstat.c src/usr.bin/w w.c]

Sat Oct 22 03:17:27 PDT 2005

On Fri, 21 Oct 2005, Poul-Henning Kamp wrote:

> In message <01DFB595-5279-4D3A-BEDA-5F0285E9519B at xcllnt.net>, Marcel Moolenaar
> writes:
>
>>> I think we need the definition to consider if (process- ?)state is
>>> retained while the system is unconscious or not.
>>
>> I'm not sure. I think that might be what makes the definition
>> complex.
>
> Actually I don't think it does, it simplifies it.

I agree.  Except for statistics programs, it is necessary to keep as much
history as practical; in particular, don't forget the original boot time,
and keep supporting averages since boot in vmstat and systat.

> If a process survives across the "unconscious" period, then it follows
> that CLOCK_MONOTONIC cannot be reset to zero in relation to the
> unconscious period.

What is survival?  Everything might be restarted virtually.

> But we are only just scratching the surface here, there are tons of
> ambiguities we need to resolve, for instance:
>
> 	select(...., {3m0s})
> 	suspend
> 	[ 2 minutes pass ]
> 	resume
>
> When does select time out ?
>
>    One minute after the resume ?
>
>    Three minutes after the resume ?
>
>    Right after the resume with a special errno ?

As close as possible to 3m0s after select() was called.
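
To make the arithmetic concrete, here is a minimal sketch in C.  It
assumes a clock, in seconds, that keeps counting across the suspend, and
it assumes that the suspend began 10 seconds after select() was called
(the example does not say when).  The names secs_t, deadline_for() and
remaining() are made up for illustration; this is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds on a hypothetical clock that keeps counting during suspend. */
typedef int64_t secs_t;

/* Absolute deadline for a timeout requested at time `called`. */
static secs_t
deadline_for(secs_t called, secs_t timeout)
{
	assert(timeout >= 0);
	return (called + timeout);
}

/* Time still to wait at `now`; 0 if the deadline has already passed. */
static secs_t
remaining(secs_t deadline, secs_t now)
{
	return (deadline > now ? deadline - now : 0);
}
```

With select() called at t=1000, deadline_for(1000, 180) is 1180.
Resuming at t=1130 (10 seconds awake plus 120 suspended) gives
remaining(1180, 1130) = 50, so the timeout fires 3m0s after the call
and not 1 or 3 minutes after the resume.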

There are many longstanding bugs in this area.  I remember the following:
- the stillborn non-option APM_FIXUP_CALLTODO attempts to fix some of
   them, by reducing all timeouts by the suspend time.  (It was stillborn
   because it is for the pre-callwheel implementation of timeouts but was
   committed after callwheel timeouts, so it never compiled in any committed
   version.  The uselessness of APM_FIXUP_CALLTODO was hidden by not making
   it a normal option.)

   The problem of wrong timeouts after suspend is very old.  Not fixing it
   avoids thundering herds of timeout expiries after suspend.

- nanosleep(), select() and poll() use getnanouptime(), getmicrouptime() and
   getmicrouptime(), respectively, to check not-so-carefully that the timeout
   has expired
   after they wake up (the wakeup is sometimes early or late due to minor
   inaccuracies; when it is early, we detect that not-so-carefully and go
   back to sleep; when it is late, we can't recover so we should request
   the timeout to always be a little early so that we can be as close to
   on time as possible).  These syscalls should use non-get*() versions
   and non-*uptime() versions so that they actually know if the timeout
   expired.  Using *uptime() doesn't work because it doesn't count suspend
   time.  Using non-*uptime() doesn't quite work either, since the system's
   best idea of the real time may jump backwards.  A monotonic clock that
   jumps forwards by the suspend time is needed.

- realitexpire() has the same bug as nanosleep() and friends.  The very
   name of this function shows that it should not be using *uptime().
   According to setitimer(2), "ITIMER_REAL decrements in real time".
   Using get*() in it is more justified than in nanosleep() since it is
   lower level so its efficiency may be important.
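
A rough sketch in C of the kind of clock and post-wakeup check meant
here.  Everything in it is hypothetical: struct sketch_clock and the
sketch_*() functions are invented for illustration and do not exist in
the kernel.  The idea is only that the clock is uptime plus the total
time spent suspended, so it never goes backwards and jumps forwards on
resume, and that the wakeup check is made against that clock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical clock state, in nanoseconds. */
struct sketch_clock {
	int64_t uptime_ns;	/* advances only while running, like *uptime() */
	int64_t suspended_ns;	/* total time spent suspended so far */
};

/* Monotonic time that includes suspend time. */
static int64_t
sketch_elapsed_ns(const struct sketch_clock *c)
{
	return (c->uptime_ns + c->suspended_ns);
}

/* On resume, jump forwards by the (non-negative) suspend time. */
static void
sketch_resume(struct sketch_clock *c, int64_t slept_ns)
{
	assert(slept_ns >= 0);
	c->suspended_ns += slept_ns;
}

/*
 * Check made after a wakeup: returns 0 if the deadline has been
 * reached, otherwise how long to go back to sleep for.
 */
static int64_t
sketch_recheck(const struct sketch_clock *c, int64_t deadline_ns)
{
	int64_t now = sketch_elapsed_ns(c);

	return (deadline_ns > now ? deadline_ns - now : 0);
}
```

A wakeup 0.1 seconds early makes sketch_recheck() return 0.1 seconds,
so the caller goes back to sleep.  After a 10 second suspend the same
check returns 0, whereas a comparison against uptime_ns alone would
still claim that 0.1 seconds remain.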

> Some code should obviously know about the suspend/resume event,
> dhclient, wep, wpa, bgpd, sshd, just to mention a few

Code like cron should get enough notification by having timeouts expire
as soon as possible after resume (if they would have expired during the
suspend interval had there been no suspend).  Such code can then check
the actual time on the correct clock, like nanosleep() and friends, to see
if a critical time has been reached.
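
For the userland side, something like the following would do.  This is
only a sketch and is not taken from cron: the function names and the
max_step cap are invented for illustration, and only clock_gettime()
and nanosleep() are real interfaces.  The point is that the sleep is
just a hint, and the decision is made by reading the real-time clock
after every wakeup.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* True once the wall clock has reached the critical time. */
static bool
critical_time_reached(const struct timespec *now, time_t critical)
{
	return (now->tv_sec >= critical);
}

/*
 * Sleep until the wall clock reaches `critical`, never sleeping more
 * than max_step seconds at a time.  Returns the number of times the
 * clock was checked.
 */
static int
wait_until_wallclock(time_t critical, time_t max_step)
{
	struct timespec now, step;
	int checks = 0;

	assert(max_step > 0);
	for (;;) {
		clock_gettime(CLOCK_REALTIME, &now);
		checks++;
		if (critical_time_reached(&now, critical))
			return (checks);
		/* Remaining time, to the nanosecond. */
		step.tv_sec = critical - now.tv_sec;
		step.tv_nsec = 0;
		if (now.tv_nsec > 0) {
			step.tv_sec--;
			step.tv_nsec = 1000000000L - now.tv_nsec;
		}
		if (step.tv_sec >= max_step) {
			step.tv_sec = max_step;
			step.tv_nsec = 0;
		}
		/* Early or late wakeups are harmless; the loop rechecks. */
		(void)nanosleep(&step, NULL);
	}
}
```

If the timeout expires promptly after a resume, the loop reads the
clock, sees that the critical time has passed, and returns without
sleeping again.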

Bruce