cvs commit: src/sys/amd64/amd64 cpu_switch.S machdep.c

Thu Oct 20 01:02:10 PDT 2005

On Tue, 18 Oct 2005, Poul-Henning Kamp wrote:

> [At the risk of repeating myself once more...]

> ...

> One of the things you have to realize is that once you go down this
> road you need a lot of code for all the conditionals.
>
> For instance you need to make sure that every new timestamp you
> hand out not prior to another one, no matter what is happening to
> the clocks.

Clocks are already incoherent in many ways:
- the times returned by the get*() functions are incoherent with the ones
   returned by the functions that read the hardware, because the latter
   are always in advance of the former and the difference is sometimes
   visible at the active resolution.  POSIX tests of file times have
   been reporting this incoherency since timecounters were implemented.
   The tests use time() to determine the current time and stat() to
   determine file times.  In the sequence:

         t1 = time(...);
         sleep(1);
         touch(file);
         stat(file);
         t2 = mtime(file);

   t2 should be > t1, but the bug lets t2 == t1 happen.

- times are incoherent between threads unless the threads use their
   own expensive locking to prevent this.  This is not very different
   from timestamps being incoherent between CPUs unless the system uses
   expensive locking to prevent it.
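
The time()/stat() sequence above can be written as a small userland check.
This is only a sketch under my own assumptions: the function name and the
use of open()+write() as a stand-in for touch are illustrative, not taken
from the POSIX test suite mentioned above.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: sample time(), sleep, modify a file, then read
 * the file's mtime back with stat().  Returns 0 and fills *t1 and *t2
 * on success, -1 if any syscall fails.  With coherent clocks *t2 > *t1,
 * since nominally a full second separates the two samples.
 */
int
mtime_after_sleep(const char *path, time_t *t1, time_t *t2)
{
	struct stat sb;
	int fd;

	*t1 = time(NULL);	/* current time as seen by time() */
	sleep(1);		/* nominally one second (may be cut short by a signal) */

	/* "touch": create or truncate the file and write to set its mtime. */
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd == -1)
		return (-1);
	if (write(fd, "x", 1) != 1) {
		close(fd);
		return (-1);
	}
	if (close(fd) == -1)
		return (-1);

	if (stat(path, &sb) == -1)
		return (-1);
	*t2 = sb.st_mtime;	/* file time as seen by stat() */
	return (0);
}
```

A caller would compare *t1 and *t2 after a successful return; t2 == t1 is
the incoherency described above.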

> ...

>>> It seems like reading ACPI-fast is "only" 3us or so, but when the ctx
>>> switch is otherwise 4us, it adds up. i8254 is much worse on this
>>> system (6.5us).
>
> i8254 is always bad, and about as bad as it can.

The i8254 is not that bad, and far from as bad as can be.

> Mostly because
> of the need to disable interrupts (Actually, that's a critical
> section today, isn't it ?) and also hobbled by the three 8 bit
> ISA-bus(-like) accesses needed.

Mostly not:
- disabling interrupts is not necessary; it was done mainly because it
   is most efficient except (apparently) on P4's.  It is only necessary
   to repeat the read if the conditions were changed underneath us by an
   interrupt.  Whether there was an interrupt can easily be determined
   by looking at the interrupt count.

   Disabling of interrupts is still always used, at least on i386's.  This
   is essential in the non-lapic case and good in the lapic case:
   - In the non-lapic case, the code hasn't changed significantly lately
     and still has an explicit hard-disablement.  There is a magic number
     of 20 i8254 cycles (spelled TIMER0_LATCH_COUNT in axed code) that
     gives a real-time requirement on the maximum time between the i8254
     timer read and the check for rollover.  Disabling interrupts is not
     sufficient to meet this requirement since bus activity may lengthen
     the time for the combined i/o to many more than 20 cycles (I've
     measured about 200 for similar code in getit()), but it mostly works.
     If interrupts were not hard-disabled, then almost any interrupt would
     break this requirement.
   - In the lapic case, there is now only a spin mutex on the clock lock.
     The lock is essential, and it gives a critical section which is almost
     as essential (since without the critical section a low priority
     thread reading the i8254 might be preempted while holding the
     lock).  Spin mutexes still hard-disable interrupts, so interrupts
     are still hard-disabled as a side effect.  Hard-disabling interrupts
     for spinlocks is a bug, but here it is good though not essential.
     It prevents fast interrupt handlers and low-level non-context-switching
     interrupt code from running.  There is no longer a requirement for
     completing the function in 20 i8254 cycles, but doing so is safest.

     The simplification in the lapic case has very little to do with
     interrupts, clock or otherwise.  The real-time requirement is now that
     i8254_get_timecount() be called significantly more often than the
     i8254 rolls over.  This is now easily satisfied by increasing the
     rollover period to ~55 msec and depending on users not configuring
     HZ to permitted values of <= 18 Hz.  Even HZ = 100 provides a safety
     margin.  This method could also be used for the non-lapic case,
     using either another source of periodic interrupts to keep calling
     i8254_get_timecount() significantly more often than every 1/HZ seconds,
     or by using another source for hardclock interrupts.  On i386's, the
     RTC would work perfectly for clock interrupts too except for minor
     problems in schedulers and maybe applications wanting timeouts of
     exactly 10 msec.

- only 1 or 2 accesses are needed:
   - 2 with only the LSB of the count used.  This requires HZ to be larger than about
     5000.  Large HZ are undesirable in general but are sometimes good for
     dumb hardware like the i8254.
   - 1 with unlatched reads.  I could never get this to work.
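
The rollover numbers above follow from the nominal i8254 input clock of
1193182 Hz.  The sketch below does that arithmetic and shows why the
counter must be sampled more often than it rolls over.  The struct and
function names are hypothetical, not the ones in the kernel, and the raw
value is assumed to have already been converted to an up-count (the
hardware counts down).

```c
#include <stdint.h>

#define I8254_FREQ	1193182u	/* nominal i8254 input clock, Hz */

/* Microseconds (truncated) for a free-running counter of 'bits' bits to wrap. */
uint32_t
rollover_usec(unsigned bits)
{
	return ((uint32_t)(((uint64_t)1 << bits) * 1000000u / I8254_FREQ));
}

/* Hypothetical accumulator that extends a wrapping count to 64 bits. */
struct wrap_acc {
	uint32_t mask;		/* (1 << bits) - 1 */
	uint32_t last;		/* previous raw reading */
	uint64_t total;		/* accumulated cycles */
};

/*
 * Add the cycles elapsed since the previous reading.  The masked
 * subtraction is right only if fewer than (mask + 1) cycles passed;
 * a whole missed rollover is silently lost.
 */
uint64_t
wrap_acc_update(struct wrap_acc *a, uint32_t raw)
{
	a->total += (raw - a->last) & a->mask;
	a->last = raw & a->mask;
	return (a->total);
}
```

rollover_usec(16) is about 55 msec (about 18 Hz) and rollover_usec(8) is
about 215 usec, i.e. a minimum of about 4661 Hz, which is where "larger
than about 5000" comes from once some margin is added.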

>>> > I wonder if moving to HZ=1000 on amd64 and i386 was really all that good
>>> > of an idea.
>
> The main benefit was getting more precise timeouts, something we have
> at various times thought about implementing with deadline counters
> on platforms that have it.  Nobody has done it though.

Dragonfly did it.

> So, instead of looking for "quick fixes", lets look at this with a
> designers or architects view:
>
> On a busy system the scheduler works hundred thousand times per
> second, but on most systems nobody ever looks at the times(2) data.

More like 1000 times a second.  Even stathz = 128 gives too many decisions
per second for the 4BSD scheduler, so it is divided down to 16 per second.
Processes blocking on i/o may cause many more than 128/sec calls to the
scheduler, but there should be nothing much to decide then.

> The smart solution is therefore to postpone the heavy stuff into
> times(2) and make the scheduler work as fast as it can.

Once more: schedulers haven't used anything related to times(2) since
the ancient version of 3BSD or 4BSD where times() was superseded by
getrusage(), and have never used timecounters.  (Even times(2) doesn't
use anything related to scheduling except to fake 4BSD scheduler clock
ticks in its API.)

> So the scheduler should read the TSC and schedule in TSC-ticks.

Schedulers never read the TSC.  They schedule in statclock ticks.

> times(2) will then have to convert this to clock_t compatible
> numbers.

It has converted from real times to clock_t's since before FreeBSD-1.
The real times happen to be implemented using timecounters and the
timecounter may be the TSC.  times() doesn't really care.  OTOH,
getrusage() reports process times in real times (with only some
resolution lost by converting MD times to bintimes and then bintimes
to timevals).
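
For reference, the bintime to timeval step loses resolution roughly as
follows.  This is a sketch with stand-in types so that it compiles in
userland; the struct and function names are illustrative and are not the
kernel's.

```c
#include <stdint.h>

/* Stand-ins for the kernel types; names chosen for illustration only. */
struct bt {
	int64_t  sec;
	uint64_t frac;		/* fraction of a second in units of 2^-64 s */
};

struct tv {
	int64_t sec;
	int32_t usec;
};

void
bt_to_tv(const struct bt *b, struct tv *t)
{
	t->sec = b->sec;
	/* Scale the top 32 bits of the fraction to microseconds, truncating. */
	t->usec = (int32_t)(((uint64_t)1000000 *
	    (uint32_t)(b->frac >> 32)) >> 32);
}
```

A fraction of exactly 1/128 second (frac = 2^57) comes out as 7812 usec
rather than 7812.5, so anything below a microsecond is dropped.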

> According the The Open Group, clock_t is in microseconds by means
> of historical standards mistakes.

clock_t in microseconds is required for historical mistakes in OS's
supported by The Open Group.  FreeBSD never had these particular
mistakes.  It has different ones, and has sysconf(_SC_CLK_TCK) fixed
at 128 to support them.  (Note that the units for clock_t are not the
same for all uses of clock_t, but for the historical times() mistake
they are 1/sysconf(_SC_CLK_TCK) seconds.  As an implementation detail,
FreeBSD uses 1/128 for all clock_t's even in cases where the historical
mistakes have less inertia.)
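
From the application side, the units for times() are recovered with
sysconf(_SC_CLK_TCK) and nothing else.  A minimal sketch; the helper name
is mine:

```c
#include <sys/times.h>
#include <unistd.h>

/*
 * Hypothetical helper: user CPU time of the calling process in seconds,
 * or -1.0 on error.  The clock_t values from times() are in units of
 * 1/sysconf(_SC_CLK_TCK) seconds, whatever that happens to be.
 */
double
user_cpu_seconds(void)
{
	struct tms t;
	long hz = sysconf(_SC_CLK_TCK);	/* ticks per second for times() */

	if (hz <= 0 || times(&t) == (clock_t)-1)
		return (-1.0);
	return ((double)t.tms_utime / (double)hz);
}
```

Code that hard-codes a divisor instead of asking sysconf() only works on
systems that share the same historical mistake.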

> However, I can see nowhere that would collide with an interpretation
> that said "clock_t is microseconds PROVIDED the cpu had run at full
> speed", so a simple one second routine to latch the highest number
> of TSC-tics we've seen in a second would be sufficient to generate
> the conversion factor.
>
> And in many ways this would be a much more useful metric to offer
> (in top(1)) than the current rubber-band-cpu-seconds.

You seem to have left out a "not" here.  Users mostly only care about
the real time taken by their processes.  If the conversion factor is
constant then it is possible for even users to apply it to convert from
the units displayed by top and friends to their favourite units, but
with variable conversion factors it would be difficult for even
applications to do the conversion.  Syscalls would have to return a
table giving their best idea of the conversion factors at different
times in the process's lifetime, and applications would have to
integrate over time to convert to a single number to display to the
user, according to user-specified weights.  Better yet, put the
integration in the kernel and use syscalls to tell the kernel the
weights ;-).

Anyway, getrusage() has fewer historical mistakes than times(), and
maintaining non-broken support for it requires using timecounters in
mi_switch() almost like we already do.  Hmm.  Checking the history
shows some anachronisms in what I said in the above.  It is only
necessary to go back as far as FreeBSD-1 to find a BSD where ticks are
used for getrusage() too.  In FreeBSD-1, there wasn't even an mi_switch().
Context switches went directly to MD code in swtch() and swtch() was
missing calls to microtime()/bintime() and many other expenses.  The
bogusness in times() and getrusage() was sort of reversed -- getrusage()
(actually hardclock()) converted from low-resolution tick counts to
high resolution timevals and times() just returned the tick counts;
now getrusage() only uses the tick counts for dividing up the total
time and times() converts from the high-res units back to low-res ones
and ends up with less accuracy than it started with due to double
rounding.
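
One way to see the cost of rounding twice: convert a high-resolution time
to 1/128 second ticks directly, and then again via an intermediate
truncation to microseconds.  This is an illustration only, using
nanoseconds as the high-res unit and made-up function names; it is not the
kernel's conversion code.

```c
#include <stdint.h>

#define TICK_HZ	128u	/* tick rate assumed for clock_t, as discussed above */

/* Nanoseconds to ticks in one step (truncating once).  ns * 128 is
 * assumed not to overflow 64 bits, which holds for several years. */
uint64_t
ns_to_ticks_direct(uint64_t ns)
{
	return (ns * TICK_HZ / 1000000000u);
}

/* Nanoseconds to ticks via microseconds (truncating twice). */
uint64_t
ns_to_ticks_via_usec(uint64_t ns)
{
	uint64_t usec = ns / 1000;	/* first truncation */

	return (usec * TICK_HZ / 1000000u);	/* second truncation */
}
```

For ns = 7812500 (exactly 1/128 second) the direct conversion gives 1 tick
and the two-step conversion gives 0, since 7812 usec is just short of a
tick.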

So the current pessimizations from timecounter calls in mi_switch()
are an end result of general pessimizations of swtch() starting in
4.4BSD.  I rather like this part of the pessimizations...

Bruce