cvs commit: src/sys/amd64/amd64 cpu_switch.S machdep.c

Tue Oct 18 10:32:24 PDT 2005

Scott Long wrote:
> Andrew Gallatin wrote:
>> Nice.  This reduces lmbench context switch latency by about 0.4us (7.2
>> -> 6.8us), and reduces TCP loopback latency by about 0.9us (36.1 ->
>> 35.2) on my dual core 3800+
>>
>> It is a shame we can't find a way to use the TSC as a timecounter on
>> SMP systems.  It seems that about 40% of the context switch time is
>> spent just waiting for the PIO read of the ACPI-fast or i8254 to
>> return.
> 
> The TSC represents the clock rate of the CPU, and thus can vary wildly
> when thermal and power management controls kick in, and there is no way
> to know when it changes.  Because of this, I think that it's
> practically useless on Pentium-Mobile and Pentium-M chips, among many
> others. 

This is a myth.  It is not as dismal as you portray.  cpufreq(4) gives
both the kernel and userland an MI way to get the necessary info
(including notification of clock rate changes) and to control the clock
rate when possible.  There are a number of mechanisms actually in the
world today:

* SMM-based clock switching: most laptops have SMM code (i.e. BIOS) that
checks the power line status on boot and sets the base clock rate.  They
use the standard platform mechanism (e.g. enh speedstep, speedstep-ich)
to set the frequency, and cpufreq(4) allows the user or kernel to freely
override it at runtime.  All that is left to do is for timecounters to
export a "re-calibrate" option that works at runtime and for cpufreq(4)
to call it when the frequency is changed by the kernel/usermode.  bde@
supplied some code that implements such a runtime calibration.  I hope
to import it soon, once I have it well tested, although for now it is
only used internally by cpufreq(4) and not hooked into timecounters.
Note that no BIOS I know of actually changes the clock rate after boot,
so TSC is reliable unless we change it ourselves.

* p4tcc:  thermal control circuit.  Version 1 does x/8 throttling of the 
CPU by an internal stop clock cycle, where "x" is an integer.  Version 2
can also step the clock rate via enh speedstep.  There are two parts to
this, the platform (BIOS) setting and "on demand" (kernel) setting.  The 
OS can use the on demand setting via cpufreq(4) to save power or for 
passive cooling.  We initiate this ourselves, so once the timecounter 
interface can accept an updated calibration, there is no issue here. 
The platform setting is worse in that we don't know when it kicks in.
However, it is intended as an emergency measure, such as when a fan
dies.  All known BIOSen set its trip point just below the thermal
shutdown threshold (i.e. where the processor stops operation
completely).  As such, this is an edge case that we do not have to
handle particularly efficiently.  It suffices to periodically check the
calibration of TSC (perhaps every 10 seconds?) against the ACPI timer
and update our settings if it has changed.  Since cpufreq(4) knows all
the possible settings, we only need to measure the clock rate and match
it against a table of valid settings.  There is no ambiguity (yet) since
every CPU control mechanism has discrete settings.
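
To make the periodic check concrete, here is a rough sketch of the
"measure and match against the table" step.  It is not code from
cpufreq(4) or from bde@'s patch; the function names and the settings
table are hypothetical, and it assumes the caller has already sampled
the TSC and the ACPI timer at the start and end of the same interval.
The only hard number is the nominal ACPI PM timer rate of 3.579545 MHz.

```c
#include <stddef.h>
#include <stdint.h>

/* Nominal ACPI PM timer frequency in Hz. */
#define ACPI_TIMER_HZ	3579545ULL

/*
 * Estimate the TSC rate in Hz from the number of TSC ticks and ACPI
 * timer ticks that elapsed over the same interval.  Returns 0 if no
 * ACPI ticks elapsed.  The multiplication fits in 64 bits for
 * intervals of a few seconds at multi-GHz rates; a real implementation
 * would need to guard against overflow for longer intervals.
 */
static uint64_t
estimate_tsc_hz(uint64_t tsc_delta, uint64_t acpi_delta)
{
	if (acpi_delta == 0)
		return (0);
	return (tsc_delta * ACPI_TIMER_HZ / acpi_delta);
}

/*
 * Return the index of the entry in settings[] closest to measured_hz,
 * or -1 if the table is empty.  Because the valid settings are
 * discrete and far apart, measurement noise should not change which
 * entry wins.
 */
static int
snap_to_setting(uint64_t measured_hz, const uint64_t *settings, size_t n)
{
	uint64_t err, best_err = UINT64_MAX;
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		err = measured_hz > settings[i] ?
		    measured_hz - settings[i] : settings[i] - measured_hz;
		if (err < best_err) {
			best_err = err;
			best = (int)i;
		}
	}
	return (best);
}
```

If the index returned differs from the setting we believe is current,
the platform has changed the clock behind our back and the timecounter
calibration would need to be updated to the table value.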

> There is also the issue of multiple CPUs having to keep their
> TSC's somewhat in sync in order to get consistent counting in the
> system.  The best that you can do is to periodically read a stable
> counter and try to recalibrate, but then you'll likely start getting
> wild operational variances. 

> It's a shame that a PIO read is still so
> expensive.  I'd hate to see just how bad your benchmark becomes when
> ACPI-slow is used instead of ACPI-fast.

ACPI-slow should not be used at all.  If the ACPI timer is unreliable,
use a different timecounter.  Also, I think most systems that had
unreliable ACPI timers were older and not likely to have variable CPU
clocks.  So I'd prefer TSC on such systems anyway.

> I wonder if moving to HZ=1000 on amd64 and i386 was really all that good
> of an idea.  Having preemption in the kernel means that ithreads can run
> right away instead of having to wait for a tick, and various fixes to
> 4BSD in the past year have eliminated bugs that would make the CPU wait
> for up to a tick to schedule a thread.  So all we're getting now is a
> 10x increase in scheduler overhead, including reading the timecounters.

I use hz=100 on my systems due to the 1 kHz noise from C3 sleep.
Windows has the same problem.

-- 
Nate