kern/108954: 'sleep(1)' sleeps >1 seconds when speedstep (Cx) is in economy mode

Fri Feb 9 05:10:26 UTC 2007

The following reply was made to PR kern/108954; it has been noted by GNATS.

From: Bruce Evans <bde at zeta.org.au>
To: Brad Huntting <huntting at hunkular.glarp.com>
Cc: FreeBSD-gnats-submit at freebsd.org, freebsd-bugs at freebsd.org
Subject: Re: kern/108954: 'sleep(1)' sleeps >1 seconds when speedstep (Cx)
 is in economy mode
Date: Fri, 9 Feb 2007 16:08:49 +1100 (EST)

 On Thu, 8 Feb 2007, Brad Huntting wrote:

 >> Description:
 > 	On some machines (those supporting Intel speedstep),
 > 	nanosleep(2) (and presumably select(2)) are confused by cpu
 > 	frequency changes and wind up over sleeping.

 Do they work without the lapic timer?  (Not configuring "device apic"
 is the only easy way to avoid using the lapic timer.  I forget if acpi
 can work without apic.)  On some systems, the lapic timer doesn't work
 at all because the CPU enters a deep sleep on the hlt instruction in
 the idle process, and one workaround is to run other timers at a higher
 frequency than the lapic timer frequency to kick the CPU out of its
 deep sleep and thus keep the lapic timer interrupting.

 >> How-To-Repeat:
 >
 > 		/bin/sh -c 't0=`date +%s`; sleep 1; t1=`date +%s`; expr $t1 - $t0'
 >
 > 	On a normal machine this should almost always spit out '1'.
 >
 > 	On a Centrino or Pentium-M based laptop (such as the Panasonic
 > 	CF-W4), with hw.acpi.cpu.cx_lowest set to something other
 > 	than C1, this produces '4' or '5'.
 >
 > 	Note:  If you can reproduce this, _please_ post a follow
 > 	up so I know I'm not insane.
 >
 > 	The problem seems to be that when 'sysctl hw.acpi.cpu.cx_lowest'
 > 	is set to anything other than 'full speed' (aka 'C1') the
 > 	cpu frequency is generally (and unpredictably) slower than
 > 	C1 speed.  tvtohz(9) (located in /sys/kern/kern_clock.c)
 > 	assumes a static frequency and so returns several times the
 > 	correct number of tics.

 The frequency used by tvtohz() is required to be fixed.  Since it is
 used mainly for timeouts, the frequency isn't required to be very
 accurate, but it should be accurate to within a few percent and not
 wrong by a factor of 5.
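
 To make the arithmetic concrete, here is a rough user-space sketch of a
 tvtohz()-style conversion.  It is a simplified model, not the kernel
 source: model_tvtohz() and model_elapsed_us() are made-up names, and the
 200 Hz "real" interrupt rate in the note below is a hypothetical value
 chosen only to match the reported factor of 5.

```c
#include <assert.h>

/*
 * Simplified model of a tvtohz()-style conversion (not the kernel code).
 * Rounds the interval up to whole ticks and adds 1, assuming the timer
 * interrupts at exactly `hz` per second.
 */
static long
model_tvtohz(long sec, long usec, int hz)
{
	long tick_us;

	assert(hz > 0 && hz <= 1000000 && sec >= 0 && usec >= 0);
	tick_us = 1000000 / hz;		/* nominal tick length in microseconds */
	return (sec * hz + (usec + tick_us - 1) / tick_us + 1);
}

/*
 * Wall-clock time (in microseconds) that `ticks` timer interrupts take
 * if the hardware actually interrupts at `real_hz` (hypothetical value).
 */
static long
model_elapsed_us(long ticks, int real_hz)
{
	assert(real_hz > 0);
	return (ticks * (1000000 / real_hz));
}
```

 With hz = 1000, a 1 second request becomes 1001 ticks.  If the timer
 really interrupts at 1000 Hz that is about 1.001 seconds, but if it
 only interrupts at 200 Hz the same 1001 ticks take about 5 seconds.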

 > 		$ sysctl hw.acpi.cpu dev.cpu.0.freq_levels kern.timecounter.choice kern.timecounter.hardware
 > 		hw.acpi.cpu.cx_supported: C1/1 C2/1 C3/85
 > 		hw.acpi.cpu.cx_lowest: C3
 > 		hw.acpi.cpu.cx_usage: 0.00% 13.11% 86.88%
 > 		dev.cpu.0.freq_levels: 1200/-1 1100/-1 1000/-1 900/-1 800/-1 700/-1 600/-1 525/-1 450/-1 375/-1 300/-1 225/-1 150/-1 75/-1
 > 		kern.timecounter.choice: TSC(800) ACPI-fast(1000) i8254(0) dummy(-1000000)
 > 		kern.timecounter.hardware: ACPI-fast

 The timecounter is not really involved here.  It is only used to check
 the time (not quite correctly) after the timeout.  That check would
 avoid the problem if the timeout were too short, but not when it is
 too long.

 >> Fix:
 >
 > 	The ideal solution would be to use a clock whose frequency
 > 	is not jerked around by speedstep.  Perhaps this is just a
 > 	hardware bug, but I seem to recall seeing this behavior on
 > 	my previous Intel Centrino based laptop as well.

 The i8254 timer (not timecounter) is supposed to have this property.
 Maybe the lapic timer doesn't.

 > 	Fixing nanosleep(2) (and select(2)) alone would be relatively
 > 	easy:  Since they loop, returning to the user only when the
 > 	correct wakeup time has arrived (microtime(9) is apparently
 > 	not affected by this problem), one could just have tvtohz(9)
 > 	return the number of ticks based on the _lowest_ cpu frequency
 > 	rather than the _highest_.  Unfortunately, this makes other
 > 	users of tvtohz(9) wake up early, and they may not be as
 > 	prepared to handle this.

 Yes, that should be OK as a workaround.  One of the things that
 nanosleep() etc. don't do quite right is related: for very long sleeps,
 the calculated timeout may be more than 1 tick too long due to clock
 drift or just the limited resolution of the scale factor used in
 tvtohz().  That should be handled by using the _lowest_ possible scale
 factor rather than the nominal one.  This could also be used to ensure
 that the final timeout is minimal (tvtohz() rounds up and then adds 1
 to ensure that the timeout is long enough, so an average timeout is
 1.5 ticks longer than strictly necessary; by not adding 1 but checking
 whether the timeout has expired on waking up, it is possible to make
 an average timeout only 0.5 ticks longer than necessary).
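
 The wake-up-and-recheck idea can be sketched with a small user-space
 simulation.  Everything here is hypothetical (struct sim_clock,
 sim_sleep_ticks(), sim_sleep_until() are not kernel interfaces): the
 caller converts the remaining time to ticks using the longest tick
 period it is willing to assume (i.e., the lowest frequency), does not
 add 1, and loops until the simulated clock has reached the deadline.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simulated clock; all times are in microseconds. */
struct sim_clock {
	int64_t now_us;		/* current simulated time */
	int64_t real_tick_us;	/* actual period between timer interrupts */
};

/* Sleep for n ticks: wake on the n-th tick boundary after `now`. */
static void
sim_sleep_ticks(struct sim_clock *c, int64_t n)
{
	assert(n >= 1 && c->real_tick_us > 0);
	c->now_us = (c->now_us / c->real_tick_us + n) * c->real_tick_us;
}

/*
 * Sleep for at least duration_us.  Ticks are computed from
 * assumed_tick_us (the longest tick period assumed possible), rounded
 * up but without the extra +1; the deadline is rechecked on every
 * wakeup.  Returns the overshoot and stores the wakeup count.
 */
static int64_t
sim_sleep_until(struct sim_clock *c, int64_t duration_us,
    int64_t assumed_tick_us, int *wakeups)
{
	int64_t deadline = c->now_us + duration_us;

	assert(assumed_tick_us > 0);
	*wakeups = 0;
	while (c->now_us < deadline) {
		int64_t remaining = deadline - c->now_us;
		int64_t n = (remaining + assumed_tick_us - 1) / assumed_tick_us;

		sim_sleep_ticks(c, n);	/* may wake early; loop rechecks */
		(*wakeups)++;
	}
	return (c->now_us - deadline);
}
```

 In this model the loop never returns early, and as long as the real
 tick period is no longer than the assumed one, the final overshoot is
 less than one real tick.  The cost is extra wakeups whenever the
 assumed period is pessimistic.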

 There should be a new interface for callers that are prepared to handle
 this (or they can subtract 1 and rescale).

 Waking up early also wastes time so it shouldn't usually be done.

 Bruce