[PATCH] Statclock aliasing by LAPIC

Tue Dec 1 15:30:15 UTC 2009

On Mon, 30 Nov 2009, John Baldwin wrote:

> On Friday 27 November 2009 6:42:50 pm Attilio Rao wrote:
>> Handling all the three clocks (hardclock, softclock, profclock) within
>> the LAPIC can lead to aliasing for the softclock and profclock because
>> hz is sized to fit mainly hardclock. The fashion to handle all of them
>> within the LAPIC was introduced in 2005 and before than the softclock
>> and profclock were supposed to be handled in the rtc. Right now, too,
>> there is the necessary support to handle profclock and statclock in
>> atrtc which gets enabled if the LAPIC signals it can't take in charge
>> the three clocks.
>> The proposed patch reverts the situation preferring the atrtc to
>> handle the statclock and profclock (then a different source from the
>> LAPIC) and then avoid the aliasing problem:

This would defeat most of the point of using the lapic timer.  RTC
interrupts are too slow to use for anything if there is an alternative
like the lapic timer.  i8254 interrupts are not so bad, and in fact
are just as efficient as lapic timer interrupts iff they are controlled
by the APIC and not by the ATPIC.  This is because RTC interrupts must
be acked and tested for in RTC registers, and the RTC is on the ISA
bus so accessing it is very slow, while the i8254 is programmed for
its interrupts to not need any acking or testing.  RTC and i8254
interrupts may also be be controlled by the ATPIC, and then the ATPIC
must be acked on the ISA bus too.  This gives the following number of
ISA bus accesses for most interrupts:

device          read    write
------          ----    -----
lapic_timer        0        0
i8254              0    0+0/1
RTC (current)      1    0+0/2
RTC (old)          3    1+0/2

Here "+0/1" and "+0/2" are for the ATPIC ack, if any.  RTC (old) is
before I optimized rtcin() to not write the index register in the usual
case where it has not changed (writing the index register takes 1 extra
write and uses 2 dummy reads in an attempt to satisfy timing requirements).
However, there is apparently broken (or just incompatible) hardware
that fails with this optimization.  There would probably be more reports
of this brokenness if using the RTC became the default again.

The 4-6 ISA accesses for RTC (old) take about 4-9 usec, so using the
RTC at stathz = 128 Hz takes only 0.05-0.12% of 1 CPU, which is
acceptable.  Using the RTC at profhz = 1024 Hz takes 0.4-0.9% of 1
CPU, which may also acceptable, but profhz = 1024 was too slow even
for a 386/20 in 1993; it should be 200-1000 times larger now, but the
RTC just can't support that, and one reason it was never increased is
that the RTC is too inefficient.  Profiling can now be implemented
better using the lapic timer, but using the lapic timer currently
implements profiling slightly worse than using the RTC.

> http://www.freebsd.org/~attilio/Sandvine/STABLE_8/statclock_aliasing/statclock_aliasing3.diff
>> 
>> In this patch, lapic_setup_clock() has been changed in order to return
>> a three-states variable which identified if the LAPIC got in charge
>> all the three clocks, just the hardclock or none of them (the current
>> situation is just none/all) and the rtc handling runs subsequently.
>> A tunable as been added to force LAPI to get in charge all the three
>> clocks if needed.
>> In ia32 atrtc compiling is linked to atpic compiling, so a compile
>> time flag has been added to check if atpic is not present and in case
>> force LAPIC to take in charge all the three clocks (which is still
>> better than the 'safe belt values' still present in the rtc code).

I don't like tunables, especially to switch from one bug to another.
This can be done better using sysctls only, since it is not needed
for booting.  The sysctls would need to be runnable at any time, but
reprogramming the lapic timer at any time is already needed to
fix profiling (cpu_start/stopprofclock() are missing support for
the lapic timer; instead, the default lapic_timer_hz is set excessively
large but not large enough for a good profhz).  sysctls also let you
test this stuff without rebooting.

>> Please note that statclock and profclock are widely used in our kernel
>> (rusage is, for example, statclock driven) and fixing this would
>> result in specific improvements (as a several-reported wrong CPU usage
>> statistic in top).
>> This bug has been found by Sandvine Incorporated.

What bug exactly?  Bugs like this must have been found before 1993,
since statclock() in 4.4BSD was supposed to fix them.  See "A Randomized
Sampling Clock for CPU Utilization Estimation and Code Profiling"
(ftp://ftp.ee.lbl.gov/papers/statclk-usenix93.ps.Z).  FreeBSD never
implemented the "Randomized" part, but its statclock() used to sort
of work, since by default stathz was > hz and was not nearly a multiple
of hz.  Someone broke the former by increasing the default hz to 1000.
This allowed malicious programs to easily hide themself from statclock()
while consuming a large fraction of CPU cycles (when stathz was > hz,
it was not so easy to hide, and very difficult to both hide and consume
significant CPU, since timeout granularity makes it hard to control
wakeups).  Then using the single lapic timer to generate all periodic
timer interrupts increased the synchonization of these interrupts,
thus moving further from a randomized statclock().  However, the
defaults with the lapic timer give an even larger beat frequency than
before, so I don't see how using the lapic timer can increase the
problem much.  (The beat frequency of (1000, 128) is 16000.  The beat
frequency of (1000, 133) is 133000.  The latter means that, with
defaults, statclock and hardclock() ticks are only perfectly synced
once in every 133 seconds.  Misconfiguring hz to a multiple of 128 can
give perfect synchronization, which may be a more of a problem, or a
fix -- see below).

>>
>> Reviews, comments and testing are welcome.

Review of the part of the patch visible in the mail:

.

> Presumably in the RTC case lapic_timer_hz should always be hz and not some
> multiple of hz.

Sure.  Except the allocation of the timers is backwards at best.  You
need profhz on the most efficient timer so that it can be very large
(other changes are required for large profhz to actually work).  You
want stathz on the next most efficient timer so that it can be larger
than hz (see above) (other changes are needed for a stathz much larger
than 128 to actually work.  Note that at least SCHED_4BSD wants a
scheduling clock frequency of much less than 128 -- it essentially
divides stathz by 8 to get this.  Scaling in calcru() is currently
broken after several hundred days of runtime, and would break sooner
with larger stathz).

Perhaps your recent changes (that removed the literal constant dividers)
made the synchronization problem worse.  But these changes make it
easy to implement any number of independent timers with optional
randomness using the lapic timer.  E.g., to randomize statclock(), just
add a small random value (+-) as well as stathz.  Note that statistics
utilities won't like this -- some like systat(1) use statistics ticks
as a timebase so they want statclock() to be perfectly periodic.

I don't worry about the synchronization or broken profiling, and use
lapic_timer_hz = profhz = stathz = hz = 100 whenever the lapic timer
is used.  I haven't noticed any problems caused by this (mostly using
SCHED_4BSD), except the unavoidable one that hz = 100 gives less
accurate usr/sys decomposition than does hz = 1000.  I have noticed
that this fixed the cosmetic problem that systat(1) shows glitches in
the lapic timer interrupt rates:  Although using the lapic timer for
all timer interrupts makes them all perfectly periodic, systat cannot
see this because stathz = 133 is too small a sampling rate and is not
an exact divisor of lapic_timer_hz -- it caused a glitch every
lapic_timer_hz/stathz seconds.  For other interrupts, we wouldn't
expect the rates to be constant, but we know that the lapic timer
interrupt rate is constant so we know that the oscillation of its
displayed value is a bug.  Right now on ref8-i386.freebsd.org, I see
the values not oscillating much but being weird: for cpu0-1, they are
near 1973; for cpu2-3, they are near 1981; for cpu4-5, they are near
2043, and for cpu6-7 they are near 2003.

A tickless kernel would need to at least consider running the scheduler
and statistics gathering on most context switches (unless it keeps using
ticks when not idle).  The scheduler parts of this would also fix
timer synchronization problems for !tickless kernels, but I don't see
how they can be as efficient as only considering scheduling at
infrequent tick intervals.

> Also, did you check to make sure all the lock is present?  I
> think at one point I changed the locking for the RTC and/or ISA timer to just
> use critical_enter/exit so that UP machines running an SMP kernel wouldn't pay
> the locking overhead since the code was only used on UP machines.  It may very
> well be that I only changed that in a development branch though and not in
> HEAD.

I don't remember any locking changes for RTC ever being committed.
rtcin() still uses mtx_lock_spin(&clock_lock).  clock_lock is the i8254
clock's lock, and is still abused for the RTC.  This abuse was convenient
when the RTC driver was implemented in the same file as the i8254
driver, but now the RTC driver is in its own file.  The i8254's private
variable `clock_lock' is even declared in the RTC driver's public
header, with other style bugs of course.

Bruce