[PATCH] Statclock aliasing by LAPIC

Bruce Evans brde at optusnet.com.au
Tue Jan 19 18:53:37 UTC 2010


On Tue, 19 Jan 2010, John Baldwin wrote:

> On Saturday 16 January 2010 7:09:38 am Attilio Rao wrote:
>>
>> Well, the primary things I wanted to fix is not the hiding of
>> malicious programs but the clock aliasing created when handling all
>> the clocks by the same source.

I probably misdiagnosed the aliasing in a previous reply -- after the
one being replied to here -- please reply to the latest version --:
the problem for malicious programs seems to be sort of the opposite
of the one fixed by using a separate hardware clock for the statclock.
It seems to be the near-aliasing of the separate statclock that gets
short-lived timeout processes accounted for at all (but not enough
if there are many such processes).  A non-separate statclock won't
see these processes excessively like I first thought, even when the
statclock() call immediately follows the hardclock() call, since
hardclock() doesn't start any new processes; thus a statclock() at
the same time as a hardclock() is the same as a statclock() 1/hz minus
epsilon after the previous hardclock(), which arranged to start a few
timeouts -- usually these timeouts will have finished by then.  A separate
statclock() is little better at seeing short-lived timeout processes,
since it has to sweep nearly uniformly over the entire interval
between hardclock() interrupts, so it cannot spend long nearly in
sync.  However, to fix the problem with malicious programs, except
for short-lived (short-active) ones started by a timeout which hopefully
don't matter because they are short-lived, statclock() just needs to
sweep not so uniformly over the entire interval, and this doesn't need
a separate statclock() -- interrupting at points randomly distributed
at distances of a large fraction of 1/hz should do.  This depends on
other system activity not being in sync with hardclock().
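
To illustrate that last point, here is a minimal userland sketch (not
kernel code; HZ, the fraction bounds and the function names are made up
for the example) of picking each statclock interval as a random large
fraction of 1/hz, so the samples drift relative to hardclock() instead
of staying in sync with it:

#include <stdio.h>
#include <stdlib.h>

#define HZ              1000    /* assumed hardclock frequency */
#define STAT_MIN_FRAC   0.5     /* shortest interval, as a fraction of 1/hz */
#define STAT_MAX_FRAC   1.0     /* longest interval, as a fraction of 1/hz */

/* Pick the next statclock delay, in microseconds. */
static unsigned int
next_stat_interval_us(void)
{
        double frac, tick_us;

        tick_us = 1000000.0 / HZ;
        frac = STAT_MIN_FRAC +
            (STAT_MAX_FRAC - STAT_MIN_FRAC) * ((double)rand() / RAND_MAX);
        return ((unsigned int)(frac * tick_us));
}

int
main(void)
{
        unsigned int t = 0;
        int i;

        srand(1);
        for (i = 0; i < 10; i++) {
                t += next_stat_interval_us();
                printf("statclock sample %d at %u us\n", i, t);
        }
        return (0);
}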

>> What I mean, then, is: I see your points, I'm not arguing with them at all,
>> but the old code has other problems that get fixed with this patch
>> (having different sources makes the whole system more flexible) while
>> the new things it does introduce are secondary (but still: I'm fine
>> with whatever second source is picked for statclock, profclock) if
>> you really see a concern wrt atrtc slowness.
>
> You can't use the i8254 reliably with APIC enabled.  Some motherboards don't
> actually hook up IRQ 0 to pin 2.  We used to support this by enabling IRQ 0 in
> the atpic and enabling the ExtINT pin to use both sets of PICs in tandem.
> However, this was very gross and had its own set of issues, so we removed the
> support for "mixed mode" a while ago.  Also, the ACPI specification
> specifically forbids an OS from using "mixed mode".

I thought that recent changes reenabled some of this.  And what's to stop
some motherboards breaking the RTC too?

> My feeling, btw, is that the real solution is to not use a sampling clock for
> per-process stats, but to just use the cycle counter and keep separate user,
> system, and interrupt cycle counts (like the rux_runtime we have now).

The total runtime info is already available (in rux_runtime).  It is
the main thing that we use to see that scheduling is broken :-) -- we
see that the runtime is too large or small relative to %CPU.  I think
using this and never using ticks for scheduling would work OK.  Schedulers
shouldn't care about the difference between user and sys time.  Something
like this is also needed for tickless kernels.

With schedulers still wanting ticks, perhaps the total runtime could
be distributed as fake ticks for schedulers only to see, so that if
the tick count is broken schedulers would still get feedback from the
runtime.  And/or processes started by a timeout could be charged a
fake tick so that they can't run for free.
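
A hedged sketch of what such fake ticks might look like (runtime_to_ticks,
STATHZ and the rates are invented for the example, not the kernel's names):

#include <stdint.h>
#include <stdio.h>

#define STATHZ  128     /* assumed stat clock frequency */

/*
 * Derive the tick count a scheduler sees from the precise runtime
 * (a cycle count), so a thread cannot run for free by dodging the
 * sampling clock.  Overflow is ignored in this sketch.
 */
static uint64_t
runtime_to_ticks(uint64_t runtime_cycles, uint64_t cycles_per_sec)
{
        /* One fake tick per 1/stathz of accumulated runtime. */
        return (runtime_cycles * STATHZ / cycles_per_sec);
}

int
main(void)
{
        uint64_t rate = 2000000000;     /* 2 GHz, for the example */
        uint64_t runtime = 3123456789;  /* ~1.56 s of CPU time */

        printf("scheduler sees %ju fake ticks\n",
            (uintmax_t)runtime_to_ticks(runtime, rate));
        return (0);
}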

Interrupt cycle counts are mostly already kept too, since most interrupt
handlers are heavyweight and take a full context switch to get to.

However, counting cycles to separate user from sys time would probably
be too inefficient.  A minimal syscall now should take about 200 cycles.
rdtsc on Athlon1 takes 12 cycles.  rdtsc on Core2 and Phenom takes 40+
cycles.  2 of these would be needed for every syscall.  They would not
be too inefficient only if they ran mostly in parallel.  They are
non-serializing, but if they actually ran mostly in parallel then they
might also be off by 40+ cycles/call.
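
For reference, a small sketch of the measurement being discussed: read
the TSC twice back to back and look at the delta.  x86 only, and since
rdtsc is non-serializing the number is only indicative:

#include <stdint.h>
#include <stdio.h>

static inline uint64_t
rdtsc(void)
{
        uint32_t lo, hi;

        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return (((uint64_t)hi << 32) | lo);
}

int
main(void)
{
        uint64_t t0, t1;

        t0 = rdtsc();
        t1 = rdtsc();
        printf("back-to-back rdtsc delta: %ju cycles\n",
            (uintmax_t)(t1 - t0));
        return (0);
}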

> This
> makes calcru() trivial and eliminates many of the weird "going backwards",
> etc. problems.  The only issue with this approach is that not all platforms
> have a cheap cycle counter (many embedded platforms lack one I think), so you
> would almost need to support both modes of operation and maybe have an #define
> in <machine/param.h> to choose between the two modes.

Not the only problem.  This also doesn't work for things like vm statistics
gathered in statclock().  You still need statclock() for these, and if you
want the statistics to be reasonably accurate then you need a sufficiently
non-aliased aliased and non-random random statclock().

> Even in that mode you still need a sampling clock I think for cp_time[] and
> cp_times[], but individual threads can no longer "hide" as we would be keeping
> precise timing stats.

Not so much a problem as the vm stats -- most time-related statistics
could be handled by adding up per-thread components, if we had them all.
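
A sketch of that summing, with invented structure and field names (the
real cp_time[] is just an array of longs, so this only shows the shape
of the idea):

#include <stdint.h>
#include <stdio.h>

struct thread_times {
        uint64_t user_cycles;
        uint64_t sys_cycles;
        uint64_t intr_cycles;
};

struct cpu_times {
        uint64_t user;
        uint64_t sys;
        uint64_t intr;
};

/* Build the per-cpu aggregates by adding up the per-thread components. */
static void
aggregate(const struct thread_times *td, int ntd, struct cpu_times *cp)
{
        int i;

        cp->user = cp->sys = cp->intr = 0;
        for (i = 0; i < ntd; i++) {
                cp->user += td[i].user_cycles;
                cp->sys += td[i].sys_cycles;
                cp->intr += td[i].intr_cycles;
        }
}

int
main(void)
{
        struct thread_times threads[2] = {
                { 1000, 200, 50 },
                { 4000, 800, 10 },
        };
        struct cpu_times cp;

        aggregate(threads, 2, &cp);
        printf("user %ju sys %ju intr %ju\n", (uintmax_t)cp.user,
            (uintmax_t)cp.sys, (uintmax_t)cp.intr);
        return (0);
}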

If we had fine-grained programmability of a single timer, then accounting
for threads started by a timeout would probably be best implemented for
almost perfect correctness and slowness as follows:
- statclock() interrupt a few usec after starting a timeout
- then periodic statclock() interrupts every few tens or hundreds of usec
   a few times
- then back to normal periodic statclock() interrupts, hopefully not so
   often
All statistics including tick counts are a weighted sum depending on
the current stathz (an integral over time, like now for the non-tick
count stats, except with the time deltas varying).  This would be slow,
but it seems to be the only way to correctly account for short-lived
processes started by a timeout -- in a limiting case, all system
activity would be run as timeouts and on fast machines finish in a few
usec.  Maintaining the total runtime, which should be enough for
scheduling, doesn't need this, but other statistics do.  Other system
activity probably doesn't need this, because it is probably started
by other interrupts that aren't in sync with hardclock() -- only
hardclock() combined with timeout/callout semantics gives a huge bias
towards starting processes at particular times.  Probably nothing needs
this, since we don't really care about other statistics.  Probably
completely tickless kernels can't support the other statistics.
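
As a rough illustration of the weighted sum, here is a sketch that
charges a thread by the (variable) interval between samples, using a
dense-then-sparse sample train like the one described above; all names
are invented for the example:

#include <stdint.h>
#include <stdio.h>

struct sample {
        uint64_t when_us;       /* time of the statclock interrupt */
        int running;            /* was the thread of interest on cpu? */
};

/* Accumulate weighted "tick" time for one thread over a sample train. */
static uint64_t
weighted_runtime_us(const struct sample *s, int n)
{
        uint64_t total = 0;
        int i;

        for (i = 1; i < n; i++)
                if (s[i].running)
                        total += s[i].when_us - s[i - 1].when_us;
        return (total);
}

int
main(void)
{
        /* Dense samples just after a timeout, then sparse ones. */
        struct sample s[] = {
                { 0, 0 }, { 5, 1 }, { 30, 1 }, { 120, 0 }, { 8000, 0 },
        };

        printf("charged %ju us\n",
            (uintmax_t)weighted_runtime_us(s, 5));
        return (0);
}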

Bruce

